+Description
+
+  DRBD is a shared-nothing, synchronously replicated block device. It
+  is designed to serve as a building block for high availability
+  clusters and in this context, is a "drop-in" replacement for shared
+  storage. Simplistically, you could see it as a network RAID 1.
+
+  Please visit http://www.drbd.org to find out more.
+
+The files included here are intended to help understand the implementation.
+
+DRBD-8.3-data-packets.svg, DRBD-data-packets.svg
+  relate some functions and write packets.
+
+conn-states-8.dot, disk-states-8.dot, node-states-8.dot
+  The subgraphs of DRBD's state transitions.
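The "network RAID 1" idea above can be sketched in a few lines of plain C. This is a toy model, not DRBD code: buffers stand in for the two lower-level block devices, and all names (`replica`, `mirrored_write`) are illustrative only. The point it demonstrates is the synchronous-replication contract: a write is reported as successful only after *both* the local and the peer replica have stored it.

```c
#include <assert.h>
#include <string.h>

/* Toy model of synchronous "network RAID 1" replication.
 * All identifiers are hypothetical; this is not DRBD's API. */

#define DISK_SIZE 64

struct replica {
	unsigned char data[DISK_SIZE];
	int online;	/* 0 simulates an unreachable/failed node */
};

/* returns 0 on success, -1 if this replica cannot take the write */
static int replica_write(struct replica *r, int off, const void *buf, int len)
{
	if (!r->online || off < 0 || off + len > DISK_SIZE)
		return -1;
	memcpy(r->data + off, buf, len);
	return 0;
}

/* synchronous mirror: success only when every replica succeeded */
static int mirrored_write(struct replica *local, struct replica *peer,
			  int off, const void *buf, int len)
{
	int ok_local = replica_write(local, off, buf, len);
	int ok_peer  = replica_write(peer, off, buf, len);

	/* note: like the real thing, the local write may already have
	 * happened even when the mirrored write as a whole fails */
	return (ok_local == 0 && ok_peer == 0) ? 0 : -1;
}
```

With both replicas online a write lands on both; once the peer goes away, the mirrored write reports failure, which is exactly the situation a cluster manager reacts to.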
 F:	drivers/scsi/dpt*
 F:	drivers/scsi/dpt/
 
+DRBD DRIVER
+P:	Philipp Reisner
+P:	Lars Ellenberg
+M:	drbd-dev@lists.linbit.com
+L:	drbd-user@lists.linbit.com
+W:	http://www.drbd.org
+T:	git git://git.drbd.org/linux-2.6-drbd.git drbd
+T:	git git://git.drbd.org/drbd-8.3.git
+S:	Supported
+F:	drivers/block/drbd/
+F:	lib/lru_cache.c
+F:	Documentation/blockdev/drbd/
+
 DRIVER CORE, KOBJECTS, AND SYSFS
 M:	Greg Kroah-Hartman <gregkh@suse.de>
 T:	quilt kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/
drivers/block/Kconfig
 	  instead, which can be configured to be on-disk compatible with the
 	  cryptoloop device.
 
+source "drivers/block/drbd/Kconfig"
+
 config BLK_DEV_NBD
 	tristate "Network block device support"
 	depends on NET
+#
+# DRBD device driver configuration
+#
+
+comment "DRBD disabled because PROC_FS, INET or CONNECTOR not selected"
+	depends on !PROC_FS || !INET || !CONNECTOR
+
+config BLK_DEV_DRBD
+	tristate "DRBD Distributed Replicated Block Device support"
+	depends on PROC_FS && INET && CONNECTOR
+	select LRU_CACHE
+	default n
+	help
+
+	  NOTE: In order to authenticate connections you have to select
+	  CRYPTO_HMAC and a hash function as well.
+
+	  DRBD is a shared-nothing, synchronously replicated block device. It
+	  is designed to serve as a building block for high availability
+	  clusters and in this context, is a "drop-in" replacement for shared
+	  storage. Simplistically, you could see it as a network RAID 1.
+
+	  Each minor device has a role, which can be 'primary' or 'secondary'.
+	  On the node with the primary device the application is supposed to
+	  run and to access the device (/dev/drbdX). Every write is sent to
+	  the local 'lower level block device' and, across the network, to the
+	  node with the device in 'secondary' state. The secondary device
+	  simply writes the data to its lower level block device.
+
+	  DRBD can also be used in dual-Primary mode (device writable on both
+	  nodes), which means it can exhibit shared disk semantics in a
+	  shared-nothing cluster. Needless to say, on top of dual-Primary
+	  DRBD, utilizing a cluster file system is necessary to maintain
+	  cache coherency.
+
+	  For automatic failover you need a cluster manager (e.g. heartbeat).
+	  See also: http://www.drbd.org/, http://www.linux-ha.org
+
+	  If unsure, say N.
+
+config DRBD_TRACE
+	tristate "DRBD tracing"
+	depends on BLK_DEV_DRBD
+	select TRACEPOINTS
+	default n
+	help
+
+	  Say Y here if you want to be able to trace various events in DRBD.
+
+	  If unsure, say N.
+
+config DRBD_FAULT_INJECTION
+	bool "DRBD fault injection"
+	depends on BLK_DEV_DRBD
+	help
+
+	  Say Y here if you want to simulate IO errors, in order to test DRBD's
+	  behavior.
+
+	  The actual simulation of IO errors is done by writing 3 values to
+	  /sys/module/drbd/parameters/
+
+	  enable_faults: bitmask of...
+	  1	meta data write
+	  2	meta data read
+	  4	resync data write
+	  8	resync data read
+	  16	data write
+	  32	data read
+	  64	read ahead
+	  128	kmalloc of bitmap
+	  256	allocation of EE (epoch_entries)
+
+	  fault_devs: bitmask of minor numbers
+	  fault_rate: frequency in percent
+
+	  Example: Simulate data write errors on /dev/drbd0 with a probability of 5%.
+		echo 16 > /sys/module/drbd/parameters/enable_faults
+		echo 1 > /sys/module/drbd/parameters/fault_devs
+		echo 5 > /sys/module/drbd/parameters/fault_rate
+
+	  If unsure, say N.
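The way the three fault-injection parameters combine can be sketched as below. This is a hedged reconstruction, not the in-tree code: the driver's actual check is the `FAULT_ACTIVE()` macro, whose internals are not shown in this patch, and `fault_should_fire` is a hypothetical helper. The sketch assumes that a `fault_devs` of 0 means "all devices", and it takes the percentage roll as a parameter instead of drawing a random number.

```c
#include <assert.h>

/* Fault types as bit positions, matching the bitmask values listed in
 * the Kconfig help text above (1 = 1<<0 meta data write, ... ,
 * 256 = 1<<8 allocation of epoch_entries). */
enum drbd_fault_type {
	DRBD_FAULT_MD_WR = 0,	/*   1: meta data write */
	DRBD_FAULT_MD_RD,	/*   2: meta data read */
	DRBD_FAULT_RS_WR,	/*   4: resync data write */
	DRBD_FAULT_RS_RD,	/*   8: resync data read */
	DRBD_FAULT_DT_WR,	/*  16: data write */
	DRBD_FAULT_DT_RD,	/*  32: data read */
	DRBD_FAULT_DT_RA,	/*  64: read ahead */
	DRBD_FAULT_BM_ALLOC,	/* 128: kmalloc of bitmap */
	DRBD_FAULT_AL_EE,	/* 256: allocation of EE (epoch_entries) */
};

/* Hypothetical helper: decide whether to inject a fault of 'type' on
 * device 'minor', given a uniform percentage roll in 0..99. */
static int fault_should_fire(unsigned int enable_faults,
			     unsigned int fault_devs,
			     unsigned int fault_rate,
			     enum drbd_fault_type type,
			     unsigned int minor,
			     unsigned int roll)
{
	/* fault type must be enabled */
	if (!(enable_faults & (1u << type)))
		return 0;
	/* device must be selected (assumption: 0 selects all devices) */
	if (fault_devs && !(fault_devs & (1u << minor)))
		return 0;
	/* fire with fault_rate percent probability */
	return roll < fault_rate;
}
```

Under this model, the example in the help text (`enable_faults=16`, `fault_devs=1`, `fault_rate=5`) injects data-write faults on minor 0 only, about 5% of the time.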
+/*
+   drbd_actlog.c
+
+   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
+
+   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
+   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
+   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.
+
+   drbd is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   drbd is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with drbd; see the file COPYING.  If not, write to
+   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ */
+
+#include <linux/slab.h>
+#include <linux/drbd.h>
+#include "drbd_int.h"
+#include "drbd_tracing.h"
+#include "drbd_wrappers.h"
+
+/* We maintain a trivial checksum in our on disk activity log.
+ * With that we can ensure correct operation even when the storage
+ * device might do a partial (last) sector write while losing power.
+ */
+struct __packed al_transaction {
+	u32 magic;
+	u32 tr_number;
+	struct __packed {
+		u32 pos;
+		u32 extent; } updates[1 + AL_EXTENTS_PT];
+	u32 xor_sum;
+};
+
+struct update_odbm_work {
+	struct drbd_work w;
+	unsigned int enr;
+};
+
+struct update_al_work {
+	struct drbd_work w;
+	struct lc_element *al_ext;
+	struct completion event;
+	unsigned int enr;
+	/* if old_enr != LC_FREE, write corresponding bitmap sector, too */
+	unsigned int old_enr;
+};
+
+struct drbd_atodb_wait {
+	atomic_t count;
+	struct completion io_done;
+	struct drbd_conf *mdev;
+	int error;
+};
+
+
+int w_al_write_transaction(struct drbd_conf *, struct drbd_work *, int);
+
+/* The actual tracepoint needs to have constant number of known arguments...
+ */
+void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...)
+{
+	va_list ap;
+
+	va_start(ap, fmt);
+	trace__drbd_resync(mdev, level, fmt, ap);
+	va_end(ap);
+}
+
+static int _drbd_md_sync_page_io(struct drbd_conf *mdev,
+				 struct drbd_backing_dev *bdev,
+				 struct page *page, sector_t sector,
+				 int rw, int size)
+{
+	struct bio *bio;
+	struct drbd_md_io md_io;
+	int ok;
+
+	md_io.mdev = mdev;
+	init_completion(&md_io.event);
+	md_io.error = 0;
+
+	if ((rw & WRITE) && !test_bit(MD_NO_BARRIER, &mdev->flags))
+		rw |= (1 << BIO_RW_BARRIER);
+	rw |= ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO));
+
+ retry:
+	bio = bio_alloc(GFP_NOIO, 1);
+	bio->bi_bdev = bdev->md_bdev;
+	bio->bi_sector = sector;
+	ok = (bio_add_page(bio, page, size, 0) == size);
+	if (!ok)
+		goto out;
+	bio->bi_private = &md_io;
+	bio->bi_end_io = drbd_md_io_complete;
+	bio->bi_rw = rw;
+
+	trace_drbd_bio(mdev, "Md", bio, 0, NULL);
+
+	if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD))
+		bio_endio(bio, -EIO);
+	else
+		submit_bio(rw, bio);
+	wait_for_completion(&md_io.event);
+	ok = bio_flagged(bio, BIO_UPTODATE) && md_io.error == 0;
+
+	/* check for unsupported barrier op.
+	 * would rather check on EOPNOTSUPP, but that is not reliable.
+	 * don't try again for ANY return value != 0 */
+	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER) && !ok)) {
+		/* Try again with no barrier */
+		dev_warn(DEV, "Barriers not supported on meta data device - disabling\n");
+		set_bit(MD_NO_BARRIER, &mdev->flags);
+		rw &= ~(1 << BIO_RW_BARRIER);
+		bio_put(bio);
+		goto retry;
+	}
+ out:
+	bio_put(bio);
+	return ok;
+}
+
+int drbd_md_sync_page_io(struct drbd_conf *mdev, struct drbd_backing_dev *bdev,
+			 sector_t sector, int rw)
+{
+	int logical_block_size, mask, ok;
+	int offset = 0;
+	struct page *iop = mdev->md_io_page;
+
+	D_ASSERT(mutex_is_locked(&mdev->md_io_mutex));
+
+	BUG_ON(!bdev->md_bdev);
+
+	logical_block_size = bdev_logical_block_size(bdev->md_bdev);
+	if (logical_block_size == 0)
+		logical_block_size = MD_SECTOR_SIZE;
+
+	/* in case logical_block_size != 512 [ s390 only? ] */
+	if (logical_block_size != MD_SECTOR_SIZE) {
+		mask = (logical_block_size / MD_SECTOR_SIZE) - 1;
+		D_ASSERT(mask == 1 || mask == 3 || mask == 7);
+		D_ASSERT(logical_block_size == (mask+1) * MD_SECTOR_SIZE);
+		offset = sector & mask;
+		sector = sector & ~mask;
+		iop = mdev->md_io_tmpp;
+
+		if (rw & WRITE) {
+			/* these are GFP_KERNEL pages, pre-allocated
+			 * on device initialization */
+			void *p = page_address(mdev->md_io_page);
+			void *hp = page_address(mdev->md_io_tmpp);
+
+			ok = _drbd_md_sync_page_io(mdev, bdev, iop, sector,
+					READ, logical_block_size);
+
+			if (unlikely(!ok)) {
+				dev_err(DEV, "drbd_md_sync_page_io(,%llus,"
+				    "READ [logical_block_size!=512]) failed!\n",
+				    (unsigned long long)sector);
+				return 0;
+			}
+
+			memcpy(hp + offset*MD_SECTOR_SIZE, p, MD_SECTOR_SIZE);
+		}
+	}
+
+	if (sector < drbd_md_first_sector(bdev) ||
+	    sector > drbd_md_last_sector(bdev))
+		dev_alert(DEV, "%s [%d]:%s(,%llus,%s) out of range md access!\n",
+		     current->comm, current->pid, __func__,
+		     (unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ");
+
+	ok = _drbd_md_sync_page_io(mdev, bdev, iop, sector, rw, logical_block_size);
+	if (unlikely(!ok)) {
+		dev_err(DEV, "drbd_md_sync_page_io(,%llus,%s) failed!\n",
+		    (unsigned long long)sector, (rw & WRITE) ? "WRITE" : "READ");
+		return 0;
+	}
+
+	if (logical_block_size != MD_SECTOR_SIZE && !(rw & WRITE)) {
+		void *p = page_address(mdev->md_io_page);
+		void *hp = page_address(mdev->md_io_tmpp);
+
+		memcpy(p, hp + offset*MD_SECTOR_SIZE, MD_SECTOR_SIZE);
+	}
+
+	return ok;
+}
+
+static struct lc_element *_al_get(struct drbd_conf *mdev, unsigned int enr)
+{
+	struct lc_element *al_ext;
+	struct lc_element *tmp;
+	unsigned long al_flags = 0;
+
+	spin_lock_irq(&mdev->al_lock);
+	tmp = lc_find(mdev->resync, enr/AL_EXT_PER_BM_SECT);
+	if (unlikely(tmp != NULL)) {
+		struct bm_extent *bm_ext = lc_entry(tmp, struct bm_extent, lce);
+		if (test_bit(BME_NO_WRITES, &bm_ext->flags)) {
+			spin_unlock_irq(&mdev->al_lock);
+			return NULL;
+		}
+	}
+	al_ext = lc_get(mdev->act_log, enr);
+	al_flags = mdev->act_log->flags;
+	spin_unlock_irq(&mdev->al_lock);
+
+	/*
+	if (!al_ext) {
+		if (al_flags & LC_STARVING)
+			dev_warn(DEV, "Have to wait for LRU element (AL too small?)\n");
+		if (al_flags & LC_DIRTY)
+			dev_warn(DEV, "Ongoing AL update (AL device too slow?)\n");
+	}
+	*/
+
+	return al_ext;
+}
+
+void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = (sector >> (AL_EXTENT_SHIFT-9));
+	struct lc_element *al_ext;
+	struct update_al_work al_work;
+
+	D_ASSERT(atomic_read(&mdev->local_cnt) > 0);
+
+	trace_drbd_actlog(mdev, sector, "al_begin_io");
+
+	wait_event(mdev->al_wait, (al_ext = _al_get(mdev, enr)));
+
+	if (al_ext->lc_number != enr) {
+		/* drbd_al_write_transaction(mdev,al_ext,enr);
+		 * recurses into generic_make_request(), which
+		 * disallows recursion, bios being serialized on the
+		 * current->bio_tail list now.
+		 * we have to delegate updates to the activity log
+		 * to the worker thread. */
+		init_completion(&al_work.event);
+		al_work.al_ext = al_ext;
+		al_work.enr = enr;
+		al_work.old_enr = al_ext->lc_number;
+		al_work.w.cb = w_al_write_transaction;
+		drbd_queue_work_front(&mdev->data.work, &al_work.w);
+		wait_for_completion(&al_work.event);
+
+		mdev->al_writ_cnt++;
+
+		spin_lock_irq(&mdev->al_lock);
+		lc_changed(mdev->act_log, al_ext);
+		spin_unlock_irq(&mdev->al_lock);
+		wake_up(&mdev->al_wait);
+	}
+}
+
+void drbd_al_complete_io(struct drbd_conf *mdev, sector_t sector)
+{
+	unsigned int enr = (sector >> (AL_EXTENT_SHIFT-9));
+	struct lc_element *extent;
+	unsigned long flags;
+
+	trace_drbd_actlog(mdev, sector, "al_complete_io");
+
+	spin_lock_irqsave(&mdev->al_lock, flags);
+
+	extent = lc_find(mdev->act_log, enr);
+
+	if (!extent) {
+		spin_unlock_irqrestore(&mdev->al_lock, flags);
+		dev_err(DEV, "al_complete_io() called on inactive extent %u\n", enr);
+		return;
+	}
+
+	if (lc_put(mdev->act_log, extent) == 0)
+		wake_up(&mdev->al_wait);
+
+	spin_unlock_irqrestore(&mdev->al_lock, flags);
+}
+
+int
+w_al_write_transaction(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct update_al_work *aw = container_of(w, struct update_al_work, w);
+	struct lc_element *updated = aw->al_ext;
+	const unsigned int new_enr = aw->enr;
+	const unsigned int evicted = aw->old_enr;
+	struct al_transaction *buffer;
+	sector_t sector;
+	int i, n, mx;
+	unsigned int extent_nr;
+	u32 xor_sum = 0;
+
+	if (!get_ldev(mdev)) {
+		dev_err(DEV, "get_ldev() failed in w_al_write_transaction\n");
+		complete(&((struct update_al_work *)w)->event);
+		return 1;
+	}
+	/* do we have to do a bitmap write, first?
+	 * TODO reduce maximum latency:
+	 * submit both bios, then wait for both,
+	 * instead of doing two synchronous sector writes. */
+	if (mdev->state.conn < C_CONNECTED && evicted != LC_FREE)
+		drbd_bm_write_sect(mdev, evicted/AL_EXT_PER_BM_SECT);
+
+	mutex_lock(&mdev->md_io_mutex); /* protects md_io_page, al_tr_cycle, ... */
+	buffer = (struct al_transaction *)page_address(mdev->md_io_page);
+
+	buffer->magic = __constant_cpu_to_be32(DRBD_MAGIC);
+	buffer->tr_number = cpu_to_be32(mdev->al_tr_number);
+
+	n = lc_index_of(mdev->act_log, updated);
+
+	buffer->updates[0].pos = cpu_to_be32(n);
+	buffer->updates[0].extent = cpu_to_be32(new_enr);
+
+	xor_sum ^= new_enr;
+
+	mx = min_t(int, AL_EXTENTS_PT,
+		   mdev->act_log->nr_elements - mdev->al_tr_cycle);
+	for (i = 0; i < mx; i++) {
+		unsigned idx = mdev->al_tr_cycle + i;
+		extent_nr = lc_element_by_index(mdev->act_log, idx)->lc_number;
+		buffer->updates[i+1].pos = cpu_to_be32(idx);
+		buffer->updates[i+1].extent = cpu_to_be32(extent_nr);
+		xor_sum ^= extent_nr;
+	}
+	for (; i < AL_EXTENTS_PT; i++) {
+		buffer->updates[i+1].pos = __constant_cpu_to_be32(-1);
+		buffer->updates[i+1].extent = __constant_cpu_to_be32(LC_FREE);
+		xor_sum ^= LC_FREE;
+	}
+	mdev->al_tr_cycle += AL_EXTENTS_PT;
+	if (mdev->al_tr_cycle >= mdev->act_log->nr_elements)
+		mdev->al_tr_cycle = 0;
+
+	buffer->xor_sum = cpu_to_be32(xor_sum);
+
+	sector = mdev->ldev->md.md_offset
+	      + mdev->ldev->md.al_offset + mdev->al_tr_pos;
+
+	if (!drbd_md_sync_page_io(mdev, mdev->ldev, sector, WRITE))
+		drbd_chk_io_error(mdev, 1, TRUE);
+
+	if (++mdev->al_tr_pos >
+	    div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT))
+		mdev->al_tr_pos = 0;
+
+	D_ASSERT(mdev->al_tr_pos < MD_AL_MAX_SIZE);
+	mdev->al_tr_number++;
+
+	mutex_unlock(&mdev->md_io_mutex);
+
+	complete(&((struct update_al_work *)w)->event);
+	put_ldev(mdev);
+
+	return 1;
+}
+
+/**
+ * drbd_al_read_tr() - Read a single transaction from the on disk activity log
+ * @mdev:	DRBD device.
+ * @bdev:	Block device to read from.
+ * @b:		pointer to an al_transaction.
+ * @index:	On disk slot of the transaction to read.
+ *
+ * Returns -1 on IO error, 0 on checksum error and 1 upon success.
+ */
+static int drbd_al_read_tr(struct drbd_conf *mdev,
+			   struct drbd_backing_dev *bdev,
+			   struct al_transaction *b,
+			   int index)
+{
+	sector_t sector;
+	int rv, i;
+	u32 xor_sum = 0;
+
+	sector = bdev->md.md_offset + bdev->md.al_offset + index;
+
+	/* Don't process error normally,
+	 * as this is done before disk is attached! */
+	if (!drbd_md_sync_page_io(mdev, bdev, sector, READ))
+		return -1;
+
+	rv = (be32_to_cpu(b->magic) == DRBD_MAGIC);
+
+	for (i = 0; i < AL_EXTENTS_PT + 1; i++)
+		xor_sum ^= be32_to_cpu(b->updates[i].extent);
+	rv &= (xor_sum == be32_to_cpu(b->xor_sum));
+
+	return rv;
+}
+
+/**
+ * drbd_al_read_log() - Restores the activity log from its on disk representation.
+ * @mdev:	DRBD device.
+ * @bdev:	Block device to read from.
+ *
+ * Returns 1 on success, returns 0 when reading the log failed due to IO errors.
+ */
+int drbd_al_read_log(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+	struct al_transaction *buffer;
+	int i;
+	int rv;
+	int mx;
+	int active_extents = 0;
+	int transactions = 0;
+	int found_valid = 0;
+	int from = 0;
+	int to = 0;
+	u32 from_tnr = 0;
+	u32 to_tnr = 0;
+	u32 cnr;
+
+	mx = div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT);
+
+	/* lock out all other meta data io for now,
+	 * and make sure the page is mapped.
+	 */
+	mutex_lock(&mdev->md_io_mutex);
+	buffer = page_address(mdev->md_io_page);
+
+	/* Find the valid transaction in the log */
+	for (i = 0; i <= mx; i++) {
+		rv = drbd_al_read_tr(mdev, bdev, buffer, i);
+		if (rv == 0)
+			continue;
+		if (rv == -1) {
+			mutex_unlock(&mdev->md_io_mutex);
+			return 0;
+		}
+		cnr = be32_to_cpu(buffer->tr_number);
+
+		if (++found_valid == 1) {
+			from = i;
+			to = i;
+			from_tnr = cnr;
+			to_tnr = cnr;
+			continue;
+		}
+		if ((int)cnr - (int)from_tnr < 0) {
+			D_ASSERT(from_tnr - cnr + i - from == mx+1);
+			from = i;
+			from_tnr = cnr;
+		}
+		if ((int)cnr - (int)to_tnr > 0) {
+			D_ASSERT(cnr - to_tnr == i - to);
+			to = i;
+			to_tnr = cnr;
+		}
+	}
+
+	if (!found_valid) {
+		dev_warn(DEV, "No usable activity log found.\n");
+		mutex_unlock(&mdev->md_io_mutex);
+		return 1;
+	}
+
+	/* Read the valid transactions.
+	 * dev_info(DEV, "Reading from %d to %d.\n",from,to); */
+	i = from;
+	while (1) {
+		int j, pos;
+		unsigned int extent_nr;
+		unsigned int trn;
+
+		rv = drbd_al_read_tr(mdev, bdev, buffer, i);
+		ERR_IF(rv == 0) goto cancel;
+		if (rv == -1) {
+			mutex_unlock(&mdev->md_io_mutex);
+			return 0;
+		}
+
+		trn = be32_to_cpu(buffer->tr_number);
+
+		spin_lock_irq(&mdev->al_lock);
+
+		/* This loop runs backwards because in the cyclic
+		   elements there might be an old version of the
+		   updated element (in slot 0). So the element in slot 0
+		   can overwrite old versions. */
+		for (j = AL_EXTENTS_PT; j >= 0; j--) {
+			pos = be32_to_cpu(buffer->updates[j].pos);
+			extent_nr = be32_to_cpu(buffer->updates[j].extent);
+
+			if (extent_nr == LC_FREE)
+				continue;
+
+			lc_set(mdev->act_log, extent_nr, pos);
+			active_extents++;
+		}
+		spin_unlock_irq(&mdev->al_lock);
+
+		transactions++;
+
+cancel:
+		if (i == to)
+			break;
+		i++;
+		if (i > mx)
+			i = 0;
+	}
+
+	mdev->al_tr_number = to_tnr+1;
+	mdev->al_tr_pos = to;
+	if (++mdev->al_tr_pos >
+	    div_ceil(mdev->act_log->nr_elements, AL_EXTENTS_PT))
+		mdev->al_tr_pos = 0;
+
+	/* ok, we are done with it */
+	mutex_unlock(&mdev->md_io_mutex);
+
+	dev_info(DEV, "Found %d transactions (%d active extents) in activity log.\n",
+	     transactions, active_extents);
+
+	return 1;
+}
+
+static void atodb_endio(struct bio *bio, int error)
+{
+	struct drbd_atodb_wait *wc = bio->bi_private;
+	struct drbd_conf *mdev = wc->mdev;
+	struct page *page;
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
+
+	/* strange behavior of some lower level drivers...
+	 * fail the request by clearing the uptodate flag,
+	 * but do not return any error?! */
+	if (!error && !uptodate)
+		error = -EIO;
+
+	drbd_chk_io_error(mdev, error, TRUE);
+	if (error && wc->error == 0)
+		wc->error = error;
+
+	if (atomic_dec_and_test(&wc->count))
+		complete(&wc->io_done);
+
+	page = bio->bi_io_vec[0].bv_page;
+	put_page(page);
+	bio_put(bio);
+	mdev->bm_writ_cnt++;
+	put_ldev(mdev);
+}
+
+#define S2W(s)	((s)<<(BM_EXT_SHIFT-BM_BLOCK_SHIFT-LN2_BPL))
+/* activity log to on disk bitmap -- prepare bio unless that sector
+ * is already covered by previously prepared bios */
+static int atodb_prepare_unless_covered(struct drbd_conf *mdev,
+					struct bio **bios,
+					unsigned int enr,
+					struct drbd_atodb_wait *wc) __must_hold(local)
+{
+	struct bio *bio;
+	struct page *page;
+	sector_t on_disk_sector = enr + mdev->ldev->md.md_offset
+				      + mdev->ldev->md.bm_offset;
+	unsigned int page_offset = PAGE_SIZE;
+	int offset;
+	int i = 0;
+	int err = -ENOMEM;
+
+	/* Check if that enr is already covered by an already created bio.
+	 * Caution, bios[] is not NULL terminated,
+	 * but only initialized to all NULL.
+	 * For completely scattered activity log,
+	 * the last invocation iterates over all bios,
+	 * and finds the last NULL entry.
+	 */
+	while ((bio = bios[i])) {
+		if (bio->bi_sector == on_disk_sector)
+			return 0;
+		i++;
+	}
+	/* bios[i] == NULL, the next not yet used slot */
+
+	/* GFP_KERNEL, we are not in the write-out path */
+	bio = bio_alloc(GFP_KERNEL, 1);
+	if (bio == NULL)
+		return -ENOMEM;
+
+	if (i > 0) {
+		const struct bio_vec *prev_bv = bios[i-1]->bi_io_vec;
+		page_offset = prev_bv->bv_offset + prev_bv->bv_len;
+		page = prev_bv->bv_page;
+	}
+	if (page_offset == PAGE_SIZE) {
+		page = alloc_page(__GFP_HIGHMEM);
+		if (page == NULL)
+			goto out_bio_put;
+		page_offset = 0;
+	} else {
+		get_page(page);
+	}
+
+	offset = S2W(enr);
+	drbd_bm_get_lel(mdev, offset,
+			min_t(size_t, S2W(1), drbd_bm_words(mdev) - offset),
+			kmap(page) + page_offset);
+	kunmap(page);
+
+	bio->bi_private = wc;
+	bio->bi_end_io = atodb_endio;
+	bio->bi_bdev = mdev->ldev->md_bdev;
+	bio->bi_sector = on_disk_sector;
+
+	if (bio_add_page(bio, page, MD_SECTOR_SIZE, page_offset) != MD_SECTOR_SIZE)
+		goto out_put_page;
+
+	atomic_inc(&wc->count);
+	/* we already know that we may do this...
+	 * get_ldev_if_state(mdev,D_ATTACHING);
+	 * just get the extra reference, so that the local_cnt reflects
+	 * the number of pending IO requests DRBD has at its backing device.
+	 */
+	atomic_inc(&mdev->local_cnt);
+
+	bios[i] = bio;
+
+	return 0;
+
+out_put_page:
+	err = -EINVAL;
+	put_page(page);
+out_bio_put:
+	bio_put(bio);
+	return err;
+}
+
+/**
+ * drbd_al_to_on_disk_bm() - Writes bitmap parts covered by active AL extents
+ * @mdev:	DRBD device.
+ *
+ * Called when we detach (unconfigure) local storage,
+ * or when we go from R_PRIMARY to R_SECONDARY role.
+ */
+void drbd_al_to_on_disk_bm(struct drbd_conf *mdev)
+{
+	int i, nr_elements;
+	unsigned int enr;
+	struct bio **bios;
+	struct drbd_atodb_wait wc;
+
+	ERR_IF (!get_ldev_if_state(mdev, D_ATTACHING))
+		return; /* sorry, I don't have any act_log etc... */
+
+	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+
+	nr_elements = mdev->act_log->nr_elements;
+
+	/* GFP_KERNEL, we are not in anyone's write-out path */
+	bios = kzalloc(sizeof(struct bio *) * nr_elements, GFP_KERNEL);
+	if (!bios)
+		goto submit_one_by_one;
+
+	atomic_set(&wc.count, 0);
+	init_completion(&wc.io_done);
+	wc.mdev = mdev;
+	wc.error = 0;
+
+	for (i = 0; i < nr_elements; i++) {
+		enr = lc_element_by_index(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		/* next statement also does atomic_inc wc.count and local_cnt */
+		if (atodb_prepare_unless_covered(mdev, bios,
+						enr/AL_EXT_PER_BM_SECT,
+						&wc))
+			goto free_bios_submit_one_by_one;
+	}
+
+	/* unnecessary optimization? */
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+
+	/* all prepared, submit them */
+	for (i = 0; i < nr_elements; i++) {
+		if (bios[i] == NULL)
+			break;
+		if (FAULT_ACTIVE(mdev, DRBD_FAULT_MD_WR)) {
+			bios[i]->bi_rw = WRITE;
+			bio_endio(bios[i], -EIO);
+		} else {
+			submit_bio(WRITE, bios[i]);
+		}
+	}
+
+	drbd_blk_run_queue(bdev_get_queue(mdev->ldev->md_bdev));
+
+	/* always (try to) flush bitmap to stable storage */
+	drbd_md_flush(mdev);
+
+	/* In case we did not submit a single IO do not wait for
+	 * them to complete. ( Because we would wait forever here. )
+	 *
+	 * In case we had IOs and they are already complete, there
+	 * is no point in waiting anyway.
+	 * Therefore this if () ... */
+	if (atomic_read(&wc.count))
+		wait_for_completion(&wc.io_done);
+
+	put_ldev(mdev);
+
+	kfree(bios);
+	return;
+
+ free_bios_submit_one_by_one:
+	/* free everything by calling the endio callback directly. */
+	for (i = 0; i < nr_elements && bios[i]; i++)
+		bio_endio(bios[i], 0);
+
+	kfree(bios);
+
+ submit_one_by_one:
+	dev_warn(DEV, "Using the slow drbd_al_to_on_disk_bm()\n");
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		enr = lc_element_by_index(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		/* Really slow: if we have al-extents 16..19 active,
+		 * sector 4 will be written four times! Synchronous! */
+		drbd_bm_write_sect(mdev, enr/AL_EXT_PER_BM_SECT);
+	}
+
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+	put_ldev(mdev);
+}
+
+/**
+ * drbd_al_apply_to_bm() - Sets the bitmap to dirty (1) where covered by active AL extents
+ * @mdev:	DRBD device.
+ */
+void drbd_al_apply_to_bm(struct drbd_conf *mdev)
+{
+	unsigned int enr;
+	unsigned long add = 0;
+	char ppb[10];
+	int i;
+
+	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		enr = lc_element_by_index(mdev->act_log, i)->lc_number;
+		if (enr == LC_FREE)
+			continue;
+		add += drbd_bm_ALe_set_all(mdev, enr);
+	}
+
+	lc_unlock(mdev->act_log);
+	wake_up(&mdev->al_wait);
+
+	dev_info(DEV, "Marked additional %s as out-of-sync based on AL.\n",
+	     ppsize(ppb, Bit2KB(add)));
+}
+
+static int _try_lc_del(struct drbd_conf *mdev, struct lc_element *al_ext)
+{
+	int rv;
+
+	spin_lock_irq(&mdev->al_lock);
+	rv = (al_ext->refcnt == 0);
+	if (likely(rv))
+		lc_del(mdev->act_log, al_ext);
+	spin_unlock_irq(&mdev->al_lock);
+
+	return rv;
+}
+
+/**
+ * drbd_al_shrink() - Removes all active extents from the activity log
+ * @mdev:	DRBD device.
+ *
+ * Removes all active extents from the activity log, waiting until
+ * the reference count of each entry dropped to 0 first, of course.
+ *
+ * You need to lock mdev->act_log with lc_try_lock() / lc_unlock()
+ */
+void drbd_al_shrink(struct drbd_conf *mdev)
+{
+	struct lc_element *al_ext;
+	int i;
+
+	D_ASSERT(test_bit(__LC_DIRTY, &mdev->act_log->flags));
+
+	for (i = 0; i < mdev->act_log->nr_elements; i++) {
+		al_ext = lc_element_by_index(mdev->act_log, i);
+		if (al_ext->lc_number == LC_FREE)
+			continue;
+		wait_event(mdev->al_wait, _try_lc_del(mdev, al_ext));
+	}
+
+	wake_up(&mdev->al_wait);
+}
+
+static int w_update_odbm(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct update_odbm_work *udw = container_of(w, struct update_odbm_work, w);
+
+	if (!get_ldev(mdev)) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_warn(DEV, "Can not update on disk bitmap, local IO disabled.\n");
+		kfree(udw);
+		return 1;
+	}
+
+	drbd_bm_write_sect(mdev, udw->enr);
+	put_ldev(mdev);
+
+	kfree(udw);
+
+	if (drbd_bm_total_weight(mdev) <= mdev->rs_failed) {
+		switch (mdev->state.conn) {
+		case C_SYNC_SOURCE:  case C_SYNC_TARGET:
+		case C_PAUSED_SYNC_S: case C_PAUSED_SYNC_T:
+			drbd_resync_finished(mdev);
+		default:
+			/* nothing to do */
+			break;
+		}
+	}
+	drbd_bcast_sync_progress(mdev);
+
+	return 1;
+}
+
+
+/* ATTENTION. The AL's extents are 4MB each, while the extents in the
+ * resync LRU-cache are 16MB each.
+ * The caller of this function has to hold a get_ldev() reference.
+ *
+ * TODO will be obsoleted once we have a caching lru of the on disk bitmap
+ */
+static void drbd_try_clear_on_disk_bm(struct drbd_conf *mdev, sector_t sector,
+				      int count, int success)
+{
+	struct lc_element *e;
+	struct update_odbm_work *udw;
+
+	unsigned int enr;
+
+	D_ASSERT(atomic_read(&mdev->local_cnt));
+
+	/* I simply assume that a sector/size pair never crosses
+	 * a 16 MB extent border. (Currently this is true...) */
+	enr = BM_SECT_TO_EXT(sector);
+
+	e = lc_get(mdev->resync, enr);
+	if (e) {
+		struct bm_extent *ext = lc_entry(e, struct bm_extent, lce);
+		if (ext->lce.lc_number == enr) {
+			if (success)
+				ext->rs_left -= count;
+			else
+				ext->rs_failed += count;
+			if (ext->rs_left < ext->rs_failed) {
+				dev_err(DEV, "BAD! sector=%llus enr=%u rs_left=%d "
+				    "rs_failed=%d count=%d\n",
+				     (unsigned long long)sector,
+				     ext->lce.lc_number, ext->rs_left,
+				     ext->rs_failed, count);
+				dump_stack();
+
+				lc_put(mdev->resync, &ext->lce);
+				drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+				return;
+			}
+		} else {
+			/* Normally this element should be in the cache,
+			 * since drbd_rs_begin_io() pulled it already in.
+			 *
+			 * But maybe an application write finished, and we set
+			 * something outside the resync lru_cache in sync.
+			 */
+			int rs_left = drbd_bm_e_weight(mdev, enr);
+			if (ext->flags != 0) {
+				dev_warn(DEV, "changing resync lce: %d[%u;%02lx]"
+				     " -> %d[%u;00]\n",
+				     ext->lce.lc_number, ext->rs_left,
+				     ext->flags, enr, rs_left);
+				ext->flags = 0;
+			}
+			if (ext->rs_failed) {
+				dev_warn(DEV, "Kicking resync_lru element enr=%u "
+				     "out with rs_failed=%d\n",
+				     ext->lce.lc_number, ext->rs_failed);
+				set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);
+			}
+			ext->rs_left = rs_left;
+			ext->rs_failed = success ? 0 : count;
+			lc_changed(mdev->resync, &ext->lce);
+		}
+		lc_put(mdev->resync, &ext->lce);
+		/* no race, we are within the al_lock! */
+
+		if (ext->rs_left == ext->rs_failed) {
+			ext->rs_failed = 0;
+
+			udw = kmalloc(sizeof(*udw), GFP_ATOMIC);
+			if (udw) {
+				udw->enr = ext->lce.lc_number;
+				udw->w.cb = w_update_odbm;
+				drbd_queue_work_front(&mdev->data.work, &udw->w);
+			} else {
+				dev_warn(DEV, "Could not kmalloc an udw\n");
+				set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);
+			}
+		}
+	} else {
+		dev_err(DEV, "lc_get() failed! locked=%d/%d flags=%lu\n",
+		    mdev->resync_locked,
+		    mdev->resync->nr_elements,
+		    mdev->resync->flags);
+	}
+}
+
+/* clear the bit corresponding to the piece of storage in question:
+ * size byte of data starting from sector.  Only clear bits of the affected
+ * one or more _aligned_ BM_BLOCK_SIZE blocks.
+ *
+ * called by worker on C_SYNC_TARGET and receiver on SyncSource.
+ *
+ */
+void __drbd_set_in_sync(struct drbd_conf *mdev, sector_t sector, int size,
+		       const char *file, const unsigned int line)
+{
+	/* Is called from worker and receiver context _only_ */
+	unsigned long sbnr, ebnr, lbnr;
+	unsigned long count = 0;
+	sector_t esector, nr_sectors;
+	int wake_up = 0;
+	unsigned long flags;
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "drbd_set_in_sync: sector=%llus size=%d nonsense!\n",
+				(unsigned long long)sector, size);
+		return;
+	}
+	nr_sectors = drbd_get_capacity(mdev->this_bdev);
+	esector = sector + (size >> 9) - 1;
+
+	ERR_IF(sector >= nr_sectors) return;
+	ERR_IF(esector >= nr_sectors) esector = (nr_sectors-1);
+
+	lbnr = BM_SECT_TO_BIT(nr_sectors-1);
+
+	/* we clear it (in sync).
+	 * round up start sector, round down end sector.  we make sure we only
+	 * clear full, aligned, BM_BLOCK_SIZE (4K) blocks */
+	if (unlikely(esector < BM_SECT_PER_BIT-1))
+		return;
+	if (unlikely(esector == (nr_sectors-1)))
+		ebnr = lbnr;
+	else
+		ebnr = BM_SECT_TO_BIT(esector - (BM_SECT_PER_BIT-1));
+	sbnr = BM_SECT_TO_BIT(sector + BM_SECT_PER_BIT-1);
+
+	trace_drbd_resync(mdev, TRACE_LVL_METRICS,
+			  "drbd_set_in_sync: sector=%llus size=%u sbnr=%lu ebnr=%lu\n",
+			  (unsigned long long)sector, size, sbnr, ebnr);
+
+	if (sbnr > ebnr)
+		return;
+
+	/*
+	 * ok, (capacity & 7) != 0 sometimes, but who cares...
+	 * we count rs_{total,left} in bits, not sectors.
+	 */
+	spin_lock_irqsave(&mdev->al_lock, flags);
+	count = drbd_bm_clear_bits(mdev, sbnr, ebnr);
+	if (count) {
+		/* we need the lock for drbd_try_clear_on_disk_bm */
+		if (jiffies - mdev->rs_mark_time > HZ*10) {
+			/* should be rolling marks,
+			 * but we estimate only anyway. */
+			if (mdev->rs_mark_left != drbd_bm_total_weight(mdev) &&
+			    mdev->state.conn != C_PAUSED_SYNC_T &&
+			    mdev->state.conn != C_PAUSED_SYNC_S) {
+				mdev->rs_mark_time = jiffies;
+				mdev->rs_mark_left = drbd_bm_total_weight(mdev);
+			}
+		}
+		if (get_ldev(mdev)) {
+			drbd_try_clear_on_disk_bm(mdev, sector, count, TRUE);
+			put_ldev(mdev);
+		}
+		/* just wake_up unconditional now, various lc_changed(),
+		 * lc_put() in drbd_try_clear_on_disk_bm().
*/10011001+ wake_up = 1;10021002+ }10031003+ spin_unlock_irqrestore(&mdev->al_lock, flags);10041004+ if (wake_up)10051005+ wake_up(&mdev->al_wait);10061006+}10071007+10081008+/*10091009+ * this is intended to set one request worth of data out of sync.10101010+ * affects at least 1 bit,10111011+ * and at most 1+DRBD_MAX_SEGMENT_SIZE/BM_BLOCK_SIZE bits.10121012+ *10131013+ * called by tl_clear and drbd_send_dblock (==drbd_make_request).10141014+ * so this can be _any_ process.10151015+ */10161016+void __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector, int size,10171017+ const char *file, const unsigned int line)10181018+{10191019+ unsigned long sbnr, ebnr, lbnr, flags;10201020+ sector_t esector, nr_sectors;10211021+ unsigned int enr, count;10221022+ struct lc_element *e;10231023+10241024+ if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {10251025+ dev_err(DEV, "sector: %llus, size: %d\n",10261026+ (unsigned long long)sector, size);10271027+ return;10281028+ }10291029+10301030+ if (!get_ldev(mdev))10311031+ return; /* no disk, no metadata, no bitmap to set bits in */10321032+10331033+ nr_sectors = drbd_get_capacity(mdev->this_bdev);10341034+ esector = sector + (size >> 9) - 1;10351035+10361036+ ERR_IF(sector >= nr_sectors)10371037+ goto out;10381038+ ERR_IF(esector >= nr_sectors)10391039+ esector = (nr_sectors-1);10401040+10411041+ lbnr = BM_SECT_TO_BIT(nr_sectors-1);10421042+10431043+ /* we set it out of sync,10441044+ * we do not need to round anything here */10451045+ sbnr = BM_SECT_TO_BIT(sector);10461046+ ebnr = BM_SECT_TO_BIT(esector);10471047+10481048+ trace_drbd_resync(mdev, TRACE_LVL_METRICS,10491049+ "drbd_set_out_of_sync: sector=%llus size=%u sbnr=%lu ebnr=%lu\n",10501050+ (unsigned long long)sector, size, sbnr, ebnr);10511051+10521052+ /* ok, (capacity & 7) != 0 sometimes, but who cares...10531053+ * we count rs_{total,left} in bits, not sectors. 
	 */
	spin_lock_irqsave(&mdev->al_lock, flags);
	count = drbd_bm_set_bits(mdev, sbnr, ebnr);

	enr = BM_SECT_TO_EXT(sector);
	e = lc_find(mdev->resync, enr);
	if (e)
		lc_entry(e, struct bm_extent, lce)->rs_left += count;
	spin_unlock_irqrestore(&mdev->al_lock, flags);

out:
	put_ldev(mdev);
}

static
struct bm_extent *_bme_get(struct drbd_conf *mdev, unsigned int enr)
{
	struct lc_element *e;
	struct bm_extent *bm_ext;
	int wakeup = 0;
	unsigned long rs_flags;

	spin_lock_irq(&mdev->al_lock);
	if (mdev->resync_locked > mdev->resync->nr_elements/2) {
		spin_unlock_irq(&mdev->al_lock);
		return NULL;
	}
	e = lc_get(mdev->resync, enr);
	bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
	if (bm_ext) {
		if (bm_ext->lce.lc_number != enr) {
			bm_ext->rs_left = drbd_bm_e_weight(mdev, enr);
			bm_ext->rs_failed = 0;
			lc_changed(mdev->resync, &bm_ext->lce);
			wakeup = 1;
		}
		if (bm_ext->lce.refcnt == 1)
			mdev->resync_locked++;
		set_bit(BME_NO_WRITES, &bm_ext->flags);
	}
	rs_flags = mdev->resync->flags;
	spin_unlock_irq(&mdev->al_lock);
	if (wakeup)
		wake_up(&mdev->al_wait);

	if (!bm_ext) {
		if (rs_flags & LC_STARVING)
			dev_warn(DEV, "Have to wait for element"
			     " (resync LRU too small?)\n");
		BUG_ON(rs_flags & LC_DIRTY);
	}

	return bm_ext;
}

static int _is_in_al(struct drbd_conf *mdev, unsigned int enr)
{
	struct lc_element *al_ext;
	int rv = 0;

	spin_lock_irq(&mdev->al_lock);
	if (unlikely(enr == mdev->act_log->new_number))
		rv = 1;
	else {
		al_ext = lc_find(mdev->act_log, enr);
		if (al_ext) {
			if (al_ext->refcnt)
				rv = 1;
		}
	}
	spin_unlock_irq(&mdev->al_lock);

	/*
	if (unlikely(rv)) {
		dev_info(DEV, "Delaying sync read until app's write is done\n");
	}
	*/
	return rv;
}

/**
 * drbd_rs_begin_io() - Gets an extent in the resync LRU cache and sets it to BME_LOCKED
 * @mdev:	DRBD device.
 * @sector:	The sector number.
 *
 * This function sleeps on al_wait. Returns 1 on success, 0 if interrupted.
 */
int drbd_rs_begin_io(struct drbd_conf *mdev, sector_t sector)
{
	unsigned int enr = BM_SECT_TO_EXT(sector);
	struct bm_extent *bm_ext;
	int i, sig;

	trace_drbd_resync(mdev, TRACE_LVL_ALL,
			  "drbd_rs_begin_io: sector=%llus (rs_end=%d)\n",
			  (unsigned long long)sector, enr);

	sig = wait_event_interruptible(mdev->al_wait,
			(bm_ext = _bme_get(mdev, enr)));
	if (sig)
		return 0;

	if (test_bit(BME_LOCKED, &bm_ext->flags))
		return 1;

	for (i = 0; i < AL_EXT_PER_BM_SECT; i++) {
		sig = wait_event_interruptible(mdev->al_wait,
				!_is_in_al(mdev, enr * AL_EXT_PER_BM_SECT + i));
		if (sig) {
			spin_lock_irq(&mdev->al_lock);
			if (lc_put(mdev->resync, &bm_ext->lce) == 0) {
				clear_bit(BME_NO_WRITES, &bm_ext->flags);
				mdev->resync_locked--;
				wake_up(&mdev->al_wait);
			}
			spin_unlock_irq(&mdev->al_lock);
			return 0;
		}
	}

	set_bit(BME_LOCKED, &bm_ext->flags);

	return 1;
}

/**
 * drbd_try_rs_begin_io() - Gets an extent in the resync LRU cache, does not sleep
 * @mdev:	DRBD device.
 * @sector:	The sector number.
 *
 * Gets an extent in the resync LRU cache, sets it to BME_NO_WRITES, then
 * tries to set it to BME_LOCKED. Returns 0 upon success, and -EAGAIN
 * if there is still application IO going on in this area.
 */
int drbd_try_rs_begin_io(struct drbd_conf *mdev, sector_t sector)
{
	unsigned int enr = BM_SECT_TO_EXT(sector);
	const unsigned int al_enr = enr*AL_EXT_PER_BM_SECT;
	struct lc_element *e;
	struct bm_extent *bm_ext;
	int i;

	trace_drbd_resync(mdev, TRACE_LVL_ALL, "drbd_try_rs_begin_io: sector=%llus\n",
			  (unsigned long long)sector);

	spin_lock_irq(&mdev->al_lock);
	if (mdev->resync_wenr != LC_FREE && mdev->resync_wenr != enr) {
		/* in case you have very heavy scattered io, it may
		 * stall the syncer undefined if we give up the ref count
		 * when we try again and requeue.
		 *
		 * if we don't give up the refcount, but the next time
		 * we are scheduled this extent has been "synced" by new
		 * application writes, we'd miss the lc_put on the
		 * extent we keep the refcount on.
		 * so we remembered which extent we had to try again, and
		 * if the next requested one is something else, we do
		 * the lc_put here...
		 * we also have to wake_up
		 */

		trace_drbd_resync(mdev, TRACE_LVL_ALL,
				  "dropping %u, apparently got 'synced' by application io\n",
				  mdev->resync_wenr);

		e = lc_find(mdev->resync, mdev->resync_wenr);
		bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
		if (bm_ext) {
			D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
			D_ASSERT(test_bit(BME_NO_WRITES, &bm_ext->flags));
			clear_bit(BME_NO_WRITES, &bm_ext->flags);
			mdev->resync_wenr = LC_FREE;
			if (lc_put(mdev->resync, &bm_ext->lce) == 0)
				mdev->resync_locked--;
			wake_up(&mdev->al_wait);
		} else {
			dev_alert(DEV, "LOGIC BUG\n");
		}
	}
	/* TRY. */
	e = lc_try_get(mdev->resync, enr);
	bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
	if (bm_ext) {
		if (test_bit(BME_LOCKED, &bm_ext->flags))
			goto proceed;
		if (!test_and_set_bit(BME_NO_WRITES, &bm_ext->flags)) {
			mdev->resync_locked++;
		} else {
			/* we did set the BME_NO_WRITES,
			 * but then could not set BME_LOCKED,
			 * so we tried again.
			 * drop the extra reference. */
			trace_drbd_resync(mdev, TRACE_LVL_ALL,
					  "dropping extra reference on %u\n", enr);

			bm_ext->lce.refcnt--;
			D_ASSERT(bm_ext->lce.refcnt > 0);
		}
		goto check_al;
	} else {
		/* do we rather want to try later? */
		if (mdev->resync_locked > mdev->resync->nr_elements-3) {
			trace_drbd_resync(mdev, TRACE_LVL_ALL,
					  "resync_locked = %u!\n", mdev->resync_locked);

			goto try_again;
		}
		/* Do or do not. There is no try. -- Yoda */
		e = lc_get(mdev->resync, enr);
		bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
		if (!bm_ext) {
			const unsigned long rs_flags = mdev->resync->flags;
			if (rs_flags & LC_STARVING)
				dev_warn(DEV, "Have to wait for element"
				     " (resync LRU too small?)\n");
			BUG_ON(rs_flags & LC_DIRTY);
			goto try_again;
		}
		if (bm_ext->lce.lc_number != enr) {
			bm_ext->rs_left = drbd_bm_e_weight(mdev, enr);
			bm_ext->rs_failed = 0;
			lc_changed(mdev->resync, &bm_ext->lce);
			wake_up(&mdev->al_wait);
			D_ASSERT(test_bit(BME_LOCKED, &bm_ext->flags) == 0);
		}
		set_bit(BME_NO_WRITES, &bm_ext->flags);
		D_ASSERT(bm_ext->lce.refcnt == 1);
		mdev->resync_locked++;
		goto check_al;
	}
check_al:
	trace_drbd_resync(mdev, TRACE_LVL_ALL, "checking al for %u\n", enr);

	for (i = 0; i < AL_EXT_PER_BM_SECT; i++) {
		if (unlikely(al_enr+i == mdev->act_log->new_number))
			goto try_again;
		if (lc_is_used(mdev->act_log, al_enr+i))
			goto try_again;
	}
	set_bit(BME_LOCKED, &bm_ext->flags);
proceed:
	mdev->resync_wenr = LC_FREE;
	spin_unlock_irq(&mdev->al_lock);
	return 0;

try_again:
	trace_drbd_resync(mdev, TRACE_LVL_ALL, "need to try again for %u\n", enr);
	if (bm_ext)
		mdev->resync_wenr = enr;
	spin_unlock_irq(&mdev->al_lock);
	return -EAGAIN;
}

void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector)
{
	unsigned int enr = BM_SECT_TO_EXT(sector);
	struct lc_element *e;
	struct bm_extent *bm_ext;
	unsigned long flags;

	trace_drbd_resync(mdev, TRACE_LVL_ALL,
			  "drbd_rs_complete_io: sector=%llus (rs_enr=%d)\n",
			  (long long)sector, enr);

	spin_lock_irqsave(&mdev->al_lock, flags);
	e = lc_find(mdev->resync, enr);
	bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
	if (!bm_ext) {
		spin_unlock_irqrestore(&mdev->al_lock, flags);
		if (__ratelimit(&drbd_ratelimit_state))
			dev_err(DEV, "drbd_rs_complete_io() called, but extent not found\n");
		return;
	}

	if (bm_ext->lce.refcnt == 0) {
		spin_unlock_irqrestore(&mdev->al_lock, flags);
		dev_err(DEV, "drbd_rs_complete_io(,%llu [=%u]) called, "
		    "but refcnt is 0!?\n",
		    (unsigned long long)sector, enr);
		return;
	}

	if (lc_put(mdev->resync, &bm_ext->lce) == 0) {
		clear_bit(BME_LOCKED, &bm_ext->flags);
		clear_bit(BME_NO_WRITES, &bm_ext->flags);
		mdev->resync_locked--;
		wake_up(&mdev->al_wait);
	}

	spin_unlock_irqrestore(&mdev->al_lock, flags);
}

/**
 * drbd_rs_cancel_all() - Removes all extents from the resync LRU (even BME_LOCKED)
 * @mdev:	DRBD device.
 */
void drbd_rs_cancel_all(struct drbd_conf *mdev)
{
	trace_drbd_resync(mdev, TRACE_LVL_METRICS, "drbd_rs_cancel_all\n");

	spin_lock_irq(&mdev->al_lock);

	if (get_ldev_if_state(mdev, D_FAILED)) { /* Makes sure ->resync is there.
	 */
		lc_reset(mdev->resync);
		put_ldev(mdev);
	}
	mdev->resync_locked = 0;
	mdev->resync_wenr = LC_FREE;
	spin_unlock_irq(&mdev->al_lock);
	wake_up(&mdev->al_wait);
}

/**
 * drbd_rs_del_all() - Gracefully remove all extents from the resync LRU
 * @mdev:	DRBD device.
 *
 * Returns 0 upon success, -EAGAIN if at least one reference count was
 * not zero.
 */
int drbd_rs_del_all(struct drbd_conf *mdev)
{
	struct lc_element *e;
	struct bm_extent *bm_ext;
	int i;

	trace_drbd_resync(mdev, TRACE_LVL_METRICS, "drbd_rs_del_all\n");

	spin_lock_irq(&mdev->al_lock);

	if (get_ldev_if_state(mdev, D_FAILED)) {
		/* ok, ->resync is there. */
		for (i = 0; i < mdev->resync->nr_elements; i++) {
			e = lc_element_by_index(mdev->resync, i);
			bm_ext = e ? lc_entry(e, struct bm_extent, lce) : NULL;
			if (bm_ext->lce.lc_number == LC_FREE)
				continue;
			if (bm_ext->lce.lc_number == mdev->resync_wenr) {
				dev_info(DEV, "dropping %u in drbd_rs_del_all, apparently"
				     " got 'synced' by application io\n",
				     mdev->resync_wenr);
				D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
				D_ASSERT(test_bit(BME_NO_WRITES, &bm_ext->flags));
				clear_bit(BME_NO_WRITES, &bm_ext->flags);
				mdev->resync_wenr = LC_FREE;
				lc_put(mdev->resync, &bm_ext->lce);
			}
			if (bm_ext->lce.refcnt != 0) {
				dev_info(DEV, "Retrying drbd_rs_del_all() later. "
				     "refcnt=%d\n", bm_ext->lce.refcnt);
				put_ldev(mdev);
				spin_unlock_irq(&mdev->al_lock);
				return -EAGAIN;
			}
			D_ASSERT(!test_bit(BME_LOCKED, &bm_ext->flags));
			D_ASSERT(!test_bit(BME_NO_WRITES, &bm_ext->flags));
			lc_del(mdev->resync, &bm_ext->lce);
		}
		D_ASSERT(mdev->resync->used == 0);
		put_ldev(mdev);
	}
	spin_unlock_irq(&mdev->al_lock);

	return 0;
}

/**
 * drbd_rs_failed_io() - Record information on a failure to resync the specified blocks
 * @mdev:	DRBD device.
 * @sector:	The sector number.
 * @size:	Size of failed IO operation, in bytes.
 */
void drbd_rs_failed_io(struct drbd_conf *mdev, sector_t sector, int size)
{
	/* Is called from worker and receiver context _only_ */
	unsigned long sbnr, ebnr, lbnr;
	unsigned long count;
	sector_t esector, nr_sectors;
	int wake_up = 0;

	trace_drbd_resync(mdev, TRACE_LVL_SUMMARY,
			  "drbd_rs_failed_io: sector=%llus, size=%u\n",
			  (unsigned long long)sector, size);

	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
		dev_err(DEV, "drbd_rs_failed_io: sector=%llus size=%d nonsense!\n",
				(unsigned long long)sector, size);
		return;
	}
	nr_sectors = drbd_get_capacity(mdev->this_bdev);
	esector = sector + (size >> 9) - 1;

	ERR_IF(sector >= nr_sectors) return;
	ERR_IF(esector >= nr_sectors) esector = (nr_sectors-1);

	lbnr = BM_SECT_TO_BIT(nr_sectors-1);

	/*
	 * round up start sector, round down end sector. we make sure we only
	 * handle full, aligned, BM_BLOCK_SIZE (4K) blocks */
	if (unlikely(esector < BM_SECT_PER_BIT-1))
		return;
	if (unlikely(esector == (nr_sectors-1)))
		ebnr = lbnr;
	else
		ebnr = BM_SECT_TO_BIT(esector - (BM_SECT_PER_BIT-1));
	sbnr = BM_SECT_TO_BIT(sector + BM_SECT_PER_BIT-1);

	if (sbnr > ebnr)
		return;

	/*
	 * ok, (capacity & 7) != 0 sometimes, but who cares...
	 * we count rs_{total,left} in bits, not sectors.
	 */
	spin_lock_irq(&mdev->al_lock);
	count = drbd_bm_count_bits(mdev, sbnr, ebnr);
	if (count) {
		mdev->rs_failed += count;

		if (get_ldev(mdev)) {
			drbd_try_clear_on_disk_bm(mdev, sector, count, FALSE);
			put_ldev(mdev);
		}

		/* just wake_up unconditionally now, various lc_changed(),
		 * lc_put() in drbd_try_clear_on_disk_bm(). */
		wake_up = 1;
	}
	spin_unlock_irq(&mdev->al_lock);
	if (wake_up)
		wake_up(&mdev->al_wait);
}
+1327
drivers/block/drbd/drbd_bitmap.c
···
/*
   drbd_bitmap.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2004-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 2004-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2004-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <linux/bitops.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/drbd.h>
#include <asm/kmap_types.h>
#include "drbd_int.h"

/* OPAQUE outside this file!
 * interface defined in drbd_int.h

 * convention:
 * function name drbd_bm_... => used elsewhere, "public".
 * function name      bm_... => internal to implementation, "private".

 * Note that since find_first_bit returns int, at the current granularity of
 * the bitmap (4KB per bit), this implementation "only" supports up to
 * 1<<(32+12) == 16 TB...
 */

/*
 * NOTE
 *  Access to the *bm_pages is protected by bm_lock.
 *  It is safe to read the other members within the lock.
 *
 *  drbd_bm_set_bits is called from bio_endio callbacks,
 *  We may be called with irq already disabled,
 *  so we need spin_lock_irqsave().
 *  And we need the kmap_atomic.
 */
struct drbd_bitmap {
	struct page **bm_pages;
	spinlock_t bm_lock;
	/* WARNING unsigned long bm_*:
	 * 32bit number of bit offset is just enough for 512 MB bitmap.
	 * it will blow up if we make the bitmap bigger...
	 * not that it makes much sense to have a bitmap that large,
	 * rather change the granularity to 16k or 64k or something.
	 * (that implies other problems, however...)
	 */
	unsigned long bm_set;       /* nr of set bits; THINK maybe atomic_t? */
	unsigned long bm_bits;
	size_t   bm_words;
	size_t   bm_number_of_pages;
	sector_t bm_dev_capacity;
	struct semaphore bm_change; /* serializes resize operations */

	atomic_t bm_async_io;
	wait_queue_head_t bm_io_wait;

	unsigned long  bm_flags;

	/* debugging aid, in case we are still racy somewhere */
	char          *bm_why;
	struct task_struct *bm_task;
};

/* definition of bits in bm_flags */
#define BM_LOCKED       0
#define BM_MD_IO_ERROR  1
#define BM_P_VMALLOCED  2

static int bm_is_locked(struct drbd_bitmap *b)
{
	return test_bit(BM_LOCKED, &b->bm_flags);
}

#define bm_print_lock_info(m) __bm_print_lock_info(m, __func__)
static void __bm_print_lock_info(struct drbd_conf *mdev, const char *func)
{
	struct drbd_bitmap *b = mdev->bitmap;
	if (!__ratelimit(&drbd_ratelimit_state))
		return;
	dev_err(DEV, "FIXME %s in %s, bitmap locked for '%s' by %s\n",
	    current == mdev->receiver.task ? "receiver" :
	    current == mdev->asender.task  ? "asender"  :
	    current == mdev->worker.task   ? "worker"   : current->comm,
	    func, b->bm_why ?: "?",
	    b->bm_task == mdev->receiver.task ? "receiver" :
	    b->bm_task == mdev->asender.task  ? "asender"  :
	    b->bm_task == mdev->worker.task   ? "worker"   : "?");
}

void drbd_bm_lock(struct drbd_conf *mdev, char *why)
{
	struct drbd_bitmap *b = mdev->bitmap;
	int trylock_failed;

	if (!b) {
		dev_err(DEV, "FIXME no bitmap in drbd_bm_lock!?\n");
		return;
	}

	trylock_failed = down_trylock(&b->bm_change);

	if (trylock_failed) {
		dev_warn(DEV, "%s going to '%s' but bitmap already locked for '%s' by %s\n",
		    current == mdev->receiver.task ? "receiver" :
		    current == mdev->asender.task  ? "asender"  :
		    current == mdev->worker.task   ? "worker"   : current->comm,
		    why, b->bm_why ?: "?",
		    b->bm_task == mdev->receiver.task ? "receiver" :
		    b->bm_task == mdev->asender.task  ? "asender"  :
		    b->bm_task == mdev->worker.task   ? "worker"   : "?");
		down(&b->bm_change);
	}
	if (__test_and_set_bit(BM_LOCKED, &b->bm_flags))
		dev_err(DEV, "FIXME bitmap already locked in bm_lock\n");

	b->bm_why  = why;
	b->bm_task = current;
}

void drbd_bm_unlock(struct drbd_conf *mdev)
{
	struct drbd_bitmap *b = mdev->bitmap;
	if (!b) {
		dev_err(DEV, "FIXME no bitmap in drbd_bm_unlock!?\n");
		return;
	}

	if (!__test_and_clear_bit(BM_LOCKED, &mdev->bitmap->bm_flags))
		dev_err(DEV, "FIXME bitmap not locked in bm_unlock\n");

	b->bm_why  = NULL;
	b->bm_task = NULL;
	up(&b->bm_change);
}

/* word offset to long pointer */
static unsigned long *__bm_map_paddr(struct drbd_bitmap *b, unsigned long offset, const enum km_type km)
{
	struct page *page;
	unsigned long page_nr;

	/* page_nr = (word*sizeof(long)) >> PAGE_SHIFT; */
	page_nr = offset >> (PAGE_SHIFT - LN2_BPL + 3);
	BUG_ON(page_nr >= b->bm_number_of_pages);
	page = b->bm_pages[page_nr];

	return (unsigned long *) kmap_atomic(page, km);
}

static unsigned long *bm_map_paddr(struct drbd_bitmap *b, unsigned long offset)
{
	return __bm_map_paddr(b, offset, KM_IRQ1);
}

static void __bm_unmap(unsigned long *p_addr, const enum km_type km)
{
	kunmap_atomic(p_addr, km);
};

static void bm_unmap(unsigned long *p_addr)
{
	return __bm_unmap(p_addr, KM_IRQ1);
}

/* long word offset of _bitmap_ sector */
#define S2W(s)	((s)<<(BM_EXT_SHIFT-BM_BLOCK_SHIFT-LN2_BPL))
/* word offset from start of bitmap to word number _in_page_
 * modulo longs per page
#define MLPP(X) ((X) % (PAGE_SIZE/sizeof(long))
 hm, well, Philipp thinks gcc might not optimize the % into & (... - 1)
 so do it explicitly:
 */
#define MLPP(X) ((X) & ((PAGE_SIZE/sizeof(long))-1))

/* Long words per page */
#define LWPP (PAGE_SIZE/sizeof(long))

/*
 * actually most functions herein should take a struct drbd_bitmap*, not a
 * struct drbd_conf*, but for the debug macros I like to have the mdev around
 * to be able to report device-specific messages.
 */

static void bm_free_pages(struct page **pages, unsigned long number)
{
	unsigned long i;
	if (!pages)
		return;

	for (i = 0; i < number; i++) {
		if (!pages[i]) {
			printk(KERN_ALERT "drbd: bm_free_pages tried to free "
					  "a NULL pointer; i=%lu n=%lu\n",
					  i, number);
			continue;
		}
		__free_page(pages[i]);
		pages[i] = NULL;
	}
}

static void bm_vk_free(void *ptr, int v)
{
	if (v)
		vfree(ptr);
	else
		kfree(ptr);
}

/*
 * "have" and "want" are NUMBER OF PAGES.
 */
static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want)
{
	struct page **old_pages = b->bm_pages;
	struct page **new_pages, *page;
	unsigned int i, bytes, vmalloced = 0;
	unsigned long have = b->bm_number_of_pages;

	BUG_ON(have == 0 && old_pages != NULL);
	BUG_ON(have != 0 && old_pages == NULL);

	if (have == want)
		return old_pages;

	/* Trying kmalloc first, falling back to vmalloc.
	 * GFP_KERNEL is ok, as this is done when a lower level disk is
	 * "attached" to the drbd.  Context is receiver thread or cqueue
	 * thread.  As we have no disk yet, we are not in the IO path,
	 * not even the IO path of the peer.
	 */
	bytes = sizeof(struct page *)*want;
	new_pages = kmalloc(bytes, GFP_KERNEL);
	if (!new_pages) {
		new_pages = vmalloc(bytes);
		if (!new_pages)
			return NULL;
		vmalloced = 1;
	}

	memset(new_pages, 0, bytes);
	if (want >= have) {
		for (i = 0; i < have; i++)
			new_pages[i] = old_pages[i];
		for (; i < want; i++) {
			page = alloc_page(GFP_HIGHUSER);
			if (!page) {
				bm_free_pages(new_pages + have, i - have);
				bm_vk_free(new_pages, vmalloced);
				return NULL;
			}
			new_pages[i] = page;
		}
	} else {
		for (i = 0; i < want; i++)
			new_pages[i] = old_pages[i];
		/* NOT HERE, we are outside the spinlock!
		bm_free_pages(old_pages + want, have - want);
		*/
	}

	if (vmalloced)
		set_bit(BM_P_VMALLOCED, &b->bm_flags);
	else
		clear_bit(BM_P_VMALLOCED, &b->bm_flags);

	return new_pages;
}

/*
 * called on driver init only. TODO call when a device is created.
 * allocates the drbd_bitmap, and stores it in mdev->bitmap.
 */
int drbd_bm_init(struct drbd_conf *mdev)
{
	struct drbd_bitmap *b = mdev->bitmap;
	WARN_ON(b != NULL);
	b = kzalloc(sizeof(struct drbd_bitmap), GFP_KERNEL);
	if (!b)
		return -ENOMEM;
	spin_lock_init(&b->bm_lock);
	init_MUTEX(&b->bm_change);
	init_waitqueue_head(&b->bm_io_wait);

	mdev->bitmap = b;

	return 0;
}

sector_t drbd_bm_capacity(struct drbd_conf *mdev)
{
	ERR_IF(!mdev->bitmap) return 0;
	return mdev->bitmap->bm_dev_capacity;
}

/* called on driver unload.
 * TODO: call when a device is destroyed.
 */
void drbd_bm_cleanup(struct drbd_conf *mdev)
{
	ERR_IF (!mdev->bitmap) return;
	bm_free_pages(mdev->bitmap->bm_pages, mdev->bitmap->bm_number_of_pages);
	bm_vk_free(mdev->bitmap->bm_pages, test_bit(BM_P_VMALLOCED, &mdev->bitmap->bm_flags));
	kfree(mdev->bitmap);
	mdev->bitmap = NULL;
}

/*
 * since (b->bm_bits % BITS_PER_LONG) != 0,
 * this masks out the remaining bits.
 * Returns the number of bits cleared.
 */
static int bm_clear_surplus(struct drbd_bitmap *b)
{
	const unsigned long mask = (1UL << (b->bm_bits & (BITS_PER_LONG-1))) - 1;
	size_t w = b->bm_bits >> LN2_BPL;
	int cleared = 0;
	unsigned long *p_addr, *bm;

	p_addr = bm_map_paddr(b, w);
	bm = p_addr + MLPP(w);
	if (w < b->bm_words) {
		cleared = hweight_long(*bm & ~mask);
		*bm &= mask;
		w++; bm++;
	}

	if (w < b->bm_words) {
		cleared += hweight_long(*bm);
		*bm = 0;
	}
	bm_unmap(p_addr);
	return cleared;
}

static void bm_set_surplus(struct drbd_bitmap *b)
{
	const unsigned long mask = (1UL << (b->bm_bits & (BITS_PER_LONG-1))) - 1;
	size_t w = b->bm_bits >> LN2_BPL;
	unsigned long *p_addr, *bm;

	p_addr = bm_map_paddr(b, w);
	bm = p_addr + MLPP(w);
	if (w < b->bm_words) {
		*bm |= ~mask;
		bm++; w++;
	}

	if (w < b->bm_words) {
		*bm = ~(0UL);
	}
	bm_unmap(p_addr);
}

static unsigned long __bm_count_bits(struct drbd_bitmap *b, const int swap_endian)
{
	unsigned long *p_addr, *bm, offset = 0;
	unsigned long bits = 0;
	unsigned long i, do_now;

	while (offset < b->bm_words) {
		i = do_now = min_t(size_t, b->bm_words-offset, LWPP);
		p_addr = __bm_map_paddr(b, offset, KM_USER0);
		bm = p_addr + MLPP(offset);
		while (i--) {
#ifndef __LITTLE_ENDIAN
			if (swap_endian)
				*bm = lel_to_cpu(*bm);
#endif
			bits += hweight_long(*bm++);
		}
		__bm_unmap(p_addr, KM_USER0);
		offset += do_now;
		cond_resched();
	}

	return bits;
}

static unsigned long bm_count_bits(struct drbd_bitmap *b)
{
	return __bm_count_bits(b, 0);
}

static unsigned long bm_count_bits_swap_endian(struct drbd_bitmap *b)
{
	return __bm_count_bits(b, 1);
}

/* offset and len in long words.*/
static void bm_memset(struct drbd_bitmap *b, size_t offset, int c, size_t len)
{
	unsigned long *p_addr, *bm;
	size_t do_now, end;

#define BM_SECTORS_PER_BIT (BM_BLOCK_SIZE/512)

	end = offset + len;

	if (end > b->bm_words) {
		printk(KERN_ALERT "drbd: bm_memset end > bm_words\n");
		return;
	}

	while (offset < end) {
		do_now = min_t(size_t, ALIGN(offset + 1, LWPP), end) - offset;
		p_addr = bm_map_paddr(b, offset);
		bm = p_addr + MLPP(offset);
		if (bm+do_now > p_addr + LWPP) {
			printk(KERN_ALERT "drbd: BUG BUG BUG! p_addr:%p bm:%p do_now:%d\n",
			       p_addr, bm, (int)do_now);
			break; /* breaks to after catch_oob_access_end() only! */
		}
		memset(bm, c, do_now * sizeof(long));
		bm_unmap(p_addr);
		offset += do_now;
	}
}

/*
 * make sure the bitmap has enough room for the attached storage,
 * if necessary, resize.
 * called whenever we may have changed the device size.
 * returns -ENOMEM if we could not allocate enough memory, 0 on success.
 * In case this is actually a resize, we copy the old bitmap into the new one.
 * Otherwise, the bitmap is initialized to all bits set.
 */
int drbd_bm_resize(struct drbd_conf *mdev, sector_t capacity)
{
	struct drbd_bitmap *b = mdev->bitmap;
	unsigned long bits, words, owords, obits, *p_addr, *bm;
	unsigned long want, have, onpages; /* number of pages */
	struct page **npages, **opages = NULL;
	int err = 0, growing;
	int opages_vmalloced;

	ERR_IF(!b) return -ENOMEM;

	drbd_bm_lock(mdev, "resize");

	dev_info(DEV, "drbd_bm_resize called with capacity == %llu\n",
			(unsigned long long)capacity);

	if (capacity == b->bm_dev_capacity)
		goto out;

	opages_vmalloced = test_bit(BM_P_VMALLOCED, &b->bm_flags);

	if (capacity == 0) {
		spin_lock_irq(&b->bm_lock);
		opages = b->bm_pages;
		onpages = b->bm_number_of_pages;
		owords = b->bm_words;
		b->bm_pages = NULL;
		b->bm_number_of_pages =
		b->bm_set   =
		b->bm_bits  =
		b->bm_words =
		b->bm_dev_capacity = 0;
		spin_unlock_irq(&b->bm_lock);
		bm_free_pages(opages, onpages);
		bm_vk_free(opages, opages_vmalloced);
		goto out;
	}
	bits  = BM_SECT_TO_BIT(ALIGN(capacity, BM_SECT_PER_BIT));

	/* if we would use
	   words = ALIGN(bits,BITS_PER_LONG) >> LN2_BPL;
	   a 32bit host could present the wrong number of words
	   to a 64bit host.
	*/
	words = ALIGN(bits, 64) >> LN2_BPL;

if (get_ldev(mdev)) {490490+ D_ASSERT((u64)bits <= (((u64)mdev->ldev->md.md_size_sect-MD_BM_OFFSET) << 12));491491+ put_ldev(mdev);492492+ }493493+494494+ /* one extra long to catch off by one errors */495495+ want = ALIGN((words+1)*sizeof(long), PAGE_SIZE) >> PAGE_SHIFT;496496+ have = b->bm_number_of_pages;497497+ if (want == have) {498498+ D_ASSERT(b->bm_pages != NULL);499499+ npages = b->bm_pages;500500+ } else {501501+ if (FAULT_ACTIVE(mdev, DRBD_FAULT_BM_ALLOC))502502+ npages = NULL;503503+ else504504+ npages = bm_realloc_pages(b, want);505505+ }506506+507507+ if (!npages) {508508+ err = -ENOMEM;509509+ goto out;510510+ }511511+512512+ spin_lock_irq(&b->bm_lock);513513+ opages = b->bm_pages;514514+ owords = b->bm_words;515515+ obits = b->bm_bits;516516+517517+ growing = bits > obits;518518+ if (opages)519519+ bm_set_surplus(b);520520+521521+ b->bm_pages = npages;522522+ b->bm_number_of_pages = want;523523+ b->bm_bits = bits;524524+ b->bm_words = words;525525+ b->bm_dev_capacity = capacity;526526+527527+ if (growing) {528528+ bm_memset(b, owords, 0xff, words-owords);529529+ b->bm_set += bits - obits;530530+ }531531+532532+ if (want < have) {533533+ /* implicit: (opages != NULL) && (opages != npages) */534534+ bm_free_pages(opages + want, have - want);535535+ }536536+537537+ p_addr = bm_map_paddr(b, words);538538+ bm = p_addr + MLPP(words);539539+ *bm = DRBD_MAGIC;540540+ bm_unmap(p_addr);541541+542542+ (void)bm_clear_surplus(b);543543+544544+ spin_unlock_irq(&b->bm_lock);545545+ if (opages != npages)546546+ bm_vk_free(opages, opages_vmalloced);547547+ if (!growing)548548+ b->bm_set = bm_count_bits(b);549549+ dev_info(DEV, "resync bitmap: bits=%lu words=%lu\n", bits, words);550550+551551+ out:552552+ drbd_bm_unlock(mdev);553553+ return err;554554+}555555+556556+/* inherently racy:557557+ * if not protected by other means, return value may be out of date when558558+ * leaving this function...559559+ * we still need to lock it, since it is important that this 
returns560560+ * bm_set == 0 precisely.561561+ *562562+ * maybe bm_set should be atomic_t ?563563+ */564564+static unsigned long _drbd_bm_total_weight(struct drbd_conf *mdev)565565+{566566+ struct drbd_bitmap *b = mdev->bitmap;567567+ unsigned long s;568568+ unsigned long flags;569569+570570+ ERR_IF(!b) return 0;571571+ ERR_IF(!b->bm_pages) return 0;572572+573573+ spin_lock_irqsave(&b->bm_lock, flags);574574+ s = b->bm_set;575575+ spin_unlock_irqrestore(&b->bm_lock, flags);576576+577577+ return s;578578+}579579+580580+unsigned long drbd_bm_total_weight(struct drbd_conf *mdev)581581+{582582+ unsigned long s;583583+ /* if I don't have a disk, I don't know about out-of-sync status */584584+ if (!get_ldev_if_state(mdev, D_NEGOTIATING))585585+ return 0;586586+ s = _drbd_bm_total_weight(mdev);587587+ put_ldev(mdev);588588+ return s;589589+}590590+591591+size_t drbd_bm_words(struct drbd_conf *mdev)592592+{593593+ struct drbd_bitmap *b = mdev->bitmap;594594+ ERR_IF(!b) return 0;595595+ ERR_IF(!b->bm_pages) return 0;596596+597597+ return b->bm_words;598598+}599599+600600+unsigned long drbd_bm_bits(struct drbd_conf *mdev)601601+{602602+ struct drbd_bitmap *b = mdev->bitmap;603603+ ERR_IF(!b) return 0;604604+605605+ return b->bm_bits;606606+}607607+608608+/* merge number words from buffer into the bitmap starting at offset.609609+ * buffer[i] is expected to be little endian unsigned long.610610+ * bitmap must be locked by drbd_bm_lock.611611+ * currently only used from receive_bitmap.612612+ */613613+void drbd_bm_merge_lel(struct drbd_conf *mdev, size_t offset, size_t number,614614+ unsigned long *buffer)615615+{616616+ struct drbd_bitmap *b = mdev->bitmap;617617+ unsigned long *p_addr, *bm;618618+ unsigned long word, bits;619619+ size_t end, do_now;620620+621621+ end = offset + number;622622+623623+ ERR_IF(!b) return;624624+ ERR_IF(!b->bm_pages) return;625625+ if (number == 0)626626+ return;627627+ WARN_ON(offset >= b->bm_words);628628+ WARN_ON(end > 
b->bm_words);629629+630630+ spin_lock_irq(&b->bm_lock);631631+ while (offset < end) {632632+ do_now = min_t(size_t, ALIGN(offset+1, LWPP), end) - offset;633633+ p_addr = bm_map_paddr(b, offset);634634+ bm = p_addr + MLPP(offset);635635+ offset += do_now;636636+ while (do_now--) {637637+ bits = hweight_long(*bm);638638+ word = *bm | lel_to_cpu(*buffer++);639639+ *bm++ = word;640640+ b->bm_set += hweight_long(word) - bits;641641+ }642642+ bm_unmap(p_addr);643643+ }644644+ /* with 32bit <-> 64bit cross-platform connect645645+ * this is only correct for current usage,646646+ * where we _know_ that we are 64 bit aligned,647647+ * and know that this function is used in this way, too...648648+ */649649+ if (end == b->bm_words)650650+ b->bm_set -= bm_clear_surplus(b);651651+652652+ spin_unlock_irq(&b->bm_lock);653653+}654654+655655+/* copy number words from the bitmap starting at offset into the buffer.656656+ * buffer[i] will be little endian unsigned long.657657+ */658658+void drbd_bm_get_lel(struct drbd_conf *mdev, size_t offset, size_t number,659659+ unsigned long *buffer)660660+{661661+ struct drbd_bitmap *b = mdev->bitmap;662662+ unsigned long *p_addr, *bm;663663+ size_t end, do_now;664664+665665+ end = offset + number;666666+667667+ ERR_IF(!b) return;668668+ ERR_IF(!b->bm_pages) return;669669+670670+ spin_lock_irq(&b->bm_lock);671671+ if ((offset >= b->bm_words) ||672672+ (end > b->bm_words) ||673673+ (number <= 0))674674+ dev_err(DEV, "offset=%lu number=%lu bm_words=%lu\n",675675+ (unsigned long) offset,676676+ (unsigned long) number,677677+ (unsigned long) b->bm_words);678678+ else {679679+ while (offset < end) {680680+ do_now = min_t(size_t, ALIGN(offset+1, LWPP), end) - offset;681681+ p_addr = bm_map_paddr(b, offset);682682+ bm = p_addr + MLPP(offset);683683+ offset += do_now;684684+ while (do_now--)685685+ *buffer++ = cpu_to_lel(*bm++);686686+ bm_unmap(p_addr);687687+ }688688+ }689689+ spin_unlock_irq(&b->bm_lock);690690+}691691+692692+/* set all bits in the 
bitmap */693693+void drbd_bm_set_all(struct drbd_conf *mdev)694694+{695695+ struct drbd_bitmap *b = mdev->bitmap;696696+ ERR_IF(!b) return;697697+ ERR_IF(!b->bm_pages) return;698698+699699+ spin_lock_irq(&b->bm_lock);700700+ bm_memset(b, 0, 0xff, b->bm_words);701701+ (void)bm_clear_surplus(b);702702+ b->bm_set = b->bm_bits;703703+ spin_unlock_irq(&b->bm_lock);704704+}705705+706706+/* clear all bits in the bitmap */707707+void drbd_bm_clear_all(struct drbd_conf *mdev)708708+{709709+ struct drbd_bitmap *b = mdev->bitmap;710710+ ERR_IF(!b) return;711711+ ERR_IF(!b->bm_pages) return;712712+713713+ spin_lock_irq(&b->bm_lock);714714+ bm_memset(b, 0, 0, b->bm_words);715715+ b->bm_set = 0;716716+ spin_unlock_irq(&b->bm_lock);717717+}718718+719719+static void bm_async_io_complete(struct bio *bio, int error)720720+{721721+ struct drbd_bitmap *b = bio->bi_private;722722+ int uptodate = bio_flagged(bio, BIO_UPTODATE);723723+724724+725725+ /* strange behavior of some lower level drivers...726726+ * fail the request by clearing the uptodate flag,727727+ * but do not return any error?!728728+ * do we want to WARN() on this? */729729+ if (!error && !uptodate)730730+ error = -EIO;731731+732732+ if (error) {733733+ /* doh. what now?734734+ * for now, set all bits, and flag MD_IO_ERROR */735735+ __set_bit(BM_MD_IO_ERROR, &b->bm_flags);736736+ }737737+ if (atomic_dec_and_test(&b->bm_async_io))738738+ wake_up(&b->bm_io_wait);739739+740740+ bio_put(bio);741741+}742742+743743+static void bm_page_io_async(struct drbd_conf *mdev, struct drbd_bitmap *b, int page_nr, int rw) __must_hold(local)744744+{745745+ /* we are process context. 
we always get a bio */746746+ struct bio *bio = bio_alloc(GFP_KERNEL, 1);747747+ unsigned int len;748748+ sector_t on_disk_sector =749749+ mdev->ldev->md.md_offset + mdev->ldev->md.bm_offset;750750+ on_disk_sector += ((sector_t)page_nr) << (PAGE_SHIFT-9);751751+752752+ /* this might happen with very small753753+ * flexible external meta data device */754754+ len = min_t(unsigned int, PAGE_SIZE,755755+ (drbd_md_last_sector(mdev->ldev) - on_disk_sector + 1)<<9);756756+757757+ bio->bi_bdev = mdev->ldev->md_bdev;758758+ bio->bi_sector = on_disk_sector;759759+ bio_add_page(bio, b->bm_pages[page_nr], len, 0);760760+ bio->bi_private = b;761761+ bio->bi_end_io = bm_async_io_complete;762762+763763+ if (FAULT_ACTIVE(mdev, (rw & WRITE) ? DRBD_FAULT_MD_WR : DRBD_FAULT_MD_RD)) {764764+ bio->bi_rw |= rw;765765+ bio_endio(bio, -EIO);766766+ } else {767767+ submit_bio(rw, bio);768768+ }769769+}770770+771771+# if defined(__LITTLE_ENDIAN)772772+ /* nothing to do, on disk == in memory */773773+# define bm_cpu_to_lel(x) ((void)0)774774+# else775775+void bm_cpu_to_lel(struct drbd_bitmap *b)776776+{777777+ /* need to cpu_to_lel all the pages ...778778+ * this may be optimized by using779779+ * cpu_to_lel(-1) == -1 and cpu_to_lel(0) == 0;780780+ * the following is still not optimal, but better than nothing */781781+ unsigned int i;782782+ unsigned long *p_addr, *bm;783783+ if (b->bm_set == 0) {784784+ /* no page at all; avoid swap if all is 0 */785785+ i = b->bm_number_of_pages;786786+ } else if (b->bm_set == b->bm_bits) {787787+ /* only the last page */788788+ i = b->bm_number_of_pages - 1;789789+ } else {790790+ /* all pages */791791+ i = 0;792792+ }793793+ for (; i < b->bm_number_of_pages; i++) {794794+ p_addr = kmap_atomic(b->bm_pages[i], KM_USER0);795795+ for (bm = p_addr; bm < p_addr + PAGE_SIZE/sizeof(long); bm++)796796+ *bm = cpu_to_lel(*bm);797797+ kunmap_atomic(p_addr, KM_USER0);798798+ }799799+}800800+# endif801801+/* lel_to_cpu == cpu_to_lel */802802+# define bm_lel_to_cpu(x) 
bm_cpu_to_lel(x)803803+804804+/*805805+ * bm_rw: read/write the whole bitmap from/to its on disk location.806806+ */807807+static int bm_rw(struct drbd_conf *mdev, int rw) __must_hold(local)808808+{809809+ struct drbd_bitmap *b = mdev->bitmap;810810+ /* sector_t sector; */811811+ int bm_words, num_pages, i;812812+ unsigned long now;813813+ char ppb[10];814814+ int err = 0;815815+816816+ WARN_ON(!bm_is_locked(b));817817+818818+ /* no spinlock here, the drbd_bm_lock should be enough! */819819+820820+ bm_words = drbd_bm_words(mdev);821821+ num_pages = (bm_words*sizeof(long) + PAGE_SIZE-1) >> PAGE_SHIFT;822822+823823+ /* on disk bitmap is little endian */824824+ if (rw == WRITE)825825+ bm_cpu_to_lel(b);826826+827827+ now = jiffies;828828+ atomic_set(&b->bm_async_io, num_pages);829829+ __clear_bit(BM_MD_IO_ERROR, &b->bm_flags);830830+831831+ /* let the layers below us try to merge these bios... */832832+ for (i = 0; i < num_pages; i++)833833+ bm_page_io_async(mdev, b, i, rw);834834+835835+ drbd_blk_run_queue(bdev_get_queue(mdev->ldev->md_bdev));836836+ wait_event(b->bm_io_wait, atomic_read(&b->bm_async_io) == 0);837837+838838+ if (test_bit(BM_MD_IO_ERROR, &b->bm_flags)) {839839+ dev_alert(DEV, "we had at least one MD IO ERROR during bitmap IO\n");840840+ drbd_chk_io_error(mdev, 1, TRUE);841841+ err = -EIO;842842+ }843843+844844+ now = jiffies;845845+ if (rw == WRITE) {846846+ /* swap back endianness */847847+ bm_lel_to_cpu(b);848848+ /* flush bitmap to stable storage */849849+ drbd_md_flush(mdev);850850+ } else /* rw == READ */ {851851+ /* just read, if necessary adjust endianness */852852+ b->bm_set = bm_count_bits_swap_endian(b);853853+ dev_info(DEV, "recounting of set bits took additional %lu jiffies\n",854854+ jiffies - now);855855+ }856856+ now = b->bm_set;857857+858858+ dev_info(DEV, "%s (%lu bits) marked out-of-sync by on disk bit-map.\n",859859+ ppsize(ppb, now << (BM_BLOCK_SHIFT-10)), now);860860+861861+ return err;862862+}863863+864864+/**865865+ * 
drbd_bm_read() - Read the whole bitmap from its on disk location.866866+ * @mdev: DRBD device.867867+ */868868+int drbd_bm_read(struct drbd_conf *mdev) __must_hold(local)869869+{870870+ return bm_rw(mdev, READ);871871+}872872+873873+/**874874+ * drbd_bm_write() - Write the whole bitmap to its on disk location.875875+ * @mdev: DRBD device.876876+ */877877+int drbd_bm_write(struct drbd_conf *mdev) __must_hold(local)878878+{879879+ return bm_rw(mdev, WRITE);880880+}881881+882882+/**883883+ * drbd_bm_write_sect: Writes a 512 (MD_SECTOR_SIZE) byte piece of the bitmap884884+ * @mdev: DRBD device.885885+ * @enr: Extent number in the resync lru (happens to be sector offset)886886+ *887887+ * The BM_EXT_SIZE is on purpose exactly the amount of the bitmap covered888888+ * by a single sector write. Therefore enr == sector offset from the889889+ * start of the bitmap.890890+ */891891+int drbd_bm_write_sect(struct drbd_conf *mdev, unsigned long enr) __must_hold(local)892892+{893893+ sector_t on_disk_sector = enr + mdev->ldev->md.md_offset894894+ + mdev->ldev->md.bm_offset;895895+ int bm_words, num_words, offset;896896+ int err = 0;897897+898898+ mutex_lock(&mdev->md_io_mutex);899899+ bm_words = drbd_bm_words(mdev);900900+ offset = S2W(enr); /* word offset into bitmap */901901+ num_words = min(S2W(1), bm_words - offset);902902+ if (num_words < S2W(1))903903+ memset(page_address(mdev->md_io_page), 0, MD_SECTOR_SIZE);904904+ drbd_bm_get_lel(mdev, offset, num_words,905905+ page_address(mdev->md_io_page));906906+ if (!drbd_md_sync_page_io(mdev, mdev->ldev, on_disk_sector, WRITE)) {907907+ int i;908908+ err = -EIO;909909+ dev_err(DEV, "IO ERROR writing bitmap sector %lu "910910+ "(meta-disk sector %llus)\n",911911+ enr, (unsigned long long)on_disk_sector);912912+ drbd_chk_io_error(mdev, 1, TRUE);913913+ for (i = 0; i < AL_EXT_PER_BM_SECT; i++)914914+ drbd_bm_ALe_set_all(mdev, enr*AL_EXT_PER_BM_SECT+i);915915+ }916916+ mdev->bm_writ_cnt++;917917+ 
mutex_unlock(&mdev->md_io_mutex);918918+ return err;919919+}920920+921921+/* NOTE922922+ * find_first_bit returns int, we return unsigned long.923923+ * should not make much difference anyways, but ...924924+ *925925+ * this returns a bit number, NOT a sector!926926+ */927927+#define BPP_MASK ((1UL << (PAGE_SHIFT+3)) - 1)928928+static unsigned long __bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo,929929+ const int find_zero_bit, const enum km_type km)930930+{931931+ struct drbd_bitmap *b = mdev->bitmap;932932+ unsigned long i = -1UL;933933+ unsigned long *p_addr;934934+ unsigned long bit_offset; /* bit offset of the mapped page. */935935+936936+ if (bm_fo > b->bm_bits) {937937+ dev_err(DEV, "bm_fo=%lu bm_bits=%lu\n", bm_fo, b->bm_bits);938938+ } else {939939+ while (bm_fo < b->bm_bits) {940940+ unsigned long offset;941941+ bit_offset = bm_fo & ~BPP_MASK; /* bit offset of the page */942942+ offset = bit_offset >> LN2_BPL; /* word offset of the page */943943+ p_addr = __bm_map_paddr(b, offset, km);944944+945945+ if (find_zero_bit)946946+ i = find_next_zero_bit(p_addr, PAGE_SIZE*8, bm_fo & BPP_MASK);947947+ else948948+ i = find_next_bit(p_addr, PAGE_SIZE*8, bm_fo & BPP_MASK);949949+950950+ __bm_unmap(p_addr, km);951951+ if (i < PAGE_SIZE*8) {952952+ i = bit_offset + i;953953+ if (i >= b->bm_bits)954954+ break;955955+ goto found;956956+ }957957+ bm_fo = bit_offset + PAGE_SIZE*8;958958+ }959959+ i = -1UL;960960+ }961961+ found:962962+ return i;963963+}964964+965965+static unsigned long bm_find_next(struct drbd_conf *mdev,966966+ unsigned long bm_fo, const int find_zero_bit)967967+{968968+ struct drbd_bitmap *b = mdev->bitmap;969969+ unsigned long i = -1UL;970970+971971+ ERR_IF(!b) return i;972972+ ERR_IF(!b->bm_pages) return i;973973+974974+ spin_lock_irq(&b->bm_lock);975975+ if (bm_is_locked(b))976976+ bm_print_lock_info(mdev);977977+978978+ i = __bm_find_next(mdev, bm_fo, find_zero_bit, KM_IRQ1);979979+980980+ spin_unlock_irq(&b->bm_lock);981981+ return 
i;982982+}983983+984984+unsigned long drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo)985985+{986986+ return bm_find_next(mdev, bm_fo, 0);987987+}988988+989989+#if 0990990+/* not yet needed for anything. */991991+unsigned long drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo)992992+{993993+ return bm_find_next(mdev, bm_fo, 1);994994+}995995+#endif996996+997997+/* does not spin_lock_irqsave.998998+ * you must take drbd_bm_lock() first */999999+unsigned long _drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo)10001000+{10011001+ /* WARN_ON(!bm_is_locked(mdev)); */10021002+ return __bm_find_next(mdev, bm_fo, 0, KM_USER1);10031003+}10041004+10051005+unsigned long _drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo)10061006+{10071007+ /* WARN_ON(!bm_is_locked(mdev)); */10081008+ return __bm_find_next(mdev, bm_fo, 1, KM_USER1);10091009+}10101010+10111011+/* returns number of bits actually changed.10121012+ * for val != 0, we change 0 -> 1, return code positive10131013+ * for val == 0, we change 1 -> 0, return code negative10141014+ * wants bitnr, not sector.10151015+ * expected to be called for only a few bits (e - s about BITS_PER_LONG).10161016+ * Must hold bitmap lock already. */10171017+int __bm_change_bits_to(struct drbd_conf *mdev, const unsigned long s,10181018+ unsigned long e, int val, const enum km_type km)10191019+{10201020+ struct drbd_bitmap *b = mdev->bitmap;10211021+ unsigned long *p_addr = NULL;10221022+ unsigned long bitnr;10231023+ unsigned long last_page_nr = -1UL;10241024+ int c = 0;10251025+10261026+ if (e >= b->bm_bits) {10271027+ dev_err(DEV, "ASSERT FAILED: bit_s=%lu bit_e=%lu bm_bits=%lu\n",10281028+ s, e, b->bm_bits);10291029+ e = b->bm_bits ? 
b->bm_bits -1 : 0;10301030+ }10311031+ for (bitnr = s; bitnr <= e; bitnr++) {10321032+ unsigned long offset = bitnr>>LN2_BPL;10331033+ unsigned long page_nr = offset >> (PAGE_SHIFT - LN2_BPL + 3);10341034+ if (page_nr != last_page_nr) {10351035+ if (p_addr)10361036+ __bm_unmap(p_addr, km);10371037+ p_addr = __bm_map_paddr(b, offset, km);10381038+ last_page_nr = page_nr;10391039+ }10401040+ if (val)10411041+ c += (0 == __test_and_set_bit(bitnr & BPP_MASK, p_addr));10421042+ else10431043+ c -= (0 != __test_and_clear_bit(bitnr & BPP_MASK, p_addr));10441044+ }10451045+ if (p_addr)10461046+ __bm_unmap(p_addr, km);10471047+ b->bm_set += c;10481048+ return c;10491049+}10501050+10511051+/* returns number of bits actually changed.10521052+ * for val != 0, we change 0 -> 1, return code positive10531053+ * for val == 0, we change 1 -> 0, return code negative10541054+ * wants bitnr, not sector */10551055+int bm_change_bits_to(struct drbd_conf *mdev, const unsigned long s,10561056+ const unsigned long e, int val)10571057+{10581058+ unsigned long flags;10591059+ struct drbd_bitmap *b = mdev->bitmap;10601060+ int c = 0;10611061+10621062+ ERR_IF(!b) return 1;10631063+ ERR_IF(!b->bm_pages) return 0;10641064+10651065+ spin_lock_irqsave(&b->bm_lock, flags);10661066+ if (bm_is_locked(b))10671067+ bm_print_lock_info(mdev);10681068+10691069+ c = __bm_change_bits_to(mdev, s, e, val, KM_IRQ1);10701070+10711071+ spin_unlock_irqrestore(&b->bm_lock, flags);10721072+ return c;10731073+}10741074+10751075+/* returns number of bits changed 0 -> 1 */10761076+int drbd_bm_set_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)10771077+{10781078+ return bm_change_bits_to(mdev, s, e, 1);10791079+}10801080+10811081+/* returns number of bits changed 1 -> 0 */10821082+int drbd_bm_clear_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)10831083+{10841084+ return -bm_change_bits_to(mdev, s, e, 0);10851085+}10861086+10871087+/* sets all bits in full 
words,10881088+ * from first_word up to, but not including, last_word */10891089+static inline void bm_set_full_words_within_one_page(struct drbd_bitmap *b,10901090+ int page_nr, int first_word, int last_word)10911091+{10921092+ int i;10931093+ int bits;10941094+ unsigned long *paddr = kmap_atomic(b->bm_pages[page_nr], KM_USER0);10951095+ for (i = first_word; i < last_word; i++) {10961096+ bits = hweight_long(paddr[i]);10971097+ paddr[i] = ~0UL;10981098+ b->bm_set += BITS_PER_LONG - bits;10991099+ }11001100+ kunmap_atomic(paddr, KM_USER0);11011101+}11021102+11031103+/* Same thing as drbd_bm_set_bits, but without taking the spin_lock_irqsave.11041104+ * You must first drbd_bm_lock().11051105+ * Can be called to set the whole bitmap in one go.11061106+ * Sets bits from s to e _inclusive_. */11071107+void _drbd_bm_set_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)11081108+{11091109+ /* First set_bit from the first bit (s)11101110+ * up to the next long boundary (sl),11111111+ * then assign full words up to the last long boundary (el),11121112+ * then set_bit up to and including the last bit (e).11131113+ *11141114+ * Do not use memset, because we must account for changes,11151115+ * so we need to loop over the words with hweight() anyways.11161116+ */11171117+ unsigned long sl = ALIGN(s,BITS_PER_LONG);11181118+ unsigned long el = (e+1) & ~((unsigned long)BITS_PER_LONG-1);11191119+ int first_page;11201120+ int last_page;11211121+ int page_nr;11221122+ int first_word;11231123+ int last_word;11241124+11251125+ if (e - s <= 3*BITS_PER_LONG) {11261126+ /* don't bother; el and sl may even be wrong. 
*/11271127+ __bm_change_bits_to(mdev, s, e, 1, KM_USER0);11281128+ return;11291129+ }11301130+11311131+ /* difference is large enough that we can trust sl and el */11321132+11331133+ /* bits filling the current long */11341134+ if (sl)11351135+ __bm_change_bits_to(mdev, s, sl-1, 1, KM_USER0);11361136+11371137+ first_page = sl >> (3 + PAGE_SHIFT);11381138+ last_page = el >> (3 + PAGE_SHIFT);11391139+11401140+ /* MLPP: modulo longs per page */11411141+ /* LWPP: long words per page */11421142+ first_word = MLPP(sl >> LN2_BPL);11431143+ last_word = LWPP;11441144+11451145+ /* first and full pages, unless first page == last page */11461146+ for (page_nr = first_page; page_nr < last_page; page_nr++) {11471147+ bm_set_full_words_within_one_page(mdev->bitmap, page_nr, first_word, last_word);11481148+ cond_resched();11491149+ first_word = 0;11501150+ }11511151+11521152+ /* last page (respectively only page, for first page == last page) */11531153+ last_word = MLPP(el >> LN2_BPL);11541154+ bm_set_full_words_within_one_page(mdev->bitmap, last_page, first_word, last_word);11551155+11561156+ /* possibly trailing bits.11571157+ * example: (e & 63) == 63, el will be e+1.11581158+ * if that even was the very last bit,11591159+ * it would trigger an assert in __bm_change_bits_to()11601160+ */11611161+ if (el <= e)11621162+ __bm_change_bits_to(mdev, el, e, 1, KM_USER0);11631163+}11641164+11651165+/* returns bit state11661166+ * wants bitnr, NOT sector.11671167+ * inherently racy... area needs to be locked by means of {al,rs}_lru11681168+ * 1 ... bit set11691169+ * 0 ... bit not set11701170+ * -1 ... 
first out of bounds access, stop testing for bits!11711171+ */11721172+int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)11731173+{11741174+ unsigned long flags;11751175+ struct drbd_bitmap *b = mdev->bitmap;11761176+ unsigned long *p_addr;11771177+ int i;11781178+11791179+ ERR_IF(!b) return 0;11801180+ ERR_IF(!b->bm_pages) return 0;11811181+11821182+ spin_lock_irqsave(&b->bm_lock, flags);11831183+ if (bm_is_locked(b))11841184+ bm_print_lock_info(mdev);11851185+ if (bitnr < b->bm_bits) {11861186+ unsigned long offset = bitnr>>LN2_BPL;11871187+ p_addr = bm_map_paddr(b, offset);11881188+ i = test_bit(bitnr & BPP_MASK, p_addr) ? 1 : 0;11891189+ bm_unmap(p_addr);11901190+ } else if (bitnr == b->bm_bits) {11911191+ i = -1;11921192+ } else { /* (bitnr > b->bm_bits) */11931193+ dev_err(DEV, "bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);11941194+ i = 0;11951195+ }11961196+11971197+ spin_unlock_irqrestore(&b->bm_lock, flags);11981198+ return i;11991199+}12001200+12011201+/* returns number of bits set in the range [s, e] */12021202+int drbd_bm_count_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e)12031203+{12041204+ unsigned long flags;12051205+ struct drbd_bitmap *b = mdev->bitmap;12061206+ unsigned long *p_addr = NULL, page_nr = -1;12071207+ unsigned long bitnr;12081208+ int c = 0;12091209+ size_t w;12101210+12111211+ /* If this is called without a bitmap, that is a bug. 
But just to be12121212+ * robust in case we screwed up elsewhere, in that case pretend there12131213+ * was one dirty bit in the requested area, so we won't try to do a12141214+ * local read there (no bitmap probably implies no disk) */12151215+ ERR_IF(!b) return 1;12161216+ ERR_IF(!b->bm_pages) return 1;12171217+12181218+ spin_lock_irqsave(&b->bm_lock, flags);12191219+ if (bm_is_locked(b))12201220+ bm_print_lock_info(mdev);12211221+ for (bitnr = s; bitnr <= e; bitnr++) {12221222+ w = bitnr >> LN2_BPL;12231223+ if (page_nr != w >> (PAGE_SHIFT - LN2_BPL + 3)) {12241224+ page_nr = w >> (PAGE_SHIFT - LN2_BPL + 3);12251225+ if (p_addr)12261226+ bm_unmap(p_addr);12271227+ p_addr = bm_map_paddr(b, w);12281228+ }12291229+ ERR_IF (bitnr >= b->bm_bits) {12301230+ dev_err(DEV, "bitnr=%lu bm_bits=%lu\n", bitnr, b->bm_bits);12311231+ } else {12321232+ c += (0 != test_bit(bitnr - (page_nr << (PAGE_SHIFT+3)), p_addr));12331233+ }12341234+ }12351235+ if (p_addr)12361236+ bm_unmap(p_addr);12371237+ spin_unlock_irqrestore(&b->bm_lock, flags);12381238+ return c;12391239+}12401240+12411241+12421242+/* inherently racy...12431243+ * return value may be already out-of-date when this function returns.12441244+ * but the general usage is that this is only use during a cstate when bits are12451245+ * only cleared, not set, and typically only care for the case when the return12461246+ * value is zero, or we already "locked" this "bitmap extent" by other means.12471247+ *12481248+ * enr is bm-extent number, since we chose to name one sector (512 bytes)12491249+ * worth of the bitmap a "bitmap extent".12501250+ *12511251+ * TODO12521252+ * I think since we use it like a reference count, we should use the real12531253+ * reference count of some bitmap extent element from some lru instead...12541254+ *12551255+ */12561256+int drbd_bm_e_weight(struct drbd_conf *mdev, unsigned long enr)12571257+{12581258+ struct drbd_bitmap *b = mdev->bitmap;12591259+ int count, s, e;12601260+ unsigned long 
flags;12611261+ unsigned long *p_addr, *bm;12621262+12631263+ ERR_IF(!b) return 0;12641264+ ERR_IF(!b->bm_pages) return 0;12651265+12661266+ spin_lock_irqsave(&b->bm_lock, flags);12671267+ if (bm_is_locked(b))12681268+ bm_print_lock_info(mdev);12691269+12701270+ s = S2W(enr);12711271+ e = min((size_t)S2W(enr+1), b->bm_words);12721272+ count = 0;12731273+ if (s < b->bm_words) {12741274+ int n = e-s;12751275+ p_addr = bm_map_paddr(b, s);12761276+ bm = p_addr + MLPP(s);12771277+ while (n--)12781278+ count += hweight_long(*bm++);12791279+ bm_unmap(p_addr);12801280+ } else {12811281+ dev_err(DEV, "start offset (%d) too large in drbd_bm_e_weight\n", s);12821282+ }12831283+ spin_unlock_irqrestore(&b->bm_lock, flags);12841284+ return count;12851285+}12861286+12871287+/* set all bits covered by the AL-extent al_enr */12881288+unsigned long drbd_bm_ALe_set_all(struct drbd_conf *mdev, unsigned long al_enr)12891289+{12901290+ struct drbd_bitmap *b = mdev->bitmap;12911291+ unsigned long *p_addr, *bm;12921292+ unsigned long weight;12931293+ int count, s, e, i, do_now;12941294+ ERR_IF(!b) return 0;12951295+ ERR_IF(!b->bm_pages) return 0;12961296+12971297+ spin_lock_irq(&b->bm_lock);12981298+ if (bm_is_locked(b))12991299+ bm_print_lock_info(mdev);13001300+ weight = b->bm_set;13011301+13021302+ s = al_enr * BM_WORDS_PER_AL_EXT;13031303+ e = min_t(size_t, s + BM_WORDS_PER_AL_EXT, b->bm_words);13041304+ /* assert that s and e are on the same page */13051305+ D_ASSERT((e-1) >> (PAGE_SHIFT - LN2_BPL + 3)13061306+ == s >> (PAGE_SHIFT - LN2_BPL + 3));13071307+ count = 0;13081308+ if (s < b->bm_words) {13091309+ i = do_now = e-s;13101310+ p_addr = bm_map_paddr(b, s);13111311+ bm = p_addr + MLPP(s);13121312+ while (i--) {13131313+ count += hweight_long(*bm);13141314+ *bm = -1UL;13151315+ bm++;13161316+ }13171317+ bm_unmap(p_addr);13181318+ b->bm_set += do_now*BITS_PER_LONG - count;13191319+ if (e == b->bm_words)13201320+ b->bm_set -= bm_clear_surplus(b);13211321+ } else {13221322+ 
dev_err(DEV, "start offset (%d) too large in drbd_bm_ALe_set_all\n", s);13231323+ }13241324+ weight = b->bm_set - weight;13251325+ spin_unlock_irq(&b->bm_lock);13261326+ return weight;13271327+}
+2258 drivers/block/drbd/drbd_int.h
/*
  drbd_int.h

  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

  Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
  Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
  Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

  drbd is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2, or (at your option)
  any later version.

  drbd is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with drbd; see the file COPYING.  If not, write to
  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

*/

#ifndef _DRBD_INT_H
#define _DRBD_INT_H

#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/version.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/bitops.h>
#include <linux/slab.h>
#include <linux/crypto.h>
#include <linux/tcp.h>
#include <linux/mutex.h>
#include <linux/major.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <net/tcp.h>
#include <linux/lru_cache.h>

#ifdef __CHECKER__
# define __protected_by(x)       __attribute__((require_context(x,1,999,"rdwr")))
# define __protected_read_by(x)  __attribute__((require_context(x,1,999,"read")))
# define __protected_write_by(x) __attribute__((require_context(x,1,999,"write")))
# define __must_hold(x)       __attribute__((context(x,1,1), require_context(x,1,999,"call")))
#else
# define __protected_by(x)
# define __protected_read_by(x)
# define __protected_write_by(x)
# define __must_hold(x)
#endif

#define __no_warn(lock, stmt) do { __acquire(lock); stmt; __release(lock); } while (0)

/* module parameters, defined in drbd_main.c */
extern unsigned int minor_count;
extern int disable_sendpage;
extern int allow_oos;
extern unsigned int cn_idx;

#ifdef CONFIG_DRBD_FAULT_INJECTION
extern int enable_faults;
extern int fault_rate;
extern int fault_devs;
#endif

extern char usermode_helper[];


#ifndef TRUE
#define TRUE 1
#endif
#ifndef FALSE
#define FALSE 0
#endif

/* I don't remember why XCPU ...
 * This is used to wake the asender,
 * and to interrupt the sending task
 * on disconnect.
 */
#define DRBD_SIG SIGXCPU

/* This is used to stop/restart our threads.
 * Cannot use SIGTERM nor SIGKILL, since these
 * are sent out by init on runlevel changes.
 * I choose SIGHUP for now.
 */
#define DRBD_SIGKILL SIGHUP

/* All EEs on the free list should have ID_VACANT (== 0)
 * freshly allocated EEs get !ID_VACANT (== 1)
 * so if it says "cannot dereference null pointer at address 0x00000001",
 * it is most likely one of these :( */

#define ID_IN_SYNC      (4711ULL)
#define ID_OUT_OF_SYNC  (4712ULL)

#define ID_SYNCER (-1ULL)
#define ID_VACANT 0
#define is_syncer_block_id(id) ((id) == ID_SYNCER)

struct drbd_conf;


/* to shorten dev_warn(DEV, "msg"); and relatives statements */
#define DEV (disk_to_dev(mdev->vdisk))

#define D_ASSERT(exp)	if (!(exp)) \
	 dev_err(DEV, "ASSERT( " #exp " ) in %s:%d\n", __FILE__, __LINE__)

#define ERR_IF(exp) if (({				\
	int _b = (exp) != 0;				\
	if (_b) dev_err(DEV, "%s: (%s) in %s:%d\n",	\
		__func__, #exp, __FILE__, __LINE__);	\
	 _b;						\
	}))

/* Defines to control fault insertion */
enum {
	DRBD_FAULT_MD_WR = 0,	/* meta data write */
	DRBD_FAULT_MD_RD = 1,	/* read */
	DRBD_FAULT_RS_WR = 2,	/* resync */
	DRBD_FAULT_RS_RD = 3,
	DRBD_FAULT_DT_WR = 4,	/* data */
	DRBD_FAULT_DT_RD = 5,
	DRBD_FAULT_DT_RA = 6,	/* data read ahead */
	DRBD_FAULT_BM_ALLOC = 7,	/* bitmap allocation */
	DRBD_FAULT_AL_EE = 8,	/* alloc ee */

	DRBD_FAULT_MAX,
};

extern void trace_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, ...);

#ifdef CONFIG_DRBD_FAULT_INJECTION
extern unsigned int
_drbd_insert_fault(struct drbd_conf *mdev, unsigned int type);
static inline int
drbd_insert_fault(struct drbd_conf *mdev, unsigned int type) {
	return fault_rate &&
		(enable_faults & (1<<type)) &&
		_drbd_insert_fault(mdev, type);
}
#define FAULT_ACTIVE(_m, _t) (drbd_insert_fault((_m), (_t)))

#else
#define FAULT_ACTIVE(_m, _t) (0)
#endif

/* integer division, round _UP_ to the next integer */
#define div_ceil(A, B) ((A)/(B) + ((A)%(B) ? 1 : 0))
/* usual integer division */
#define div_floor(A, B) ((A)/(B))

/* drbd_meta-data.c (still in drbd_main.c) */
/* 4th incarnation of the disk layout.
*/162162+#define DRBD_MD_MAGIC (DRBD_MAGIC+4)163163+164164+extern struct drbd_conf **minor_table;165165+extern struct ratelimit_state drbd_ratelimit_state;166166+167167+/* on the wire */168168+enum drbd_packets {169169+ /* receiver (data socket) */170170+ P_DATA = 0x00,171171+ P_DATA_REPLY = 0x01, /* Response to P_DATA_REQUEST */172172+ P_RS_DATA_REPLY = 0x02, /* Response to P_RS_DATA_REQUEST */173173+ P_BARRIER = 0x03,174174+ P_BITMAP = 0x04,175175+ P_BECOME_SYNC_TARGET = 0x05,176176+ P_BECOME_SYNC_SOURCE = 0x06,177177+ P_UNPLUG_REMOTE = 0x07, /* Used at various times to hint the peer */178178+ P_DATA_REQUEST = 0x08, /* Used to ask for a data block */179179+ P_RS_DATA_REQUEST = 0x09, /* Used to ask for a data block for resync */180180+ P_SYNC_PARAM = 0x0a,181181+ P_PROTOCOL = 0x0b,182182+ P_UUIDS = 0x0c,183183+ P_SIZES = 0x0d,184184+ P_STATE = 0x0e,185185+ P_SYNC_UUID = 0x0f,186186+ P_AUTH_CHALLENGE = 0x10,187187+ P_AUTH_RESPONSE = 0x11,188188+ P_STATE_CHG_REQ = 0x12,189189+190190+ /* asender (meta socket */191191+ P_PING = 0x13,192192+ P_PING_ACK = 0x14,193193+ P_RECV_ACK = 0x15, /* Used in protocol B */194194+ P_WRITE_ACK = 0x16, /* Used in protocol C */195195+ P_RS_WRITE_ACK = 0x17, /* Is a P_WRITE_ACK, additionally call set_in_sync(). */196196+ P_DISCARD_ACK = 0x18, /* Used in proto C, two-primaries conflict detection */197197+ P_NEG_ACK = 0x19, /* Sent if local disk is unusable */198198+ P_NEG_DREPLY = 0x1a, /* Local disk is broken... */199199+ P_NEG_RS_DREPLY = 0x1b, /* Local disk is broken... 
*/200200+ P_BARRIER_ACK = 0x1c,201201+ P_STATE_CHG_REPLY = 0x1d,202202+203203+ /* "new" commands, no longer fitting into the ordering scheme above */204204+205205+ P_OV_REQUEST = 0x1e, /* data socket */206206+ P_OV_REPLY = 0x1f,207207+ P_OV_RESULT = 0x20, /* meta socket */208208+ P_CSUM_RS_REQUEST = 0x21, /* data socket */209209+ P_RS_IS_IN_SYNC = 0x22, /* meta socket */210210+ P_SYNC_PARAM89 = 0x23, /* data socket, protocol version 89 replacement for P_SYNC_PARAM */211211+ P_COMPRESSED_BITMAP = 0x24, /* compressed or otherwise encoded bitmap transfer */212212+213213+ P_MAX_CMD = 0x25,214214+ P_MAY_IGNORE = 0x100, /* Flag to test if (cmd > P_MAY_IGNORE) ... */215215+ P_MAX_OPT_CMD = 0x101,216216+217217+ /* special command ids for handshake */218218+219219+ P_HAND_SHAKE_M = 0xfff1, /* First Packet on the MetaSock */220220+ P_HAND_SHAKE_S = 0xfff2, /* First Packet on the Socket */221221+222222+ P_HAND_SHAKE = 0xfffe /* FIXED for the next century! */223223+};224224+225225+static inline const char *cmdname(enum drbd_packets cmd)226226+{227227+ /* THINK may need to become several global tables228228+ * when we want to support more than229229+ * one PRO_VERSION */230230+ static const char *cmdnames[] = {231231+ [P_DATA] = "Data",232232+ [P_DATA_REPLY] = "DataReply",233233+ [P_RS_DATA_REPLY] = "RSDataReply",234234+ [P_BARRIER] = "Barrier",235235+ [P_BITMAP] = "ReportBitMap",236236+ [P_BECOME_SYNC_TARGET] = "BecomeSyncTarget",237237+ [P_BECOME_SYNC_SOURCE] = "BecomeSyncSource",238238+ [P_UNPLUG_REMOTE] = "UnplugRemote",239239+ [P_DATA_REQUEST] = "DataRequest",240240+ [P_RS_DATA_REQUEST] = "RSDataRequest",241241+ [P_SYNC_PARAM] = "SyncParam",242242+ [P_SYNC_PARAM89] = "SyncParam89",243243+ [P_PROTOCOL] = "ReportProtocol",244244+ [P_UUIDS] = "ReportUUIDs",245245+ [P_SIZES] = "ReportSizes",246246+ [P_STATE] = "ReportState",247247+ [P_SYNC_UUID] = "ReportSyncUUID",248248+ [P_AUTH_CHALLENGE] = "AuthChallenge",249249+ [P_AUTH_RESPONSE] = "AuthResponse",250250+ [P_PING] = 
"Ping",251251+ [P_PING_ACK] = "PingAck",252252+ [P_RECV_ACK] = "RecvAck",253253+ [P_WRITE_ACK] = "WriteAck",254254+ [P_RS_WRITE_ACK] = "RSWriteAck",255255+ [P_DISCARD_ACK] = "DiscardAck",256256+ [P_NEG_ACK] = "NegAck",257257+ [P_NEG_DREPLY] = "NegDReply",258258+ [P_NEG_RS_DREPLY] = "NegRSDReply",259259+ [P_BARRIER_ACK] = "BarrierAck",260260+ [P_STATE_CHG_REQ] = "StateChgRequest",261261+ [P_STATE_CHG_REPLY] = "StateChgReply",262262+ [P_OV_REQUEST] = "OVRequest",263263+ [P_OV_REPLY] = "OVReply",264264+ [P_OV_RESULT] = "OVResult",265265+ [P_MAX_CMD] = NULL,266266+ };267267+268268+ if (cmd == P_HAND_SHAKE_M)269269+ return "HandShakeM";270270+ if (cmd == P_HAND_SHAKE_S)271271+ return "HandShakeS";272272+ if (cmd == P_HAND_SHAKE)273273+ return "HandShake";274274+ if (cmd >= P_MAX_CMD)275275+ return "Unknown";276276+ return cmdnames[cmd];277277+}278278+279279+/* for sending/receiving the bitmap,280280+ * possibly in some encoding scheme */281281+struct bm_xfer_ctx {282282+ /* "const"283283+ * stores total bits and long words284284+ * of the bitmap, so we don't need to285285+ * call the accessor functions over and again. 
*/286286+ unsigned long bm_bits;287287+ unsigned long bm_words;288288+ /* during xfer, current position within the bitmap */289289+ unsigned long bit_offset;290290+ unsigned long word_offset;291291+292292+ /* statistics; index: (h->command == P_BITMAP) */293293+ unsigned packets[2];294294+ unsigned bytes[2];295295+};296296+297297+extern void INFO_bm_xfer_stats(struct drbd_conf *mdev,298298+ const char *direction, struct bm_xfer_ctx *c);299299+300300+static inline void bm_xfer_ctx_bit_to_word_offset(struct bm_xfer_ctx *c)301301+{302302+ /* word_offset counts "native long words" (32 or 64 bit),303303+ * aligned at 64 bit.304304+ * Encoded packet may end at an unaligned bit offset.305305+ * In case a fallback clear text packet is transmitted in306306+ * between, we adjust this offset back to the last 64bit307307+ * aligned "native long word", which makes coding and decoding308308+ * the plain text bitmap much more convenient. */309309+#if BITS_PER_LONG == 64310310+ c->word_offset = c->bit_offset >> 6;311311+#elif BITS_PER_LONG == 32312312+ c->word_offset = c->bit_offset >> 5;313313+ c->word_offset &= ~(1UL);314314+#else315315+# error "unsupported BITS_PER_LONG"316316+#endif317317+}318318+319319+#ifndef __packed320320+#define __packed __attribute__((packed))321321+#endif322322+323323+/* This is the layout for a packet on the wire.324324+ * The byteorder is the network byte order.325325+ * (except block_id and barrier fields.326326+ * these are pointers to local structs327327+ * and have no relevance for the partner,328328+ * which just echoes them as received.)329329+ *330330+ * NOTE that the payload starts at a long aligned offset,331331+ * regardless of 32 or 64 bit arch!332332+ */333333+struct p_header {334334+ u32 magic;335335+ u16 command;336336+ u16 length; /* bytes of data after this header */337337+ u8 payload[0];338338+} __packed;339339+/* 8 bytes. packet FIXED for the next century! 
*/340340+341341+/*342342+ * short commands, packets without payload, plain p_header:343343+ * P_PING344344+ * P_PING_ACK345345+ * P_BECOME_SYNC_TARGET346346+ * P_BECOME_SYNC_SOURCE347347+ * P_UNPLUG_REMOTE348348+ */349349+350350+/*351351+ * commands with out-of-struct payload:352352+ * P_BITMAP (no additional fields)353353+ * P_DATA, P_DATA_REPLY (see p_data)354354+ * P_COMPRESSED_BITMAP (see receive_compressed_bitmap)355355+ */356356+357357+/* these defines must not be changed without changing the protocol version */358358+#define DP_HARDBARRIER 1359359+#define DP_RW_SYNC 2360360+#define DP_MAY_SET_IN_SYNC 4361361+362362+struct p_data {363363+ struct p_header head;364364+ u64 sector; /* 64 bits sector number */365365+ u64 block_id; /* to identify the request in protocol B&C */366366+ u32 seq_num;367367+ u32 dp_flags;368368+} __packed;369369+370370+/*371371+ * commands which share a struct:372372+ * p_block_ack:373373+ * P_RECV_ACK (proto B), P_WRITE_ACK (proto C),374374+ * P_DISCARD_ACK (proto C, two-primaries conflict detection)375375+ * p_block_req:376376+ * P_DATA_REQUEST, P_RS_DATA_REQUEST377377+ */378378+struct p_block_ack {379379+ struct p_header head;380380+ u64 sector;381381+ u64 block_id;382382+ u32 blksize;383383+ u32 seq_num;384384+} __packed;385385+386386+387387+struct p_block_req {388388+ struct p_header head;389389+ u64 sector;390390+ u64 block_id;391391+ u32 blksize;392392+ u32 pad; /* to multiple of 8 Byte */393393+} __packed;394394+395395+/*396396+ * commands with their own struct for additional fields:397397+ * P_HAND_SHAKE398398+ * P_BARRIER399399+ * P_BARRIER_ACK400400+ * P_SYNC_PARAM401401+ * ReportParams402402+ */403403+404404+struct p_handshake {405405+ struct p_header head; /* 8 bytes */406406+ u32 protocol_min;407407+ u32 feature_flags;408408+ u32 protocol_max;409409+410410+ /* should be more than enough for future enhancements411411+ * for now, feature_flags and the reserverd array shall be zero.412412+ */413413+414414+ u32 _pad;415415+ 
u64 reserverd[7];416416+} __packed;417417+/* 80 bytes, FIXED for the next century */418418+419419+struct p_barrier {420420+ struct p_header head;421421+ u32 barrier; /* barrier number _handle_ only */422422+ u32 pad; /* to multiple of 8 Byte */423423+} __packed;424424+425425+struct p_barrier_ack {426426+ struct p_header head;427427+ u32 barrier;428428+ u32 set_size;429429+} __packed;430430+431431+struct p_rs_param {432432+ struct p_header head;433433+ u32 rate;434434+435435+ /* Since protocol version 88 and higher. */436436+ char verify_alg[0];437437+} __packed;438438+439439+struct p_rs_param_89 {440440+ struct p_header head;441441+ u32 rate;442442+ /* protocol version 89: */443443+ char verify_alg[SHARED_SECRET_MAX];444444+ char csums_alg[SHARED_SECRET_MAX];445445+} __packed;446446+447447+struct p_protocol {448448+ struct p_header head;449449+ u32 protocol;450450+ u32 after_sb_0p;451451+ u32 after_sb_1p;452452+ u32 after_sb_2p;453453+ u32 want_lose;454454+ u32 two_primaries;455455+456456+ /* Since protocol version 87 and higher. 
*/457457+ char integrity_alg[0];458458+459459+} __packed;460460+461461+struct p_uuids {462462+ struct p_header head;463463+ u64 uuid[UI_EXTENDED_SIZE];464464+} __packed;465465+466466+struct p_rs_uuid {467467+ struct p_header head;468468+ u64 uuid;469469+} __packed;470470+471471+struct p_sizes {472472+ struct p_header head;473473+ u64 d_size; /* size of disk */474474+ u64 u_size; /* user requested size */475475+ u64 c_size; /* current exported size */476476+ u32 max_segment_size; /* Maximal size of a BIO */477477+ u32 queue_order_type;478478+} __packed;479479+480480+struct p_state {481481+ struct p_header head;482482+ u32 state;483483+} __packed;484484+485485+struct p_req_state {486486+ struct p_header head;487487+ u32 mask;488488+ u32 val;489489+} __packed;490490+491491+struct p_req_state_reply {492492+ struct p_header head;493493+ u32 retcode;494494+} __packed;495495+496496+struct p_drbd06_param {497497+ u64 size;498498+ u32 state;499499+ u32 blksize;500500+ u32 protocol;501501+ u32 version;502502+ u32 gen_cnt[5];503503+ u32 bit_map_gen[5];504504+} __packed;505505+506506+struct p_discard {507507+ struct p_header head;508508+ u64 block_id;509509+ u32 seq_num;510510+ u32 pad;511511+} __packed;512512+513513+/* Valid values for the encoding field.514514+ * Bump proto version when changing this. */515515+enum drbd_bitmap_code {516516+ /* RLE_VLI_Bytes = 0,517517+ * and other bit variants had been defined during518518+ * algorithm evaluation. */519519+ RLE_VLI_Bits = 2,520520+};521521+522522+struct p_compressed_bm {523523+ struct p_header head;524524+ /* (encoding & 0x0f): actual encoding, see enum drbd_bitmap_code525525+ * (encoding & 0x80): polarity (set/unset) of first runlength526526+ * ((encoding >> 4) & 0x07): pad_bits, number of trailing zero bits527527+ * used to pad up to head.length bytes528528+ */529529+ u8 encoding;530530+531531+ u8 code[0];532532+} __packed;533533+534534+/* DCBP: Drbd Compressed Bitmap Packet ... 
*/535535+static inline enum drbd_bitmap_code536536+DCBP_get_code(struct p_compressed_bm *p)537537+{538538+ return (enum drbd_bitmap_code)(p->encoding & 0x0f);539539+}540540+541541+static inline void542542+DCBP_set_code(struct p_compressed_bm *p, enum drbd_bitmap_code code)543543+{544544+ BUG_ON(code & ~0xf);545545+ p->encoding = (p->encoding & ~0xf) | code;546546+}547547+548548+static inline int549549+DCBP_get_start(struct p_compressed_bm *p)550550+{551551+ return (p->encoding & 0x80) != 0;552552+}553553+554554+static inline void555555+DCBP_set_start(struct p_compressed_bm *p, int set)556556+{557557+ p->encoding = (p->encoding & ~0x80) | (set ? 0x80 : 0);558558+}559559+560560+static inline int561561+DCBP_get_pad_bits(struct p_compressed_bm *p)562562+{563563+ return (p->encoding >> 4) & 0x7;564564+}565565+566566+static inline void567567+DCBP_set_pad_bits(struct p_compressed_bm *p, int n)568568+{569569+ BUG_ON(n & ~0x7);570570+ p->encoding = (p->encoding & (~0x7 << 4)) | (n << 4);571571+}572572+573573+/* one bitmap packet, including the p_header,574574+ * should fit within one _architecture independend_ page.575575+ * so we need to use the fixed size 4KiB page size576576+ * most architechtures have used for a long time.577577+ */578578+#define BM_PACKET_PAYLOAD_BYTES (4096 - sizeof(struct p_header))579579+#define BM_PACKET_WORDS (BM_PACKET_PAYLOAD_BYTES/sizeof(long))580580+#define BM_PACKET_VLI_BYTES_MAX (4096 - sizeof(struct p_compressed_bm))581581+#if (PAGE_SIZE < 4096)582582+/* drbd_send_bitmap / receive_bitmap would break horribly */583583+#error "PAGE_SIZE too small"584584+#endif585585+586586+union p_polymorph {587587+ struct p_header header;588588+ struct p_handshake handshake;589589+ struct p_data data;590590+ struct p_block_ack block_ack;591591+ struct p_barrier barrier;592592+ struct p_barrier_ack barrier_ack;593593+ struct p_rs_param_89 rs_param_89;594594+ struct p_protocol protocol;595595+ struct p_sizes sizes;596596+ struct p_uuids uuids;597597+ struct 
p_state state;598598+ struct p_req_state req_state;599599+ struct p_req_state_reply req_state_reply;600600+ struct p_block_req block_req;601601+} __packed;602602+603603+/**********************************************************************/604604+enum drbd_thread_state {605605+ None,606606+ Running,607607+ Exiting,608608+ Restarting609609+};610610+611611+struct drbd_thread {612612+ spinlock_t t_lock;613613+ struct task_struct *task;614614+ struct completion stop;615615+ enum drbd_thread_state t_state;616616+ int (*function) (struct drbd_thread *);617617+ struct drbd_conf *mdev;618618+ int reset_cpu_mask;619619+};620620+621621+static inline enum drbd_thread_state get_t_state(struct drbd_thread *thi)622622+{623623+ /* THINK testing the t_state seems to be uncritical in all cases624624+ * (but thread_{start,stop}), so we can read it *without* the lock.625625+ * --lge */626626+627627+ smp_rmb();628628+ return thi->t_state;629629+}630630+631631+632632+/*633633+ * Having this as the first member of a struct provides sort of "inheritance".634634+ * "derived" structs can be "drbd_queue_work()"ed.635635+ * The callback should know and cast back to the descendant struct.636636+ * drbd_request and drbd_epoch_entry are descendants of drbd_work.637637+ */638638+struct drbd_work;639639+typedef int (*drbd_work_cb)(struct drbd_conf *, struct drbd_work *, int cancel);640640+struct drbd_work {641641+ struct list_head list;642642+ drbd_work_cb cb;643643+};644644+645645+struct drbd_tl_epoch;646646+struct drbd_request {647647+ struct drbd_work w;648648+ struct drbd_conf *mdev;649649+650650+ /* if local IO is not allowed, will be NULL.651651+ * if local IO _is_ allowed, holds the locally submitted bio clone,652652+ * or, after local IO completion, the ERR_PTR(error).653653+ * see drbd_endio_pri(). 
*/654654+ struct bio *private_bio;655655+656656+ struct hlist_node colision;657657+ sector_t sector;658658+ unsigned int size;659659+ unsigned int epoch; /* barrier_nr */660660+661661+ /* barrier_nr: used to check on "completion" whether this req was in662662+ * the current epoch, and we therefore have to close it,663663+ * starting a new epoch...664664+ */665665+666666+ /* up to here, the struct layout is identical to drbd_epoch_entry;667667+ * we might be able to use that to our advantage... */668668+669669+ struct list_head tl_requests; /* ring list in the transfer log */670670+ struct bio *master_bio; /* master bio pointer */671671+ unsigned long rq_state; /* see comments above _req_mod() */672672+ int seq_num;673673+ unsigned long start_time;674674+};675675+676676+struct drbd_tl_epoch {677677+ struct drbd_work w;678678+ struct list_head requests; /* requests before */679679+ struct drbd_tl_epoch *next; /* pointer to the next barrier */680680+ unsigned int br_number; /* the barriers identifier. */681681+ int n_req; /* number of requests attached before this barrier */682682+};683683+684684+struct drbd_request;685685+686686+/* These Tl_epoch_entries may be in one of 6 lists:687687+ active_ee .. data packet being written688688+ sync_ee .. syncer block being written689689+ done_ee .. block written, need to send P_WRITE_ACK690690+ read_ee .. [RS]P_DATA_REQUEST being read691691+*/692692+693693+struct drbd_epoch {694694+ struct list_head list;695695+ unsigned int barrier_nr;696696+ atomic_t epoch_size; /* increased on every request added. */697697+ atomic_t active; /* increased on every req. added, and dec on every finished. 
*/698698+ unsigned long flags;699699+};700700+701701+/* drbd_epoch flag bits */702702+enum {703703+ DE_BARRIER_IN_NEXT_EPOCH_ISSUED,704704+ DE_BARRIER_IN_NEXT_EPOCH_DONE,705705+ DE_CONTAINS_A_BARRIER,706706+ DE_HAVE_BARRIER_NUMBER,707707+ DE_IS_FINISHING,708708+};709709+710710+enum epoch_event {711711+ EV_PUT,712712+ EV_GOT_BARRIER_NR,713713+ EV_BARRIER_DONE,714714+ EV_BECAME_LAST,715715+ EV_TRACE_FLUSH, /* TRACE_ are not real events, only used for tracing */716716+ EV_TRACE_ADD_BARRIER, /* Doing the first write as a barrier write */717717+ EV_TRACE_SETTING_BI, /* Barrier is expressed with the first write of the next epoch */718718+ EV_TRACE_ALLOC,719719+ EV_TRACE_FREE,720720+ EV_CLEANUP = 32, /* used as flag */721721+};722722+723723+struct drbd_epoch_entry {724724+ struct drbd_work w;725725+ struct drbd_conf *mdev;726726+ struct bio *private_bio;727727+ struct hlist_node colision;728728+ sector_t sector;729729+ unsigned int size;730730+ struct drbd_epoch *epoch;731731+732732+ /* up to here, the struct layout is identical to drbd_request;733733+ * we might be able to use that to our advantage... 
*/734734+735735+ unsigned int flags;736736+ u64 block_id;737737+};738738+739739+struct drbd_wq_barrier {740740+ struct drbd_work w;741741+ struct completion done;742742+};743743+744744+struct digest_info {745745+ int digest_size;746746+ void *digest;747747+};748748+749749+/* ee flag bits */750750+enum {751751+ __EE_CALL_AL_COMPLETE_IO,752752+ __EE_CONFLICT_PENDING,753753+ __EE_MAY_SET_IN_SYNC,754754+ __EE_IS_BARRIER,755755+};756756+#define EE_CALL_AL_COMPLETE_IO (1<<__EE_CALL_AL_COMPLETE_IO)757757+#define EE_CONFLICT_PENDING (1<<__EE_CONFLICT_PENDING)758758+#define EE_MAY_SET_IN_SYNC (1<<__EE_MAY_SET_IN_SYNC)759759+#define EE_IS_BARRIER (1<<__EE_IS_BARRIER)760760+761761+/* global flag bits */762762+enum {763763+ CREATE_BARRIER, /* next P_DATA is preceeded by a P_BARRIER */764764+ SIGNAL_ASENDER, /* whether asender wants to be interrupted */765765+ SEND_PING, /* whether asender should send a ping asap */766766+767767+ STOP_SYNC_TIMER, /* tell timer to cancel itself */768768+ UNPLUG_QUEUED, /* only relevant with kernel 2.4 */769769+ UNPLUG_REMOTE, /* sending a "UnplugRemote" could help */770770+ MD_DIRTY, /* current uuids and flags not yet on disk */771771+ DISCARD_CONCURRENT, /* Set on one node, cleared on the peer! */772772+ USE_DEGR_WFC_T, /* degr-wfc-timeout instead of wfc-timeout. */773773+ CLUSTER_ST_CHANGE, /* Cluster wide state change going on... */774774+ CL_ST_CHG_SUCCESS,775775+ CL_ST_CHG_FAIL,776776+ CRASHED_PRIMARY, /* This node was a crashed primary.777777+ * Gets cleared when the state.conn778778+ * goes into C_CONNECTED state. 
*/779779+ WRITE_BM_AFTER_RESYNC, /* A kmalloc() during resync failed */780780+ NO_BARRIER_SUPP, /* underlying block device doesn't implement barriers */781781+ CONSIDER_RESYNC,782782+783783+ MD_NO_BARRIER, /* meta data device does not support barriers,784784+ so don't even try */785785+ SUSPEND_IO, /* suspend application io */786786+ BITMAP_IO, /* suspend application io;787787+ once no more io in flight, start bitmap io */788788+ BITMAP_IO_QUEUED, /* Started bitmap IO */789789+ RESYNC_AFTER_NEG, /* Resync after online grow after the attach&negotiate finished. */790790+ NET_CONGESTED, /* The data socket is congested */791791+792792+ CONFIG_PENDING, /* serialization of (re)configuration requests.793793+ * if set, also prevents the device from dying */794794+ DEVICE_DYING, /* device became unconfigured,795795+ * but worker thread is still handling the cleanup.796796+ * reconfiguring (nl_disk_conf, nl_net_conf) is dissalowed,797797+ * while this is set. */798798+ RESIZE_PENDING, /* Size change detected locally, waiting for the response from799799+ * the peer, if it changed there as well. */800800+};801801+802802+struct drbd_bitmap; /* opaque for drbd_conf */803803+804804+/* TODO sort members for performance805805+ * MAYBE group them further */806806+807807+/* THINK maybe we actually want to use the default "event/%s" worker threads808808+ * or similar in linux 2.6, which uses per cpu data and threads.809809+ *810810+ * To be general, this might need a spin_lock member.811811+ * For now, please use the mdev->req_lock to protect list_head,812812+ * see drbd_queue_work below.813813+ */814814+struct drbd_work_queue {815815+ struct list_head q;816816+ struct semaphore s; /* producers up it, worker down()s it */817817+ spinlock_t q_lock; /* to protect the list. 
*/818818+};819819+820820+struct drbd_socket {821821+ struct drbd_work_queue work;822822+ struct mutex mutex;823823+ struct socket *socket;824824+ /* this way we get our825825+ * send/receive buffers off the stack */826826+ union p_polymorph sbuf;827827+ union p_polymorph rbuf;828828+};829829+830830+struct drbd_md {831831+ u64 md_offset; /* sector offset to 'super' block */832832+833833+ u64 la_size_sect; /* last agreed size, unit sectors */834834+ u64 uuid[UI_SIZE];835835+ u64 device_uuid;836836+ u32 flags;837837+ u32 md_size_sect;838838+839839+ s32 al_offset; /* signed relative sector offset to al area */840840+ s32 bm_offset; /* signed relative sector offset to bitmap */841841+842842+ /* u32 al_nr_extents; important for restoring the AL843843+ * is stored into sync_conf.al_extents, which in turn844844+ * gets applied to act_log->nr_elements845845+ */846846+};847847+848848+/* for sync_conf and other types... */849849+#define NL_PACKET(name, number, fields) struct name { fields };850850+#define NL_INTEGER(pn,pr,member) int member;851851+#define NL_INT64(pn,pr,member) __u64 member;852852+#define NL_BIT(pn,pr,member) unsigned member:1;853853+#define NL_STRING(pn,pr,member,len) unsigned char member[len]; int member ## _len;854854+#include "linux/drbd_nl.h"855855+856856+struct drbd_backing_dev {857857+ struct block_device *backing_bdev;858858+ struct block_device *md_bdev;859859+ struct file *lo_file;860860+ struct file *md_file;861861+ struct drbd_md md;862862+ struct disk_conf dc; /* The user provided config... 
*/863863+ sector_t known_size; /* last known size of that backing device */864864+};865865+866866+struct drbd_md_io {867867+ struct drbd_conf *mdev;868868+ struct completion event;869869+ int error;870870+};871871+872872+struct bm_io_work {873873+ struct drbd_work w;874874+ char *why;875875+ int (*io_fn)(struct drbd_conf *mdev);876876+ void (*done)(struct drbd_conf *mdev, int rv);877877+};878878+879879+enum write_ordering_e {880880+ WO_none,881881+ WO_drain_io,882882+ WO_bdev_flush,883883+ WO_bio_barrier884884+};885885+886886+struct drbd_conf {887887+ /* things that are stored as / read from meta data on disk */888888+ unsigned long flags;889889+890890+ /* configured by drbdsetup */891891+ struct net_conf *net_conf; /* protected by get_net_conf() and put_net_conf() */892892+ struct syncer_conf sync_conf;893893+ struct drbd_backing_dev *ldev __protected_by(local);894894+895895+ sector_t p_size; /* partner's disk size */896896+ struct request_queue *rq_queue;897897+ struct block_device *this_bdev;898898+ struct gendisk *vdisk;899899+900900+ struct drbd_socket data; /* data/barrier/cstate/parameter packets */901901+ struct drbd_socket meta; /* ping/ack (metadata) packets */902902+ int agreed_pro_version; /* actually used protocol version */903903+ unsigned long last_received; /* in jiffies, either socket */904904+ unsigned int ko_count;905905+ struct drbd_work resync_work,906906+ unplug_work,907907+ md_sync_work;908908+ struct timer_list resync_timer;909909+ struct timer_list md_sync_timer;910910+911911+ /* Used after attach while negotiating new disk state. */912912+ union drbd_state new_state_tmp;913913+914914+ union drbd_state state;915915+ wait_queue_head_t misc_wait;916916+ wait_queue_head_t state_wait; /* upon each state change. 
*/917917+ unsigned int send_cnt;918918+ unsigned int recv_cnt;919919+ unsigned int read_cnt;920920+ unsigned int writ_cnt;921921+ unsigned int al_writ_cnt;922922+ unsigned int bm_writ_cnt;923923+ atomic_t ap_bio_cnt; /* Requests we need to complete */924924+ atomic_t ap_pending_cnt; /* AP data packets on the wire, ack expected */925925+ atomic_t rs_pending_cnt; /* RS request/data packets on the wire */926926+ atomic_t unacked_cnt; /* Need to send replys for */927927+ atomic_t local_cnt; /* Waiting for local completion */928928+ atomic_t net_cnt; /* Users of net_conf */929929+ spinlock_t req_lock;930930+ struct drbd_tl_epoch *unused_spare_tle; /* for pre-allocation */931931+ struct drbd_tl_epoch *newest_tle;932932+ struct drbd_tl_epoch *oldest_tle;933933+ struct list_head out_of_sequence_requests;934934+ struct hlist_head *tl_hash;935935+ unsigned int tl_hash_s;936936+937937+ /* blocks to sync in this run [unit BM_BLOCK_SIZE] */938938+ unsigned long rs_total;939939+ /* number of sync IOs that failed in this run */940940+ unsigned long rs_failed;941941+ /* Syncer's start time [unit jiffies] */942942+ unsigned long rs_start;943943+ /* cumulated time in PausedSyncX state [unit jiffies] */944944+ unsigned long rs_paused;945945+ /* block not up-to-date at mark [unit BM_BLOCK_SIZE] */946946+ unsigned long rs_mark_left;947947+ /* marks's time [unit jiffies] */948948+ unsigned long rs_mark_time;949949+ /* skipped because csum was equeal [unit BM_BLOCK_SIZE] */950950+ unsigned long rs_same_csum;951951+952952+ /* where does the admin want us to start? (sector) */953953+ sector_t ov_start_sector;954954+ /* where are we now? (sector) */955955+ sector_t ov_position;956956+ /* Start sector of out of sync range (to merge printk reporting). */957957+ sector_t ov_last_oos_start;958958+ /* size of out-of-sync range in sectors. 
 */
	sector_t ov_last_oos_size;
	unsigned long ov_left; /* in bits */
	struct crypto_hash *csums_tfm;
	struct crypto_hash *verify_tfm;

	struct drbd_thread receiver;
	struct drbd_thread worker;
	struct drbd_thread asender;
	struct drbd_bitmap *bitmap;
	unsigned long bm_resync_fo; /* bit offset for drbd_bm_find_next */

	/* Used to track operations of resync... */
	struct lru_cache *resync;
	/* Number of locked elements in resync LRU */
	unsigned int resync_locked;
	/* resync extent number waiting for application requests */
	unsigned int resync_wenr;

	int open_cnt;
	u64 *p_uuid;
	struct drbd_epoch *current_epoch;
	spinlock_t epoch_lock;
	unsigned int epochs;
	enum write_ordering_e write_ordering;
	struct list_head active_ee; /* IO in progress */
	struct list_head sync_ee;   /* IO in progress */
	struct list_head done_ee;   /* send ack */
	struct list_head read_ee;   /* IO in progress */
	struct list_head net_ee;    /* zero-copy network send in progress */
	struct hlist_head *ee_hash; /* is protected by req_lock! */
	unsigned int ee_hash_s;

	/* this one is protected by ee_lock, single thread */
	struct drbd_epoch_entry *last_write_w_barrier;

	int next_barrier_nr;
	struct hlist_head *app_reads_hash; /* is protected by req_lock */
	struct list_head resync_reads;
	atomic_t pp_in_use;
	wait_queue_head_t ee_wait;
	struct page *md_io_page;	/* one page buffer for md_io */
	struct page *md_io_tmpp;	/* for logical_block_size != 512 */
	struct mutex md_io_mutex;	/* protects the md_io_buffer */
	spinlock_t al_lock;
	wait_queue_head_t al_wait;
	struct lru_cache *act_log;	/* activity log */
	unsigned int al_tr_number;
	int al_tr_cycle;
	int al_tr_pos;	 /* position of the next transaction in the journal */
	struct crypto_hash *cram_hmac_tfm;
	struct crypto_hash *integrity_w_tfm; /* to be used by the worker thread */
	struct crypto_hash *integrity_r_tfm; /* to be used by the receiver thread */
	void *int_dig_out;
	void *int_dig_in;
	void *int_dig_vv;
	wait_queue_head_t seq_wait;
	atomic_t packet_seq;
	unsigned int peer_seq;
	spinlock_t peer_seq_lock;
	unsigned int minor;
	unsigned long comm_bm_set; /* communicated number of set bits. */
	cpumask_var_t cpu_mask;
	struct bm_io_work bm_io_work;
	u64 ed_uuid; /* UUID of the exposed data */
	struct mutex state_mutex;
	char congestion_reason;  /* Why we were congested... */
};

static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
{
	struct drbd_conf *mdev;

	mdev = minor < minor_count ? minor_table[minor] : NULL;

	return mdev;
}

static inline unsigned int mdev_to_minor(struct drbd_conf *mdev)
{
	return mdev->minor;
}

/* returns 1 if it was successful,
 * returns 0 if there was no data socket.
 * so wherever you are going to use the data.socket, e.g. do
 * if (!drbd_get_data_sock(mdev))
 *	return 0;
 *	CODE();
 * drbd_put_data_sock(mdev);
 */
static inline int drbd_get_data_sock(struct drbd_conf *mdev)
{
	mutex_lock(&mdev->data.mutex);
	/* drbd_disconnect() could have called drbd_free_sock()
	 * while we were waiting in down()... */
	if (unlikely(mdev->data.socket == NULL)) {
		mutex_unlock(&mdev->data.mutex);
		return 0;
	}
	return 1;
}

static inline void drbd_put_data_sock(struct drbd_conf *mdev)
{
	mutex_unlock(&mdev->data.mutex);
}

/*
 * function declarations
 *************************/

/* drbd_main.c */

enum chg_state_flags {
	CS_HARD	= 1,
	CS_VERBOSE = 2,
	CS_WAIT_COMPLETE = 4,
	CS_SERIALIZE = 8,
	CS_ORDERED = CS_WAIT_COMPLETE + CS_SERIALIZE,
};

extern void drbd_init_set_defaults(struct drbd_conf *mdev);
extern int drbd_change_state(struct drbd_conf *mdev, enum chg_state_flags f,
			union drbd_state mask, union drbd_state val);
extern void drbd_force_state(struct drbd_conf *, union drbd_state,
			union drbd_state);
extern int _drbd_request_state(struct drbd_conf *, union drbd_state,
			union drbd_state, enum chg_state_flags);
extern int __drbd_set_state(struct drbd_conf *, union drbd_state,
			enum chg_state_flags, struct completion *done);
extern void print_st_err(struct drbd_conf *,
			union drbd_state,
			union drbd_state, int);
extern int drbd_thread_start(struct drbd_thread *thi);
extern void _drbd_thread_stop(struct drbd_thread *thi, int restart, int wait);
#ifdef CONFIG_SMP
extern void drbd_thread_current_set_cpu(struct drbd_conf *mdev);
extern void drbd_calc_cpu_mask(struct drbd_conf *mdev);
#else
#define drbd_thread_current_set_cpu(A) ({})
#define drbd_calc_cpu_mask(A) ({})
#endif
extern void drbd_free_resources(struct drbd_conf *mdev);
extern void tl_release(struct drbd_conf *mdev, unsigned int barrier_nr,
		unsigned int set_size);
extern void tl_clear(struct drbd_conf *mdev);
extern void _tl_add_barrier(struct drbd_conf *, struct drbd_tl_epoch *);
extern void drbd_free_sock(struct drbd_conf *mdev);
extern int drbd_send(struct drbd_conf *mdev, struct socket *sock,
		void *buf, size_t size, unsigned msg_flags);
extern int drbd_send_protocol(struct drbd_conf *mdev);
extern int drbd_send_uuids(struct drbd_conf *mdev);
extern int drbd_send_uuids_skip_initial_sync(struct drbd_conf *mdev);
extern int drbd_send_sync_uuid(struct drbd_conf *mdev, u64 val);
extern int drbd_send_sizes(struct drbd_conf *mdev, int trigger_reply);
extern int _drbd_send_state(struct drbd_conf *mdev);
extern int drbd_send_state(struct drbd_conf *mdev);
extern int _drbd_send_cmd(struct drbd_conf *mdev, struct socket *sock,
			enum drbd_packets cmd, struct p_header *h,
			size_t size, unsigned msg_flags);
#define USE_DATA_SOCKET	1
#define USE_META_SOCKET	0
extern int drbd_send_cmd(struct drbd_conf *mdev, int use_data_socket,
			enum drbd_packets cmd, struct p_header *h,
			size_t size);
extern int drbd_send_cmd2(struct drbd_conf *mdev, enum drbd_packets cmd,
			char *data, size_t size);
extern int drbd_send_sync_param(struct drbd_conf *mdev, struct syncer_conf *sc);
extern int drbd_send_b_ack(struct drbd_conf *mdev, u32 barrier_nr,
			u32 set_size);
extern int drbd_send_ack(struct drbd_conf *mdev, enum drbd_packets cmd,
			struct drbd_epoch_entry *e);
extern int drbd_send_ack_rp(struct drbd_conf *mdev, enum drbd_packets cmd,
			struct p_block_req *rp);
extern int drbd_send_ack_dp(struct drbd_conf *mdev, enum drbd_packets cmd,
			struct p_data *dp);
extern int drbd_send_ack_ex(struct drbd_conf *mdev, enum drbd_packets cmd,
			sector_t sector, int blksize, u64 block_id);
extern int drbd_send_block(struct drbd_conf *mdev, enum drbd_packets cmd,
			struct drbd_epoch_entry *e);
extern int drbd_send_dblock(struct drbd_conf *mdev, struct drbd_request *req);
extern int _drbd_send_barrier(struct drbd_conf *mdev,
			struct drbd_tl_epoch *barrier);
extern int drbd_send_drequest(struct drbd_conf *mdev, int cmd,
			sector_t sector, int size, u64 block_id);
extern int drbd_send_drequest_csum(struct drbd_conf *mdev,
			sector_t sector, int size,
			void *digest, int digest_size,
			enum drbd_packets cmd);
extern int drbd_send_ov_request(struct drbd_conf *mdev, sector_t sector, int size);

extern int drbd_send_bitmap(struct drbd_conf *mdev);
extern int _drbd_send_bitmap(struct drbd_conf *mdev);
extern int drbd_send_sr_reply(struct drbd_conf *mdev, int retcode);
extern void drbd_free_bc(struct drbd_backing_dev *ldev);
extern void drbd_mdev_cleanup(struct drbd_conf *mdev);

/* drbd_meta-data.c (still in drbd_main.c) */
extern void drbd_md_sync(struct drbd_conf *mdev);
extern int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev);
/* maybe define them below as inline?
 */
extern void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
extern void _drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
extern void drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local);
extern void _drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local);
extern void drbd_uuid_set_bm(struct drbd_conf *mdev, u64 val) __must_hold(local);
extern void drbd_md_set_flag(struct drbd_conf *mdev, int flags) __must_hold(local);
extern void drbd_md_clear_flag(struct drbd_conf *mdev, int flags) __must_hold(local);
extern int drbd_md_test_flag(struct drbd_backing_dev *, int);
extern void drbd_md_mark_dirty(struct drbd_conf *mdev);
extern void drbd_queue_bitmap_io(struct drbd_conf *mdev,
				 int (*io_fn)(struct drbd_conf *),
				 void (*done)(struct drbd_conf *, int),
				 char *why);
extern int drbd_bmio_set_n_write(struct drbd_conf *mdev);
extern int drbd_bmio_clear_n_write(struct drbd_conf *mdev);
extern int drbd_bitmap_io(struct drbd_conf *mdev, int (*io_fn)(struct drbd_conf *), char *why);


/* Meta data layout
   We reserve a 128MB Block (4k aligned)
   * either at the end of the backing device
   * or on a separate meta data device. */

#define MD_RESERVED_SECT (128LU << 11)  /* 128 MB, unit sectors */
/* The following numbers are sectors */
#define MD_AL_OFFSET 8	    /* 8 Sectors after start of meta area */
#define MD_AL_MAX_SIZE 64   /* = 32 kb LOG  ~ 3776 extents ~ 14 GB Storage */
/* Allows up to about 3.8TB */
#define MD_BM_OFFSET (MD_AL_OFFSET + MD_AL_MAX_SIZE)

/* Since the smallest IO unit is usually 512 byte */
#define MD_SECTOR_SHIFT	 9
#define MD_SECTOR_SIZE	 (1<<MD_SECTOR_SHIFT)

/* activity log */
#define AL_EXTENTS_PT ((MD_SECTOR_SIZE-12)/8-1) /* 61 ; Extents per 512B sector */
#define AL_EXTENT_SHIFT 22		 /* One extent represents 4M Storage */
#define AL_EXTENT_SIZE (1<<AL_EXTENT_SHIFT)

#if BITS_PER_LONG == 32
#define LN2_BPL 5
#define cpu_to_lel(A) cpu_to_le32(A)
#define lel_to_cpu(A) le32_to_cpu(A)
#elif BITS_PER_LONG == 64
#define LN2_BPL 6
#define cpu_to_lel(A) cpu_to_le64(A)
#define lel_to_cpu(A) le64_to_cpu(A)
#else
#error "LN2 of BITS_PER_LONG unknown!"
#endif

/* resync bitmap */
/* 16MB sized 'bitmap extent' to track syncer usage */
struct bm_extent {
	int rs_left; /* number of bits set (out of sync) in this extent. */
	int rs_failed; /* number of failed resync requests in this extent. */
	unsigned long flags;
	struct lc_element lce;
};

#define BME_NO_WRITES  0  /* bm_extent.flags: no more requests on this one! */
#define BME_LOCKED     1  /* bm_extent.flags: syncer active on this one. */

/* drbd_bitmap.c */
/*
 * We need to store one bit for a block.
 * Example: 1GB disk @ 4096 byte blocks ==> we need 32 KB bitmap.
 * Bit 0 ==> local node thinks this block is binary identical on both nodes
 * Bit 1 ==> local node thinks this block needs to be synced.
 */

#define BM_BLOCK_SHIFT	12			 /* 4k per bit */
#define BM_BLOCK_SIZE	 (1<<BM_BLOCK_SHIFT)
/* (9+3) : 512 bytes @ 8 bits; representing 16M storage
 * per sector of on disk bitmap */
#define BM_EXT_SHIFT	 (BM_BLOCK_SHIFT + MD_SECTOR_SHIFT + 3)  /* = 24 */
#define BM_EXT_SIZE	 (1<<BM_EXT_SHIFT)

#if (BM_EXT_SHIFT != 24) || (BM_BLOCK_SHIFT != 12)
#error "HAVE YOU FIXED drbdmeta AS WELL??"
#endif

/* thus many _storage_ sectors are described by one bit */
#define BM_SECT_TO_BIT(x)   ((x)>>(BM_BLOCK_SHIFT-9))
#define BM_BIT_TO_SECT(x)   ((sector_t)(x)<<(BM_BLOCK_SHIFT-9))
#define BM_SECT_PER_BIT     BM_BIT_TO_SECT(1)

/* bit to represented kilo byte conversion */
#define Bit2KB(bits) ((bits)<<(BM_BLOCK_SHIFT-10))

/* in which _bitmap_ extent (resp. sector) the bit for a certain
 * _storage_ sector is located in */
#define BM_SECT_TO_EXT(x)   ((x)>>(BM_EXT_SHIFT-9))

/* how many _storage_ sectors we have per bitmap sector */
#define BM_EXT_TO_SECT(x)   ((sector_t)(x) << (BM_EXT_SHIFT-9))
#define BM_SECT_PER_EXT     BM_EXT_TO_SECT(1)

/* in one sector of the bitmap, we have this many activity_log extents. */
#define AL_EXT_PER_BM_SECT  (1 << (BM_EXT_SHIFT - AL_EXTENT_SHIFT))
#define BM_WORDS_PER_AL_EXT (1 << (AL_EXTENT_SHIFT-BM_BLOCK_SHIFT-LN2_BPL))

#define BM_BLOCKS_PER_BM_EXT_B (BM_EXT_SHIFT - BM_BLOCK_SHIFT)
#define BM_BLOCKS_PER_BM_EXT_MASK  ((1<<BM_BLOCKS_PER_BM_EXT_B) - 1)

/* the extent in "PER_EXTENT" below is an activity log extent
 * we need that many (long words/bytes) to store the bitmap
 *		     of one AL_EXTENT_SIZE chunk of storage.
 * we can store the bitmap for that many AL_EXTENTS within
 * one sector of the _on_disk_ bitmap:
 * bit	 0	  bit 37	bit 38		  bit (512*8)-1
 *	     ...|........|........|.. // ..|........|
 * sect. 0	 `296	  `304			   ^(512*8*8)-1
 *
#define BM_WORDS_PER_EXT    ( (AL_EXT_SIZE/BM_BLOCK_SIZE) / BITS_PER_LONG )
#define BM_BYTES_PER_EXT    ( (AL_EXT_SIZE/BM_BLOCK_SIZE) / 8 )  // 128
#define BM_EXT_PER_SECT	    ( 512 / BM_BYTES_PER_EXTENT )	 //   4
 */

#define DRBD_MAX_SECTORS_32 (0xffffffffLU)
#define DRBD_MAX_SECTORS_BM \
	((MD_RESERVED_SECT - MD_BM_OFFSET) * (1LL<<(BM_EXT_SHIFT-9)))
#if DRBD_MAX_SECTORS_BM < DRBD_MAX_SECTORS_32
#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_BM
#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_BM
#elif !defined(CONFIG_LBD) && BITS_PER_LONG == 32
#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_32
#define DRBD_MAX_SECTORS_FLEX DRBD_MAX_SECTORS_32
#else
#define DRBD_MAX_SECTORS      DRBD_MAX_SECTORS_BM
/* 16 TB in units of sectors */
#if BITS_PER_LONG == 32
/* adjust by one page worth of bitmap,
 * so we won't wrap around in drbd_bm_find_next_bit.
 * you should use 64bit OS for that much storage, anyways.
 */
#define DRBD_MAX_SECTORS_FLEX BM_BIT_TO_SECT(0xffff7fff)
#else
#define DRBD_MAX_SECTORS_FLEX BM_BIT_TO_SECT(0x1LU << 32)
#endif
#endif

/* Sector shift value for the "hash" functions of tl_hash and ee_hash tables.
 * With a value of 6 all IO in one 32K block makes it to the same slot of the
 * hash table. */
#define HT_SHIFT 6
#define DRBD_MAX_SEGMENT_SIZE (1U<<(9+HT_SHIFT))

/* Number of elements in the app_reads_hash */
#define APP_R_HSIZE 15

extern int  drbd_bm_init(struct drbd_conf *mdev);
extern int  drbd_bm_resize(struct drbd_conf *mdev, sector_t sectors);
extern void drbd_bm_cleanup(struct drbd_conf *mdev);
extern void drbd_bm_set_all(struct drbd_conf *mdev);
extern void drbd_bm_clear_all(struct drbd_conf *mdev);
extern int  drbd_bm_set_bits(
		struct drbd_conf *mdev, unsigned long s, unsigned long e);
extern int  drbd_bm_clear_bits(
		struct drbd_conf *mdev, unsigned long s, unsigned long e);
/* bm_set_bits variant for use while holding drbd_bm_lock */
extern void _drbd_bm_set_bits(struct drbd_conf *mdev,
		const unsigned long s, const unsigned long e);
extern int  drbd_bm_test_bit(struct drbd_conf *mdev, unsigned long bitnr);
extern int  drbd_bm_e_weight(struct drbd_conf *mdev, unsigned long enr);
extern int  drbd_bm_write_sect(struct drbd_conf *mdev, unsigned long enr) __must_hold(local);
extern int  drbd_bm_read(struct drbd_conf *mdev) __must_hold(local);
extern int  drbd_bm_write(struct drbd_conf *mdev) __must_hold(local);
extern unsigned long drbd_bm_ALe_set_all(struct drbd_conf *mdev,
		unsigned long al_enr);
extern size_t	     drbd_bm_words(struct drbd_conf *mdev);
extern unsigned long drbd_bm_bits(struct drbd_conf *mdev);
extern sector_t      drbd_bm_capacity(struct drbd_conf *mdev);
extern unsigned long drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo);
/* bm_find_next variants for use while you hold drbd_bm_lock() */
extern unsigned long _drbd_bm_find_next(struct drbd_conf *mdev, unsigned long bm_fo);
extern unsigned long _drbd_bm_find_next_zero(struct drbd_conf *mdev, unsigned long bm_fo);
extern unsigned long drbd_bm_total_weight(struct drbd_conf *mdev);
extern int drbd_bm_rs_done(struct drbd_conf *mdev);
/* for receive_bitmap */
extern void drbd_bm_merge_lel(struct drbd_conf *mdev, size_t offset,
		size_t number, unsigned long *buffer);
/* for _drbd_send_bitmap and drbd_bm_write_sect */
extern void drbd_bm_get_lel(struct drbd_conf *mdev, size_t offset,
		size_t number, unsigned long *buffer);

extern void drbd_bm_lock(struct drbd_conf *mdev, char *why);
extern void drbd_bm_unlock(struct drbd_conf *mdev);

extern int drbd_bm_count_bits(struct drbd_conf *mdev, const unsigned long s, const unsigned long e);
/* drbd_main.c */

extern struct kmem_cache *drbd_request_cache;
extern struct kmem_cache *drbd_ee_cache;	/* epoch entries */
extern struct kmem_cache *drbd_bm_ext_cache;	/* bitmap extents */
extern struct kmem_cache *drbd_al_ext_cache;	/* activity log extents */
extern mempool_t *drbd_request_mempool;
extern mempool_t *drbd_ee_mempool;

extern struct page *drbd_pp_pool; /* drbd's page pool */
extern spinlock_t   drbd_pp_lock;
extern int	    drbd_pp_vacant;
extern wait_queue_head_t drbd_pp_wait;

extern rwlock_t global_state_lock;

extern struct drbd_conf *drbd_new_device(unsigned int minor);
extern void drbd_free_mdev(struct drbd_conf *mdev);

extern int proc_details;

/* drbd_req */
extern int drbd_make_request_26(struct request_queue *q, struct bio *bio);
extern int drbd_read_remote(struct drbd_conf *mdev, struct drbd_request *req);
extern int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec);
extern int is_valid_ar_handle(struct drbd_request *, sector_t);


/* drbd_nl.c */
extern void drbd_suspend_io(struct drbd_conf *mdev);
extern void drbd_resume_io(struct drbd_conf *mdev);
extern char *ppsize(char *buf, unsigned long long size);
extern sector_t drbd_new_dev_size(struct drbd_conf *,
		struct drbd_backing_dev *);
enum determine_dev_size { dev_size_error = -1, unchanged = 0, shrunk = 1, grew = 2 };
extern enum determine_dev_size drbd_determin_dev_size(struct drbd_conf *) __must_hold(local);
extern void resync_after_online_grow(struct drbd_conf *);
extern void drbd_setup_queue_param(struct drbd_conf *mdev, unsigned int) __must_hold(local);
extern int drbd_set_role(struct drbd_conf *mdev, enum drbd_role new_role,
		int force);
enum drbd_disk_state drbd_try_outdate_peer(struct drbd_conf *mdev);
extern int drbd_khelper(struct drbd_conf *mdev, char *cmd);

/* drbd_worker.c */
extern int drbd_worker(struct drbd_thread *thi);
extern int drbd_alter_sa(struct drbd_conf *mdev, int na);
extern void drbd_start_resync(struct drbd_conf *mdev, enum drbd_conns side);
extern void resume_next_sg(struct drbd_conf *mdev);
extern void suspend_other_sg(struct drbd_conf *mdev);
extern int drbd_resync_finished(struct drbd_conf *mdev);
/* maybe rather drbd_main.c ?
 */
extern int drbd_md_sync_page_io(struct drbd_conf *mdev,
		struct drbd_backing_dev *bdev, sector_t sector, int rw);
extern void drbd_ov_oos_found(struct drbd_conf *, sector_t, int);

static inline void ov_oos_print(struct drbd_conf *mdev)
{
	if (mdev->ov_last_oos_size) {
		dev_err(DEV, "Out of sync: start=%llu, size=%lu (sectors)\n",
		     (unsigned long long)mdev->ov_last_oos_start,
		     (unsigned long)mdev->ov_last_oos_size);
	}
	mdev->ov_last_oos_size = 0;
}


extern void drbd_csum(struct drbd_conf *, struct crypto_hash *, struct bio *, void *);
/* worker callbacks */
extern int w_req_cancel_conflict(struct drbd_conf *, struct drbd_work *, int);
extern int w_read_retry_remote(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_end_data_req(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_end_rsdata_req(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_end_csum_rs_req(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_end_ov_reply(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_end_ov_req(struct drbd_conf *, struct drbd_work *, int);
extern int w_ov_finished(struct drbd_conf *, struct drbd_work *, int);
extern int w_resync_inactive(struct drbd_conf *, struct drbd_work *, int);
extern int w_resume_next_sg(struct drbd_conf *, struct drbd_work *, int);
extern int w_io_error(struct drbd_conf *, struct drbd_work *, int);
extern int w_send_write_hint(struct drbd_conf *, struct drbd_work *, int);
extern int w_make_resync_request(struct drbd_conf *, struct drbd_work *, int);
extern int w_send_dblock(struct drbd_conf *, struct drbd_work *, int);
extern int w_send_barrier(struct drbd_conf *, struct drbd_work *, int);
extern int w_send_read_req(struct drbd_conf *, struct drbd_work *, int);
extern int w_prev_work_done(struct drbd_conf *, struct drbd_work *, int);
extern int w_e_reissue(struct drbd_conf *, struct drbd_work *, int);

extern void resync_timer_fn(unsigned long data);

/* drbd_receiver.c */
extern int drbd_release_ee(struct drbd_conf *mdev, struct list_head *list);
extern struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
					    u64 id,
					    sector_t sector,
					    unsigned int data_size,
					    gfp_t gfp_mask) __must_hold(local);
extern void drbd_free_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e);
extern void drbd_wait_ee_list_empty(struct drbd_conf *mdev,
		struct list_head *head);
extern void _drbd_wait_ee_list_empty(struct drbd_conf *mdev,
		struct list_head *head);
extern void drbd_set_recv_tcq(struct drbd_conf *mdev, int tcq_enabled);
extern void _drbd_clear_done_ee(struct drbd_conf *mdev, struct list_head *to_be_freed);
extern void drbd_flush_workqueue(struct drbd_conf *mdev);

/* yes, there is kernel_setsockopt, but only since 2.6.18. we don't need to
 * mess with get_fs/set_fs, we know we are KERNEL_DS always.
 */
static inline int drbd_setsockopt(struct socket *sock, int level, int optname,
			char __user *optval, int optlen)
{
	int err;
	if (level == SOL_SOCKET)
		err = sock_setsockopt(sock, level, optname, optval, optlen);
	else
		err = sock->ops->setsockopt(sock, level, optname, optval,
					    optlen);
	return err;
}

static inline void drbd_tcp_cork(struct socket *sock)
{
	int __user val = 1;
	(void) drbd_setsockopt(sock, SOL_TCP, TCP_CORK,
			(char __user *)&val, sizeof(val));
}

static inline void drbd_tcp_uncork(struct socket *sock)
{
	int __user val = 0;
	(void) drbd_setsockopt(sock, SOL_TCP, TCP_CORK,
			(char __user *)&val, sizeof(val));
}

static inline void drbd_tcp_nodelay(struct socket *sock)
{
	int __user val = 1;
	(void) drbd_setsockopt(sock, SOL_TCP, TCP_NODELAY,
			(char __user *)&val, sizeof(val));
}

static inline void drbd_tcp_quickack(struct socket *sock)
{
	int __user val = 1;
	(void) drbd_setsockopt(sock, SOL_TCP, TCP_QUICKACK,
			(char __user *)&val, sizeof(val));
}

void drbd_bump_write_ordering(struct drbd_conf *mdev, enum write_ordering_e wo);

/* drbd_proc.c */
extern struct proc_dir_entry *drbd_proc;
extern struct file_operations drbd_proc_fops;
extern const char *drbd_conn_str(enum drbd_conns s);
extern const char *drbd_role_str(enum drbd_role s);

/* drbd_actlog.c */
extern void drbd_al_begin_io(struct drbd_conf *mdev, sector_t sector);
extern void drbd_al_complete_io(struct drbd_conf *mdev, sector_t sector);
extern void drbd_rs_complete_io(struct drbd_conf *mdev, sector_t sector);
extern int drbd_rs_begin_io(struct drbd_conf *mdev, sector_t
		sector);
extern int drbd_try_rs_begin_io(struct drbd_conf *mdev, sector_t sector);
extern void drbd_rs_cancel_all(struct drbd_conf *mdev);
extern int drbd_rs_del_all(struct drbd_conf *mdev);
extern void drbd_rs_failed_io(struct drbd_conf *mdev,
		sector_t sector, int size);
extern int drbd_al_read_log(struct drbd_conf *mdev, struct drbd_backing_dev *);
extern void __drbd_set_in_sync(struct drbd_conf *mdev, sector_t sector,
		int size, const char *file, const unsigned int line);
#define drbd_set_in_sync(mdev, sector, size) \
	__drbd_set_in_sync(mdev, sector, size, __FILE__, __LINE__)
extern void __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector,
		int size, const char *file, const unsigned int line);
#define drbd_set_out_of_sync(mdev, sector, size) \
	__drbd_set_out_of_sync(mdev, sector, size, __FILE__, __LINE__)
extern void drbd_al_apply_to_bm(struct drbd_conf *mdev);
extern void drbd_al_to_on_disk_bm(struct drbd_conf *mdev);
extern void drbd_al_shrink(struct drbd_conf *mdev);


/* drbd_nl.c */

void drbd_nl_cleanup(void);
int __init drbd_nl_init(void);
void drbd_bcast_state(struct drbd_conf *mdev, union drbd_state);
void drbd_bcast_sync_progress(struct drbd_conf *mdev);
void drbd_bcast_ee(struct drbd_conf *mdev,
		const char *reason, const int dgs,
		const char *seen_hash, const char *calc_hash,
		const struct drbd_epoch_entry *e);


/**
 * DOC: DRBD State macros
 *
 * These macros are used to express state changes in easily readable form.
 *
 * The NS macros expand to a mask and a value that can be OR-ed onto the
 * current state as soon as the spinlock (req_lock) was taken.
 *
 * The _NS macros are used for state functions that get
called with the
 * spinlock held. These macros expand directly to the new state value.
 *
 * Besides the basic forms NS() and _NS() additional _?NS[23] are defined
 * to express state changes that affect more than one aspect of the state.
 *
 * E.g. NS2(conn, C_CONNECTED, peer, R_SECONDARY)
 * means that the network connection was established and that the peer
 * is in secondary role.
 */
#define role_MASK R_MASK
#define peer_MASK R_MASK
#define disk_MASK D_MASK
#define pdsk_MASK D_MASK
#define conn_MASK C_MASK
#define susp_MASK 1
#define user_isp_MASK 1
#define aftr_isp_MASK 1

#define NS(T, S) \
	({ union drbd_state mask; mask.i = 0; mask.T = T##_MASK; mask; }), \
	({ union drbd_state val; val.i = 0; val.T = (S); val; })
#define NS2(T1, S1, T2, S2) \
	({ union drbd_state mask; mask.i = 0; mask.T1 = T1##_MASK; \
	  mask.T2 = T2##_MASK; mask; }), \
	({ union drbd_state val; val.i = 0; val.T1 = (S1); \
	  val.T2 = (S2); val; })
#define NS3(T1, S1, T2, S2, T3, S3) \
	({ union drbd_state mask; mask.i = 0; mask.T1 = T1##_MASK; \
	  mask.T2 = T2##_MASK; mask.T3 = T3##_MASK; mask; }), \
	({ union drbd_state val; val.i = 0; val.T1 = (S1); \
	  val.T2 = (S2); val.T3 = (S3); val; })

#define _NS(D, T, S) \
	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T = (S); __ns; })
#define _NS2(D, T1, S1, T2, S2) \
	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T1 = (S1); \
	  __ns.T2 = (S2); __ns; })
#define _NS3(D, T1, S1, T2, S2, T3, S3) \
	D, ({ union drbd_state __ns; __ns.i = D->state.i; __ns.T1 = (S1); \
	  __ns.T2 = (S2); __ns.T3 = (S3); __ns; })

/*
 * inline helper functions
 *************************/

static
inline void drbd_state_lock(struct drbd_conf *mdev)
{
	wait_event(mdev->misc_wait,
		   !test_and_set_bit(CLUSTER_ST_CHANGE, &mdev->flags));
}

static inline void drbd_state_unlock(struct drbd_conf *mdev)
{
	clear_bit(CLUSTER_ST_CHANGE, &mdev->flags);
	wake_up(&mdev->misc_wait);
}

static inline int _drbd_set_state(struct drbd_conf *mdev,
				  union drbd_state ns, enum chg_state_flags flags,
				  struct completion *done)
{
	int rv;

	read_lock(&global_state_lock);
	rv = __drbd_set_state(mdev, ns, flags, done);
	read_unlock(&global_state_lock);

	return rv;
}

/**
 * drbd_request_state() - Request a state change
 * @mdev:	DRBD device.
 * @mask:	mask of state bits to change.
 * @val:	value of new state bits.
 *
 * This is the most graceful way of requesting a state change. It is
 * quite verbose in case the state change is not possible, and all those
 * state changes are globally serialized.
 */
static inline int drbd_request_state(struct drbd_conf *mdev,
				     union drbd_state mask,
				     union drbd_state val)
{
	return _drbd_request_state(mdev, mask, val, CS_VERBOSE + CS_ORDERED);
}

#define __drbd_chk_io_error(m,f) __drbd_chk_io_error_(m,f, __func__)
static inline void __drbd_chk_io_error_(struct drbd_conf *mdev, int forcedetach, const char *where)
{
	switch (mdev->ldev->dc.on_io_error) {
	case EP_PASS_ON:
		if (!forcedetach) {
			if (printk_ratelimit())
				dev_err(DEV, "Local IO failed in %s. "
					"Passing error on...\n", where);
			break;
		}
		/* NOTE fall through to detach case if forcedetach set */
	case EP_DETACH:
	case EP_CALL_HELPER:
		if (mdev->state.disk > D_FAILED) {
			_drbd_set_state(_NS(mdev, disk, D_FAILED), CS_HARD, NULL);
			dev_err(DEV, "Local IO failed in %s. "
				"Detaching...\n", where);
		}
		break;
	}
}

/**
 * drbd_chk_io_error: Handle the on_io_error setting, should be called from all io completion handlers
 * @mdev:	DRBD device.
 * @error:	Error code passed to the IO completion callback
 * @forcedetach: Force detach. I.e. the error happened while accessing the meta data
 *
 * See also drbd_main.c:after_state_ch() if (os.disk > D_FAILED && ns.disk == D_FAILED)
 */
#define drbd_chk_io_error(m,e,f) drbd_chk_io_error_(m,e,f, __func__)
static inline void drbd_chk_io_error_(struct drbd_conf *mdev,
	int error, int forcedetach, const char *where)
{
	if (error) {
		unsigned long flags;
		spin_lock_irqsave(&mdev->req_lock, flags);
		__drbd_chk_io_error_(mdev, forcedetach, where);
		spin_unlock_irqrestore(&mdev->req_lock, flags);
	}
}


/**
 * drbd_md_first_sector() - Returns the first sector number of the meta data area
 * @bdev:	Meta data block device.
 *
 * BTW, for internal meta data, this happens to be the maximum capacity
 * we could agree upon with our peer node.
 */
static inline sector_t drbd_md_first_sector(struct drbd_backing_dev *bdev)
{
	switch (bdev->dc.meta_dev_idx) {
	case DRBD_MD_INDEX_INTERNAL:
	case DRBD_MD_INDEX_FLEX_INT:
		return bdev->md.md_offset + bdev->md.bm_offset;
	case DRBD_MD_INDEX_FLEX_EXT:
	default:
		return bdev->md.md_offset;
	}
}

/**
 * drbd_md_last_sector() - Return the last sector number of the meta data area
 * @bdev:	Meta data block device.
 */
static inline sector_t drbd_md_last_sector(struct drbd_backing_dev *bdev)
{
	switch (bdev->dc.meta_dev_idx) {
	case DRBD_MD_INDEX_INTERNAL:
	case DRBD_MD_INDEX_FLEX_INT:
		return bdev->md.md_offset + MD_AL_OFFSET - 1;
	case DRBD_MD_INDEX_FLEX_EXT:
	default:
		return bdev->md.md_offset + bdev->md.md_size_sect;
	}
}

/* Returns the number of 512 byte sectors of the device */
static inline sector_t
drbd_get_capacity(struct block_device *bdev)
{
	/* return bdev ? get_capacity(bdev->bd_disk) : 0; */
	return bdev ? bdev->bd_inode->i_size >> 9 : 0;
}

/**
 * drbd_get_max_capacity() - Returns the capacity we announce to our peer
 * @bdev:	Meta data block device.
 *
 * returns the capacity we announce to our peer. we clip ourselves at the
 * various MAX_SECTORS, because if we don't, current implementation will
 * oops sooner or later
 */
static inline sector_t drbd_get_max_capacity(struct drbd_backing_dev *bdev)
{
	sector_t s;
	switch (bdev->dc.meta_dev_idx) {
	case DRBD_MD_INDEX_INTERNAL:
	case DRBD_MD_INDEX_FLEX_INT:
		s = drbd_get_capacity(bdev->backing_bdev)
			? min_t(sector_t, DRBD_MAX_SECTORS_FLEX,
				drbd_md_first_sector(bdev))
			: 0;
		break;
	case DRBD_MD_INDEX_FLEX_EXT:
		s = min_t(sector_t, DRBD_MAX_SECTORS_FLEX,
				drbd_get_capacity(bdev->backing_bdev));
		/* clip at maximum size the meta device can support */
		s = min_t(sector_t, s,
			BM_EXT_TO_SECT(bdev->md.md_size_sect
				       - bdev->md.bm_offset));
		break;
	default:
		s = min_t(sector_t, DRBD_MAX_SECTORS,
				drbd_get_capacity(bdev->backing_bdev));
	}
	return s;
}

/**
 * drbd_md_ss__() - Return the sector number of our meta data super block
 * @mdev:	DRBD device.
 * @bdev:	Meta data block device.
 */
static inline sector_t drbd_md_ss__(struct drbd_conf *mdev,
				    struct drbd_backing_dev *bdev)
{
	switch (bdev->dc.meta_dev_idx) {
	default: /* external, some index */
		return MD_RESERVED_SECT * bdev->dc.meta_dev_idx;
	case DRBD_MD_INDEX_INTERNAL:
		/* with drbd08, internal meta data is always "flexible" */
	case DRBD_MD_INDEX_FLEX_INT:
		/* sizeof(struct md_on_disk_07) == 4k
		 * position: last 4k aligned block of 4k size */
		if (!bdev->backing_bdev) {
			if (__ratelimit(&drbd_ratelimit_state)) {
				dev_err(DEV, "bdev->backing_bdev==NULL\n");
				dump_stack();
			}
			return 0;
		}
		return (drbd_get_capacity(bdev->backing_bdev) & ~7ULL)
			- MD_AL_OFFSET;
	case DRBD_MD_INDEX_FLEX_EXT:
		return 0;
	}
}

static inline void
_drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w)
{
	list_add_tail(&w->list, &q->q);
	up(&q->s);
}

static inline void
drbd_queue_work_front(struct drbd_work_queue *q, struct drbd_work *w)
{
	unsigned long flags;
	spin_lock_irqsave(&q->q_lock, flags);
	list_add(&w->list, &q->q);
	up(&q->s); /* within the spinlock,
		      see comment near end of drbd_worker() */
	spin_unlock_irqrestore(&q->q_lock, flags);
}

static inline void
drbd_queue_work(struct drbd_work_queue *q, struct drbd_work *w)
{
	unsigned long flags;
	spin_lock_irqsave(&q->q_lock, flags);
	list_add_tail(&w->list, &q->q);
	up(&q->s); /* within the spinlock,
		      see comment near end of drbd_worker() */
	spin_unlock_irqrestore(&q->q_lock, flags);
}

static inline void wake_asender(struct drbd_conf *mdev)
{
	if (test_bit(SIGNAL_ASENDER, &mdev->flags))
		force_sig(DRBD_SIG, mdev->asender.task);
}

static inline void request_ping(struct drbd_conf *mdev)
{
	set_bit(SEND_PING, &mdev->flags);
	wake_asender(mdev);
}

static inline int drbd_send_short_cmd(struct drbd_conf *mdev,
	enum drbd_packets cmd)
{
	struct p_header h;
	return drbd_send_cmd(mdev, USE_DATA_SOCKET, cmd, &h, sizeof(h));
}

static inline int drbd_send_ping(struct drbd_conf *mdev)
{
	struct p_header h;
	return drbd_send_cmd(mdev, USE_META_SOCKET, P_PING, &h, sizeof(h));
}

static inline int drbd_send_ping_ack(struct drbd_conf *mdev)
{
	struct p_header h;
	return drbd_send_cmd(mdev, USE_META_SOCKET, P_PING_ACK, &h, sizeof(h));
}

static inline void drbd_thread_stop(struct drbd_thread *thi)
{
	_drbd_thread_stop(thi, FALSE, TRUE);
}

static inline void drbd_thread_stop_nowait(struct drbd_thread *thi)
{
	_drbd_thread_stop(thi, FALSE, FALSE);
}

static inline void drbd_thread_restart_nowait(struct drbd_thread *thi)
{
	_drbd_thread_stop(thi, TRUE, FALSE);
}

/* counts how many answer packets we expect from our peer,
 * for either explicit application requests,
 * or implicit barrier packets as necessary.
 * increased:
 *  w_send_barrier
 *  _req_mod(req, queue_for_net_write or queue_for_net_read);
 *    it is much easier and equally valid to count what we queue for the
 *    worker, even before it actually was queued or sent.
 *    (drbd_make_request_common; recovery path on read io-error)
 * decreased:
 *  got_BarrierAck (respective tl_clear, tl_clear_barrier)
 *  _req_mod(req, data_received)
 *    [from receive_DataReply]
 *  _req_mod(req, write_acked_by_peer or recv_acked_by_peer or neg_acked)
 *    [from got_BlockAck (P_WRITE_ACK, P_RECV_ACK)]
 *  for some reason it is NOT decreased in got_NegAck,
 *  but in the resulting cleanup code from report_params.
 *  we should try to remember the reason for that...
 *  _req_mod(req, send_failed or
send_canceled)18791879+ * _req_mod(req, connection_lost_while_pending)18801880+ * [from tl_clear_barrier]18811881+ */18821882+static inline void inc_ap_pending(struct drbd_conf *mdev)18831883+{18841884+ atomic_inc(&mdev->ap_pending_cnt);18851885+}18861886+18871887+#define ERR_IF_CNT_IS_NEGATIVE(which) \18881888+ if (atomic_read(&mdev->which) < 0) \18891889+ dev_err(DEV, "in %s:%d: " #which " = %d < 0 !\n", \18901890+ __func__ , __LINE__ , \18911891+ atomic_read(&mdev->which))18921892+18931893+#define dec_ap_pending(mdev) do { \18941894+ typecheck(struct drbd_conf *, mdev); \18951895+ if (atomic_dec_and_test(&mdev->ap_pending_cnt)) \18961896+ wake_up(&mdev->misc_wait); \18971897+ ERR_IF_CNT_IS_NEGATIVE(ap_pending_cnt); } while (0)18981898+18991899+/* counts how many resync-related answers we still expect from the peer19001900+ * increase decrease19011901+ * C_SYNC_TARGET sends P_RS_DATA_REQUEST (and expects P_RS_DATA_REPLY)19021902+ * C_SYNC_SOURCE sends P_RS_DATA_REPLY (and expects P_WRITE_ACK whith ID_SYNCER)19031903+ * (or P_NEG_ACK with ID_SYNCER)19041904+ */19051905+static inline void inc_rs_pending(struct drbd_conf *mdev)19061906+{19071907+ atomic_inc(&mdev->rs_pending_cnt);19081908+}19091909+19101910+#define dec_rs_pending(mdev) do { \19111911+ typecheck(struct drbd_conf *, mdev); \19121912+ atomic_dec(&mdev->rs_pending_cnt); \19131913+ ERR_IF_CNT_IS_NEGATIVE(rs_pending_cnt); } while (0)19141914+19151915+/* counts how many answers we still need to send to the peer.19161916+ * increased on19171917+ * receive_Data unless protocol A;19181918+ * we need to send a P_RECV_ACK (proto B)19191919+ * or P_WRITE_ACK (proto C)19201920+ * receive_RSDataReply (recv_resync_read) we need to send a P_WRITE_ACK19211921+ * receive_DataRequest (receive_RSDataRequest) we need to send back P_DATA19221922+ * receive_Barrier_* we need to send a P_BARRIER_ACK19231923+ */19241924+static inline void inc_unacked(struct drbd_conf *mdev)19251925+{19261926+ 
atomic_inc(&mdev->unacked_cnt);19271927+}19281928+19291929+#define dec_unacked(mdev) do { \19301930+ typecheck(struct drbd_conf *, mdev); \19311931+ atomic_dec(&mdev->unacked_cnt); \19321932+ ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0)19331933+19341934+#define sub_unacked(mdev, n) do { \19351935+ typecheck(struct drbd_conf *, mdev); \19361936+ atomic_sub(n, &mdev->unacked_cnt); \19371937+ ERR_IF_CNT_IS_NEGATIVE(unacked_cnt); } while (0)19381938+19391939+19401940+static inline void put_net_conf(struct drbd_conf *mdev)19411941+{19421942+ if (atomic_dec_and_test(&mdev->net_cnt))19431943+ wake_up(&mdev->misc_wait);19441944+}19451945+19461946+/**19471947+ * get_net_conf() - Increase ref count on mdev->net_conf; Returns 0 if nothing there19481948+ * @mdev: DRBD device.19491949+ *19501950+ * You have to call put_net_conf() when finished working with mdev->net_conf.19511951+ */19521952+static inline int get_net_conf(struct drbd_conf *mdev)19531953+{19541954+ int have_net_conf;19551955+19561956+ atomic_inc(&mdev->net_cnt);19571957+ have_net_conf = mdev->state.conn >= C_UNCONNECTED;19581958+ if (!have_net_conf)19591959+ put_net_conf(mdev);19601960+ return have_net_conf;19611961+}19621962+19631963+/**19641964+ * get_ldev() - Increase the ref count on mdev->ldev. 
Returns 0 if there is no ldev19651965+ * @M: DRBD device.19661966+ *19671967+ * You have to call put_ldev() when finished working with mdev->ldev.19681968+ */19691969+#define get_ldev(M) __cond_lock(local, _get_ldev_if_state(M,D_INCONSISTENT))19701970+#define get_ldev_if_state(M,MINS) __cond_lock(local, _get_ldev_if_state(M,MINS))19711971+19721972+static inline void put_ldev(struct drbd_conf *mdev)19731973+{19741974+ __release(local);19751975+ if (atomic_dec_and_test(&mdev->local_cnt))19761976+ wake_up(&mdev->misc_wait);19771977+ D_ASSERT(atomic_read(&mdev->local_cnt) >= 0);19781978+}19791979+19801980+#ifndef __CHECKER__19811981+static inline int _get_ldev_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins)19821982+{19831983+ int io_allowed;19841984+19851985+ atomic_inc(&mdev->local_cnt);19861986+ io_allowed = (mdev->state.disk >= mins);19871987+ if (!io_allowed)19881988+ put_ldev(mdev);19891989+ return io_allowed;19901990+}19911991+#else19921992+extern int _get_ldev_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins);19931993+#endif19941994+19951995+/* you must have an "get_ldev" reference */19961996+static inline void drbd_get_syncer_progress(struct drbd_conf *mdev,19971997+ unsigned long *bits_left, unsigned int *per_mil_done)19981998+{19991999+ /*20002000+ * this is to break it at compile time when we change that20012001+ * (we may feel 4TB maximum storage per drbd is not enough)20022002+ */20032003+ typecheck(unsigned long, mdev->rs_total);20042004+20052005+ /* note: both rs_total and rs_left are in bits, i.e. in20062006+ * units of BM_BLOCK_SIZE.20072007+ * for the percentage, we don't care. */20082008+20092009+ *bits_left = drbd_bm_total_weight(mdev) - mdev->rs_failed;20102010+ /* >> 10 to prevent overflow,20112011+ * +1 to prevent division by zero */20122012+ if (*bits_left > mdev->rs_total) {20132013+ /* doh. 
maybe a logic bug somewhere.20142014+ * may also be just a race condition20152015+ * between this and a disconnect during sync.20162016+ * for now, just prevent in-kernel buffer overflow.20172017+ */20182018+ smp_rmb();20192019+ dev_warn(DEV, "cs:%s rs_left=%lu > rs_total=%lu (rs_failed %lu)\n",20202020+ drbd_conn_str(mdev->state.conn),20212021+ *bits_left, mdev->rs_total, mdev->rs_failed);20222022+ *per_mil_done = 0;20232023+ } else {20242024+ /* make sure the calculation happens in long context */20252025+ unsigned long tmp = 1000UL -20262026+ (*bits_left >> 10)*1000UL20272027+ / ((mdev->rs_total >> 10) + 1UL);20282028+ *per_mil_done = tmp;20292029+ }20302030+}20312031+20322032+20332033+/* this throttles on-the-fly application requests20342034+ * according to max_buffers settings;20352035+ * maybe re-implement using semaphores? */20362036+static inline int drbd_get_max_buffers(struct drbd_conf *mdev)20372037+{20382038+ int mxb = 1000000; /* arbitrary limit on open requests */20392039+ if (get_net_conf(mdev)) {20402040+ mxb = mdev->net_conf->max_buffers;20412041+ put_net_conf(mdev);20422042+ }20432043+ return mxb;20442044+}20452045+20462046+static inline int drbd_state_is_stable(union drbd_state s)20472047+{20482048+20492049+ /* DO NOT add a default clause, we want the compiler to warn us20502050+ * for any newly introduced state we may have forgotten to add here */20512051+20522052+ switch ((enum drbd_conns)s.conn) {20532053+ /* new io only accepted when there is no connection, ... */20542054+ case C_STANDALONE:20552055+ case C_WF_CONNECTION:20562056+ /* ... or there is a well established connection. 
*/20572057+ case C_CONNECTED:20582058+ case C_SYNC_SOURCE:20592059+ case C_SYNC_TARGET:20602060+ case C_VERIFY_S:20612061+ case C_VERIFY_T:20622062+ case C_PAUSED_SYNC_S:20632063+ case C_PAUSED_SYNC_T:20642064+ /* maybe stable, look at the disk state */20652065+ break;20662066+20672067+ /* no new io accepted during tansitional states20682068+ * like handshake or teardown */20692069+ case C_DISCONNECTING:20702070+ case C_UNCONNECTED:20712071+ case C_TIMEOUT:20722072+ case C_BROKEN_PIPE:20732073+ case C_NETWORK_FAILURE:20742074+ case C_PROTOCOL_ERROR:20752075+ case C_TEAR_DOWN:20762076+ case C_WF_REPORT_PARAMS:20772077+ case C_STARTING_SYNC_S:20782078+ case C_STARTING_SYNC_T:20792079+ case C_WF_BITMAP_S:20802080+ case C_WF_BITMAP_T:20812081+ case C_WF_SYNC_UUID:20822082+ case C_MASK:20832083+ /* not "stable" */20842084+ return 0;20852085+ }20862086+20872087+ switch ((enum drbd_disk_state)s.disk) {20882088+ case D_DISKLESS:20892089+ case D_INCONSISTENT:20902090+ case D_OUTDATED:20912091+ case D_CONSISTENT:20922092+ case D_UP_TO_DATE:20932093+ /* disk state is stable as well. */20942094+ break;20952095+20962096+ /* no new io accepted during tansitional states */20972097+ case D_ATTACHING:20982098+ case D_FAILED:20992099+ case D_NEGOTIATING:21002100+ case D_UNKNOWN:21012101+ case D_MASK:21022102+ /* not "stable" */21032103+ return 0;21042104+ }21052105+21062106+ return 1;21072107+}21082108+21092109+static inline int __inc_ap_bio_cond(struct drbd_conf *mdev)21102110+{21112111+ int mxb = drbd_get_max_buffers(mdev);21122112+21132113+ if (mdev->state.susp)21142114+ return 0;21152115+ if (test_bit(SUSPEND_IO, &mdev->flags))21162116+ return 0;21172117+21182118+ /* to avoid potential deadlock or bitmap corruption,21192119+ * in various places, we only allow new application io21202120+ * to start during "stable" states. 
*/21212121+21222122+ /* no new io accepted when attaching or detaching the disk */21232123+ if (!drbd_state_is_stable(mdev->state))21242124+ return 0;21252125+21262126+ /* since some older kernels don't have atomic_add_unless,21272127+ * and we are within the spinlock anyways, we have this workaround. */21282128+ if (atomic_read(&mdev->ap_bio_cnt) > mxb)21292129+ return 0;21302130+ if (test_bit(BITMAP_IO, &mdev->flags))21312131+ return 0;21322132+ return 1;21332133+}21342134+21352135+/* I'd like to use wait_event_lock_irq,21362136+ * but I'm not sure when it got introduced,21372137+ * and not sure when it has 3 or 4 arguments */21382138+static inline void inc_ap_bio(struct drbd_conf *mdev, int one_or_two)21392139+{21402140+ /* compare with after_state_ch,21412141+ * os.conn != C_WF_BITMAP_S && ns.conn == C_WF_BITMAP_S */21422142+ DEFINE_WAIT(wait);21432143+21442144+ /* we wait here21452145+ * as long as the device is suspended21462146+ * until the bitmap is no longer on the fly during connection21472147+ * handshake as long as we would exeed the max_buffer limit.21482148+ *21492149+ * to avoid races with the reconnect code,21502150+ * we need to atomic_inc within the spinlock. */21512151+21522152+ spin_lock_irq(&mdev->req_lock);21532153+ while (!__inc_ap_bio_cond(mdev)) {21542154+ prepare_to_wait(&mdev->misc_wait, &wait, TASK_UNINTERRUPTIBLE);21552155+ spin_unlock_irq(&mdev->req_lock);21562156+ schedule();21572157+ finish_wait(&mdev->misc_wait, &wait);21582158+ spin_lock_irq(&mdev->req_lock);21592159+ }21602160+ atomic_add(one_or_two, &mdev->ap_bio_cnt);21612161+ spin_unlock_irq(&mdev->req_lock);21622162+}21632163+21642164+static inline void dec_ap_bio(struct drbd_conf *mdev)21652165+{21662166+ int mxb = drbd_get_max_buffers(mdev);21672167+ int ap_bio = atomic_dec_return(&mdev->ap_bio_cnt);21682168+21692169+ D_ASSERT(ap_bio >= 0);21702170+ /* this currently does wake_up for every dec_ap_bio!21712171+ * maybe rather introduce some type of hysteresis?21722172+ * e.g. 
(ap_bio == mxb/2 || ap_bio == 0) ? */21732173+ if (ap_bio < mxb)21742174+ wake_up(&mdev->misc_wait);21752175+ if (ap_bio == 0 && test_bit(BITMAP_IO, &mdev->flags)) {21762176+ if (!test_and_set_bit(BITMAP_IO_QUEUED, &mdev->flags))21772177+ drbd_queue_work(&mdev->data.work, &mdev->bm_io_work.w);21782178+ }21792179+}21802180+21812181+static inline void drbd_set_ed_uuid(struct drbd_conf *mdev, u64 val)21822182+{21832183+ mdev->ed_uuid = val;21842184+}21852185+21862186+static inline int seq_cmp(u32 a, u32 b)21872187+{21882188+ /* we assume wrap around at 32bit.21892189+ * for wrap around at 24bit (old atomic_t),21902190+ * we'd have to21912191+ * a <<= 8; b <<= 8;21922192+ */21932193+ return (s32)(a) - (s32)(b);21942194+}21952195+#define seq_lt(a, b) (seq_cmp((a), (b)) < 0)21962196+#define seq_gt(a, b) (seq_cmp((a), (b)) > 0)21972197+#define seq_ge(a, b) (seq_cmp((a), (b)) >= 0)21982198+#define seq_le(a, b) (seq_cmp((a), (b)) <= 0)21992199+/* CAUTION: please no side effects in arguments! */22002200+#define seq_max(a, b) ((u32)(seq_gt((a), (b)) ? 
(a) : (b)))22012201+22022202+static inline void update_peer_seq(struct drbd_conf *mdev, unsigned int new_seq)22032203+{22042204+ unsigned int m;22052205+ spin_lock(&mdev->peer_seq_lock);22062206+ m = seq_max(mdev->peer_seq, new_seq);22072207+ mdev->peer_seq = m;22082208+ spin_unlock(&mdev->peer_seq_lock);22092209+ if (m == new_seq)22102210+ wake_up(&mdev->seq_wait);22112211+}22122212+22132213+static inline void drbd_update_congested(struct drbd_conf *mdev)22142214+{22152215+ struct sock *sk = mdev->data.socket->sk;22162216+ if (sk->sk_wmem_queued > sk->sk_sndbuf * 4 / 5)22172217+ set_bit(NET_CONGESTED, &mdev->flags);22182218+}22192219+22202220+static inline int drbd_queue_order_type(struct drbd_conf *mdev)22212221+{22222222+ /* sorry, we currently have no working implementation22232223+ * of distributed TCQ stuff */22242224+#ifndef QUEUE_ORDERED_NONE22252225+#define QUEUE_ORDERED_NONE 022262226+#endif22272227+ return QUEUE_ORDERED_NONE;22282228+}22292229+22302230+static inline void drbd_blk_run_queue(struct request_queue *q)22312231+{22322232+ if (q && q->unplug_fn)22332233+ q->unplug_fn(q);22342234+}22352235+22362236+static inline void drbd_kick_lo(struct drbd_conf *mdev)22372237+{22382238+ if (get_ldev(mdev)) {22392239+ drbd_blk_run_queue(bdev_get_queue(mdev->ldev->backing_bdev));22402240+ put_ldev(mdev);22412241+ }22422242+}22432243+22442244+static inline void drbd_md_flush(struct drbd_conf *mdev)22452245+{22462246+ int r;22472247+22482248+ if (test_bit(MD_NO_BARRIER, &mdev->flags))22492249+ return;22502250+22512251+ r = blkdev_issue_flush(mdev->ldev->md_bdev, NULL);22522252+ if (r) {22532253+ set_bit(MD_NO_BARRIER, &mdev->flags);22542254+ dev_err(DEV, "meta data flush failed with status %d, disabling md-flushes\n", r);22552255+ }22562256+}22572257+22582258+#endif
+3735
drivers/block/drbd/drbd_main.c
···11+/*22+ drbd.c33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.88+ Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.99+1010+ Thanks to Carter Burden, Bart Grantham and Gennadiy Nerubayev1111+ from Logicworks, Inc. for making SDP replication support possible.1212+1313+ drbd is free software; you can redistribute it and/or modify1414+ it under the terms of the GNU General Public License as published by1515+ the Free Software Foundation; either version 2, or (at your option)1616+ any later version.1717+1818+ drbd is distributed in the hope that it will be useful,1919+ but WITHOUT ANY WARRANTY; without even the implied warranty of2020+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the2121+ GNU General Public License for more details.2222+2323+ You should have received a copy of the GNU General Public License2424+ along with drbd; see the file COPYING. 
If not, write to2525+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2626+2727+ */2828+2929+#include <linux/autoconf.h>3030+#include <linux/module.h>3131+#include <linux/version.h>3232+#include <linux/drbd.h>3333+#include <asm/uaccess.h>3434+#include <asm/types.h>3535+#include <net/sock.h>3636+#include <linux/ctype.h>3737+#include <linux/smp_lock.h>3838+#include <linux/fs.h>3939+#include <linux/file.h>4040+#include <linux/proc_fs.h>4141+#include <linux/init.h>4242+#include <linux/mm.h>4343+#include <linux/memcontrol.h>4444+#include <linux/mm_inline.h>4545+#include <linux/slab.h>4646+#include <linux/random.h>4747+#include <linux/reboot.h>4848+#include <linux/notifier.h>4949+#include <linux/kthread.h>5050+5151+#define __KERNEL_SYSCALLS__5252+#include <linux/unistd.h>5353+#include <linux/vmalloc.h>5454+5555+#include <linux/drbd_limits.h>5656+#include "drbd_int.h"5757+#include "drbd_tracing.h"5858+#include "drbd_req.h" /* only for _req_mod in tl_release and tl_clear */5959+6060+#include "drbd_vli.h"6161+6262+struct after_state_chg_work {6363+ struct drbd_work w;6464+ union drbd_state os;6565+ union drbd_state ns;6666+ enum chg_state_flags flags;6767+ struct completion *done;6868+};6969+7070+int drbdd_init(struct drbd_thread *);7171+int drbd_worker(struct drbd_thread *);7272+int drbd_asender(struct drbd_thread *);7373+7474+int drbd_init(void);7575+static int drbd_open(struct block_device *bdev, fmode_t mode);7676+static int drbd_release(struct gendisk *gd, fmode_t mode);7777+static int w_after_state_ch(struct drbd_conf *mdev, struct drbd_work *w, int unused);7878+static void after_state_ch(struct drbd_conf *mdev, union drbd_state os,7979+ union drbd_state ns, enum chg_state_flags flags);8080+static int w_md_sync(struct drbd_conf *mdev, struct drbd_work *w, int unused);8181+static void md_sync_timer_fn(unsigned long data);8282+static int w_bitmap_io(struct drbd_conf *mdev, struct drbd_work *w, int 
unused);8383+8484+DEFINE_TRACE(drbd_unplug);8585+DEFINE_TRACE(drbd_uuid);8686+DEFINE_TRACE(drbd_ee);8787+DEFINE_TRACE(drbd_packet);8888+DEFINE_TRACE(drbd_md_io);8989+DEFINE_TRACE(drbd_epoch);9090+DEFINE_TRACE(drbd_netlink);9191+DEFINE_TRACE(drbd_actlog);9292+DEFINE_TRACE(drbd_bio);9393+DEFINE_TRACE(_drbd_resync);9494+DEFINE_TRACE(drbd_req);9595+9696+MODULE_AUTHOR("Philipp Reisner <phil@linbit.com>, "9797+ "Lars Ellenberg <lars@linbit.com>");9898+MODULE_DESCRIPTION("drbd - Distributed Replicated Block Device v" REL_VERSION);9999+MODULE_VERSION(REL_VERSION);100100+MODULE_LICENSE("GPL");101101+MODULE_PARM_DESC(minor_count, "Maximum number of drbd devices (1-255)");102102+MODULE_ALIAS_BLOCKDEV_MAJOR(DRBD_MAJOR);103103+104104+#include <linux/moduleparam.h>105105+/* allow_open_on_secondary */106106+MODULE_PARM_DESC(allow_oos, "DONT USE!");107107+/* thanks to these macros, if compiled into the kernel (not-module),108108+ * this becomes the boot parameter drbd.minor_count */109109+module_param(minor_count, uint, 0444);110110+module_param(disable_sendpage, bool, 0644);111111+module_param(allow_oos, bool, 0);112112+module_param(cn_idx, uint, 0444);113113+module_param(proc_details, int, 0644);114114+115115+#ifdef CONFIG_DRBD_FAULT_INJECTION116116+int enable_faults;117117+int fault_rate;118118+static int fault_count;119119+int fault_devs;120120+/* bitmap of enabled faults */121121+module_param(enable_faults, int, 0664);122122+/* fault rate % value - applies to all enabled faults */123123+module_param(fault_rate, int, 0664);124124+/* count of faults inserted */125125+module_param(fault_count, int, 0664);126126+/* bitmap of devices to insert faults on */127127+module_param(fault_devs, int, 0644);128128+#endif129129+130130+/* module parameter, defined */131131+unsigned int minor_count = 32;132132+int disable_sendpage;133133+int allow_oos;134134+unsigned int cn_idx = CN_IDX_DRBD;135135+int proc_details; /* Detail level in proc drbd*/136136+137137+/* Module parameter for setting 
the user mode helper program138138+ * to run. Default is /sbin/drbdadm */139139+char usermode_helper[80] = "/sbin/drbdadm";140140+141141+module_param_string(usermode_helper, usermode_helper, sizeof(usermode_helper), 0644);142142+143143+/* in 2.6.x, our device mapping and config info contains our virtual gendisks144144+ * as member "struct gendisk *vdisk;"145145+ */146146+struct drbd_conf **minor_table;147147+148148+struct kmem_cache *drbd_request_cache;149149+struct kmem_cache *drbd_ee_cache; /* epoch entries */150150+struct kmem_cache *drbd_bm_ext_cache; /* bitmap extents */151151+struct kmem_cache *drbd_al_ext_cache; /* activity log extents */152152+mempool_t *drbd_request_mempool;153153+mempool_t *drbd_ee_mempool;154154+155155+/* I do not use a standard mempool, because:156156+ 1) I want to hand out the pre-allocated objects first.157157+ 2) I want to be able to interrupt sleeping allocation with a signal.158158+ Note: This is a single linked list, the next pointer is the private159159+ member of struct page.160160+ */161161+struct page *drbd_pp_pool;162162+spinlock_t drbd_pp_lock;163163+int drbd_pp_vacant;164164+wait_queue_head_t drbd_pp_wait;165165+166166+DEFINE_RATELIMIT_STATE(drbd_ratelimit_state, 5 * HZ, 5);167167+168168+static struct block_device_operations drbd_ops = {169169+ .owner = THIS_MODULE,170170+ .open = drbd_open,171171+ .release = drbd_release,172172+};173173+174174+#define ARRY_SIZE(A) (sizeof(A)/sizeof(A[0]))175175+176176+#ifdef __CHECKER__177177+/* When checking with sparse, and this is an inline function, sparse will178178+ give tons of false positives. 
When this is a real functions sparse works.179179+ */180180+int _get_ldev_if_state(struct drbd_conf *mdev, enum drbd_disk_state mins)181181+{182182+ int io_allowed;183183+184184+ atomic_inc(&mdev->local_cnt);185185+ io_allowed = (mdev->state.disk >= mins);186186+ if (!io_allowed) {187187+ if (atomic_dec_and_test(&mdev->local_cnt))188188+ wake_up(&mdev->misc_wait);189189+ }190190+ return io_allowed;191191+}192192+193193+#endif194194+195195+/**196196+ * DOC: The transfer log197197+ *198198+ * The transfer log is a single linked list of &struct drbd_tl_epoch objects.199199+ * mdev->newest_tle points to the head, mdev->oldest_tle points to the tail200200+ * of the list. There is always at least one &struct drbd_tl_epoch object.201201+ *202202+ * Each &struct drbd_tl_epoch has a circular double linked list of requests203203+ * attached.204204+ */205205+static int tl_init(struct drbd_conf *mdev)206206+{207207+ struct drbd_tl_epoch *b;208208+209209+ /* during device minor initialization, we may well use GFP_KERNEL */210210+ b = kmalloc(sizeof(struct drbd_tl_epoch), GFP_KERNEL);211211+ if (!b)212212+ return 0;213213+ INIT_LIST_HEAD(&b->requests);214214+ INIT_LIST_HEAD(&b->w.list);215215+ b->next = NULL;216216+ b->br_number = 4711;217217+ b->n_req = 0;218218+ b->w.cb = NULL; /* if this is != NULL, we need to dec_ap_pending in tl_clear */219219+220220+ mdev->oldest_tle = b;221221+ mdev->newest_tle = b;222222+ INIT_LIST_HEAD(&mdev->out_of_sequence_requests);223223+224224+ mdev->tl_hash = NULL;225225+ mdev->tl_hash_s = 0;226226+227227+ return 1;228228+}229229+230230+static void tl_cleanup(struct drbd_conf *mdev)231231+{232232+ D_ASSERT(mdev->oldest_tle == mdev->newest_tle);233233+ D_ASSERT(list_empty(&mdev->out_of_sequence_requests));234234+ kfree(mdev->oldest_tle);235235+ mdev->oldest_tle = NULL;236236+ kfree(mdev->unused_spare_tle);237237+ mdev->unused_spare_tle = NULL;238238+ kfree(mdev->tl_hash);239239+ mdev->tl_hash = NULL;240240+ mdev->tl_hash_s = 
0;241241+}242242+243243+/**244244+ * _tl_add_barrier() - Adds a barrier to the transfer log245245+ * @mdev: DRBD device.246246+ * @new: Barrier to be added before the current head of the TL.247247+ *248248+ * The caller must hold the req_lock.249249+ */250250+void _tl_add_barrier(struct drbd_conf *mdev, struct drbd_tl_epoch *new)251251+{252252+ struct drbd_tl_epoch *newest_before;253253+254254+ INIT_LIST_HEAD(&new->requests);255255+ INIT_LIST_HEAD(&new->w.list);256256+ new->w.cb = NULL; /* if this is != NULL, we need to dec_ap_pending in tl_clear */257257+ new->next = NULL;258258+ new->n_req = 0;259259+260260+ newest_before = mdev->newest_tle;261261+ /* never send a barrier number == 0, because that is special-cased262262+ * when using TCQ for our write ordering code */263263+ new->br_number = (newest_before->br_number+1) ?: 1;264264+ if (mdev->newest_tle != new) {265265+ mdev->newest_tle->next = new;266266+ mdev->newest_tle = new;267267+ }268268+}269269+270270+/**271271+ * tl_release() - Free or recycle the oldest &struct drbd_tl_epoch object of the TL272272+ * @mdev: DRBD device.273273+ * @barrier_nr: Expected identifier of the DRBD write barrier packet.274274+ * @set_size: Expected number of requests before that barrier.275275+ *276276+ * In case the passed barrier_nr or set_size does not match the oldest277277+ * &struct drbd_tl_epoch objects this function will cause a termination278278+ * of the connection.279279+ */280280+void tl_release(struct drbd_conf *mdev, unsigned int barrier_nr,281281+ unsigned int set_size)282282+{283283+ struct drbd_tl_epoch *b, *nob; /* next old barrier */284284+ struct list_head *le, *tle;285285+ struct drbd_request *r;286286+287287+ spin_lock_irq(&mdev->req_lock);288288+289289+ b = mdev->oldest_tle;290290+291291+ /* first some paranoia code */292292+ if (b == NULL) {293293+ dev_err(DEV, "BAD! 
BarrierAck #%u received, but no epoch in tl!?\n",294294+ barrier_nr);295295+ goto bail;296296+ }297297+ if (b->br_number != barrier_nr) {298298+ dev_err(DEV, "BAD! BarrierAck #%u received, expected #%u!\n",299299+ barrier_nr, b->br_number);300300+ goto bail;301301+ }302302+ if (b->n_req != set_size) {303303+ dev_err(DEV, "BAD! BarrierAck #%u received with n_req=%u, expected n_req=%u!\n",304304+ barrier_nr, set_size, b->n_req);305305+ goto bail;306306+ }307307+308308+ /* Clean up list of requests processed during current epoch */309309+ list_for_each_safe(le, tle, &b->requests) {310310+ r = list_entry(le, struct drbd_request, tl_requests);311311+ _req_mod(r, barrier_acked);312312+ }313313+ /* There could be requests on the list waiting for completion314314+ of the write to the local disk. To avoid corruptions of315315+ slab's data structures we have to remove the lists head.316316+317317+ Also there could have been a barrier ack out of sequence, overtaking318318+ the write acks - which would be a bug and violating write ordering.319319+ To not deadlock in case we lose connection while such requests are320320+ still pending, we need some way to find them for the321321+ _req_mode(connection_lost_while_pending).322322+323323+ These have been list_move'd to the out_of_sequence_requests list in324324+ _req_mod(, barrier_acked) above.325325+ */326326+ list_del_init(&b->requests);327327+328328+ nob = b->next;329329+ if (test_and_clear_bit(CREATE_BARRIER, &mdev->flags)) {330330+ _tl_add_barrier(mdev, b);331331+ if (nob)332332+ mdev->oldest_tle = nob;333333+ /* if nob == NULL b was the only barrier, and becomes the new334334+ barrier. 
Therefore mdev->oldest_tle points already to b */335335+ } else {336336+ D_ASSERT(nob != NULL);337337+ mdev->oldest_tle = nob;338338+ kfree(b);339339+ }340340+341341+ spin_unlock_irq(&mdev->req_lock);342342+ dec_ap_pending(mdev);343343+344344+ return;345345+346346+bail:347347+ spin_unlock_irq(&mdev->req_lock);348348+ drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));349349+}350350+351351+352352+/**353353+ * tl_clear() - Clears all requests and &struct drbd_tl_epoch objects out of the TL354354+ * @mdev: DRBD device.355355+ *356356+ * This is called after the connection to the peer was lost. The storage covered357357+ * by the requests on the transfer gets marked as our of sync. Called from the358358+ * receiver thread and the worker thread.359359+ */360360+void tl_clear(struct drbd_conf *mdev)361361+{362362+ struct drbd_tl_epoch *b, *tmp;363363+ struct list_head *le, *tle;364364+ struct drbd_request *r;365365+ int new_initial_bnr = net_random();366366+367367+ spin_lock_irq(&mdev->req_lock);368368+369369+ b = mdev->oldest_tle;370370+ while (b) {371371+ list_for_each_safe(le, tle, &b->requests) {372372+ r = list_entry(le, struct drbd_request, tl_requests);373373+ /* It would be nice to complete outside of spinlock.374374+ * But this is easier for now. */375375+ _req_mod(r, connection_lost_while_pending);376376+ }377377+ tmp = b->next;378378+379379+ /* there could still be requests on that ring list,380380+ * in case local io is still pending */381381+ list_del(&b->requests);382382+383383+ /* dec_ap_pending corresponding to queue_barrier.384384+ * the newest barrier may not have been queued yet,385385+ * in which case w.cb is still NULL. */386386+ if (b->w.cb != NULL)387387+ dec_ap_pending(mdev);388388+389389+ if (b == mdev->newest_tle) {390390+ /* recycle, but reinit! 
*/391391+ D_ASSERT(tmp == NULL);392392+ INIT_LIST_HEAD(&b->requests);393393+ INIT_LIST_HEAD(&b->w.list);394394+ b->w.cb = NULL;395395+ b->br_number = new_initial_bnr;396396+ b->n_req = 0;397397+398398+ mdev->oldest_tle = b;399399+ break;400400+ }401401+ kfree(b);402402+ b = tmp;403403+ }404404+405405+ /* we expect this list to be empty. */406406+ D_ASSERT(list_empty(&mdev->out_of_sequence_requests));407407+408408+ /* but just in case, clean it up anyways! */409409+ list_for_each_safe(le, tle, &mdev->out_of_sequence_requests) {410410+ r = list_entry(le, struct drbd_request, tl_requests);411411+ /* It would be nice to complete outside of spinlock.412412+ * But this is easier for now. */413413+ _req_mod(r, connection_lost_while_pending);414414+ }415415+416416+ /* ensure bit indicating barrier is required is clear */417417+ clear_bit(CREATE_BARRIER, &mdev->flags);418418+419419+ spin_unlock_irq(&mdev->req_lock);420420+}421421+422422+/**423423+ * cl_wide_st_chg() - TRUE if the state change is a cluster wide one424424+ * @mdev: DRBD device.425425+ * @os: old (current) state.426426+ * @ns: new (wanted) state.427427+ */428428+static int cl_wide_st_chg(struct drbd_conf *mdev,429429+ union drbd_state os, union drbd_state ns)430430+{431431+ return (os.conn >= C_CONNECTED && ns.conn >= C_CONNECTED &&432432+ ((os.role != R_PRIMARY && ns.role == R_PRIMARY) ||433433+ (os.conn != C_STARTING_SYNC_T && ns.conn == C_STARTING_SYNC_T) ||434434+ (os.conn != C_STARTING_SYNC_S && ns.conn == C_STARTING_SYNC_S) ||435435+ (os.disk != D_DISKLESS && ns.disk == D_DISKLESS))) ||436436+ (os.conn >= C_CONNECTED && ns.conn == C_DISCONNECTING) ||437437+ (os.conn == C_CONNECTED && ns.conn == C_VERIFY_S);438438+}439439+440440+int drbd_change_state(struct drbd_conf *mdev, enum chg_state_flags f,441441+ union drbd_state mask, union drbd_state val)442442+{443443+ unsigned long flags;444444+ union drbd_state os, ns;445445+ int rv;446446+447447+ spin_lock_irqsave(&mdev->req_lock, flags);448448+ os = 
mdev->state;449449+ ns.i = (os.i & ~mask.i) | val.i;450450+ rv = _drbd_set_state(mdev, ns, f, NULL);451451+ ns = mdev->state;452452+ spin_unlock_irqrestore(&mdev->req_lock, flags);453453+454454+ return rv;455455+}456456+457457+/**458458+ * drbd_force_state() - Impose a change which happens outside our control on our state459459+ * @mdev: DRBD device.460460+ * @mask: mask of state bits to change.461461+ * @val: value of new state bits.462462+ */463463+void drbd_force_state(struct drbd_conf *mdev,464464+ union drbd_state mask, union drbd_state val)465465+{466466+ drbd_change_state(mdev, CS_HARD, mask, val);467467+}468468+469469+static int is_valid_state(struct drbd_conf *mdev, union drbd_state ns);470470+static int is_valid_state_transition(struct drbd_conf *,471471+ union drbd_state, union drbd_state);472472+static union drbd_state sanitize_state(struct drbd_conf *mdev, union drbd_state os,473473+ union drbd_state ns, int *warn_sync_abort);474474+int drbd_send_state_req(struct drbd_conf *,475475+ union drbd_state, union drbd_state);476476+477477+static enum drbd_state_ret_codes _req_st_cond(struct drbd_conf *mdev,478478+ union drbd_state mask, union drbd_state val)479479+{480480+ union drbd_state os, ns;481481+ unsigned long flags;482482+ int rv;483483+484484+ if (test_and_clear_bit(CL_ST_CHG_SUCCESS, &mdev->flags))485485+ return SS_CW_SUCCESS;486486+487487+ if (test_and_clear_bit(CL_ST_CHG_FAIL, &mdev->flags))488488+ return SS_CW_FAILED_BY_PEER;489489+490490+ rv = 0;491491+ spin_lock_irqsave(&mdev->req_lock, flags);492492+ os = mdev->state;493493+ ns.i = (os.i & ~mask.i) | val.i;494494+ ns = sanitize_state(mdev, os, ns, NULL);495495+496496+ if (!cl_wide_st_chg(mdev, os, ns))497497+ rv = SS_CW_NO_NEED;498498+ if (!rv) {499499+ rv = is_valid_state(mdev, ns);500500+ if (rv == SS_SUCCESS) {501501+ rv = is_valid_state_transition(mdev, ns, os);502502+ if (rv == SS_SUCCESS)503503+ rv = 0; /* cont waiting, otherwise fail. 
 */504504+		}505505+	}506506+	spin_unlock_irqrestore(&mdev->req_lock, flags);507507+508508+	return rv;509509+}510510+511511+/**512512+ * drbd_req_state() - Perform a possibly cluster-wide state change513513+ * @mdev:	DRBD device.514514+ * @mask:	mask of state bits to change.515515+ * @val:	value of new state bits.516516+ * @f:		flags517517+ *518518+ * Should not be called directly, use drbd_request_state() or519519+ * _drbd_request_state().520520+ */521521+static int drbd_req_state(struct drbd_conf *mdev,522522+			  union drbd_state mask, union drbd_state val,523523+			  enum chg_state_flags f)524524+{525525+	struct completion done;526526+	unsigned long flags;527527+	union drbd_state os, ns;528528+	int rv;529529+530530+	init_completion(&done);531531+532532+	if (f & CS_SERIALIZE)533533+		mutex_lock(&mdev->state_mutex);534534+535535+	spin_lock_irqsave(&mdev->req_lock, flags);536536+	os = mdev->state;537537+	ns.i = (os.i & ~mask.i) | val.i;538538+	ns = sanitize_state(mdev, os, ns, NULL);539539+540540+	if (cl_wide_st_chg(mdev, os, ns)) {541541+		rv = is_valid_state(mdev, ns);542542+		if (rv == SS_SUCCESS)543543+			rv = is_valid_state_transition(mdev, ns, os);544544+		spin_unlock_irqrestore(&mdev->req_lock, flags);545545+546546+		if (rv < SS_SUCCESS) {547547+			if (f & CS_VERBOSE)548548+				print_st_err(mdev, os, ns, rv);549549+			goto abort;550550+		}551551+552552+		drbd_state_lock(mdev);553553+		if (!drbd_send_state_req(mdev, mask, val)) {554554+			drbd_state_unlock(mdev);555555+			rv = SS_CW_FAILED_BY_PEER;556556+			if (f & CS_VERBOSE)557557+				print_st_err(mdev, os, ns, rv);558558+			goto abort;559559+		}560560+561561+		wait_event(mdev->state_wait,562562+			(rv = _req_st_cond(mdev, mask, val)));563563+564564+		if (rv < SS_SUCCESS) {565565+			drbd_state_unlock(mdev);566566+			if (f & CS_VERBOSE)567567+				print_st_err(mdev, os, ns, rv);568568+			goto abort;569569+		}570570+		spin_lock_irqsave(&mdev->req_lock, flags);571571+		os = mdev->state;572572+		ns.i = (os.i & ~mask.i) | val.i;573573+		rv = _drbd_set_state(mdev,
ns, f, &done);574574+ drbd_state_unlock(mdev);575575+ } else {576576+ rv = _drbd_set_state(mdev, ns, f, &done);577577+ }578578+579579+ spin_unlock_irqrestore(&mdev->req_lock, flags);580580+581581+ if (f & CS_WAIT_COMPLETE && rv == SS_SUCCESS) {582582+ D_ASSERT(current != mdev->worker.task);583583+ wait_for_completion(&done);584584+ }585585+586586+abort:587587+ if (f & CS_SERIALIZE)588588+ mutex_unlock(&mdev->state_mutex);589589+590590+ return rv;591591+}592592+593593+/**594594+ * _drbd_request_state() - Request a state change (with flags)595595+ * @mdev: DRBD device.596596+ * @mask: mask of state bits to change.597597+ * @val: value of new state bits.598598+ * @f: flags599599+ *600600+ * Cousin of drbd_request_state(), useful with the CS_WAIT_COMPLETE601601+ * flag, or when logging of failed state change requests is not desired.602602+ */603603+int _drbd_request_state(struct drbd_conf *mdev, union drbd_state mask,604604+ union drbd_state val, enum chg_state_flags f)605605+{606606+ int rv;607607+608608+ wait_event(mdev->state_wait,609609+ (rv = drbd_req_state(mdev, mask, val, f)) != SS_IN_TRANSIENT_STATE);610610+611611+ return rv;612612+}613613+614614+static void print_st(struct drbd_conf *mdev, char *name, union drbd_state ns)615615+{616616+ dev_err(DEV, " %s = { cs:%s ro:%s/%s ds:%s/%s %c%c%c%c }\n",617617+ name,618618+ drbd_conn_str(ns.conn),619619+ drbd_role_str(ns.role),620620+ drbd_role_str(ns.peer),621621+ drbd_disk_str(ns.disk),622622+ drbd_disk_str(ns.pdsk),623623+ ns.susp ? 's' : 'r',624624+ ns.aftr_isp ? 'a' : '-',625625+ ns.peer_isp ? 'p' : '-',626626+ ns.user_isp ? 
'u' : '-'627627+ );628628+}629629+630630+void print_st_err(struct drbd_conf *mdev,631631+ union drbd_state os, union drbd_state ns, int err)632632+{633633+ if (err == SS_IN_TRANSIENT_STATE)634634+ return;635635+ dev_err(DEV, "State change failed: %s\n", drbd_set_st_err_str(err));636636+ print_st(mdev, " state", os);637637+ print_st(mdev, "wanted", ns);638638+}639639+640640+641641+#define drbd_peer_str drbd_role_str642642+#define drbd_pdsk_str drbd_disk_str643643+644644+#define drbd_susp_str(A) ((A) ? "1" : "0")645645+#define drbd_aftr_isp_str(A) ((A) ? "1" : "0")646646+#define drbd_peer_isp_str(A) ((A) ? "1" : "0")647647+#define drbd_user_isp_str(A) ((A) ? "1" : "0")648648+649649+#define PSC(A) \650650+ ({ if (ns.A != os.A) { \651651+ pbp += sprintf(pbp, #A "( %s -> %s ) ", \652652+ drbd_##A##_str(os.A), \653653+ drbd_##A##_str(ns.A)); \654654+ } })655655+656656+/**657657+ * is_valid_state() - Returns an SS_ error code if ns is not valid658658+ * @mdev: DRBD device.659659+ * @ns: State to consider.660660+ */661661+static int is_valid_state(struct drbd_conf *mdev, union drbd_state ns)662662+{663663+ /* See drbd_state_sw_errors in drbd_strings.c */664664+665665+ enum drbd_fencing_p fp;666666+ int rv = SS_SUCCESS;667667+668668+ fp = FP_DONT_CARE;669669+ if (get_ldev(mdev)) {670670+ fp = mdev->ldev->dc.fencing;671671+ put_ldev(mdev);672672+ }673673+674674+ if (get_net_conf(mdev)) {675675+ if (!mdev->net_conf->two_primaries &&676676+ ns.role == R_PRIMARY && ns.peer == R_PRIMARY)677677+ rv = SS_TWO_PRIMARIES;678678+ put_net_conf(mdev);679679+ }680680+681681+ if (rv <= 0)682682+ /* already found a reason to abort */;683683+ else if (ns.role == R_SECONDARY && mdev->open_cnt)684684+ rv = SS_DEVICE_IN_USE;685685+686686+ else if (ns.role == R_PRIMARY && ns.conn < C_CONNECTED && ns.disk < D_UP_TO_DATE)687687+ rv = SS_NO_UP_TO_DATE_DISK;688688+689689+ else if (fp >= FP_RESOURCE &&690690+ ns.role == R_PRIMARY && ns.conn < C_CONNECTED && ns.pdsk >= D_UNKNOWN)691691+ rv = 
SS_PRIMARY_NOP;692692+693693+ else if (ns.role == R_PRIMARY && ns.disk <= D_INCONSISTENT && ns.pdsk <= D_INCONSISTENT)694694+ rv = SS_NO_UP_TO_DATE_DISK;695695+696696+ else if (ns.conn > C_CONNECTED && ns.disk < D_INCONSISTENT)697697+ rv = SS_NO_LOCAL_DISK;698698+699699+ else if (ns.conn > C_CONNECTED && ns.pdsk < D_INCONSISTENT)700700+ rv = SS_NO_REMOTE_DISK;701701+702702+ else if ((ns.conn == C_CONNECTED ||703703+ ns.conn == C_WF_BITMAP_S ||704704+ ns.conn == C_SYNC_SOURCE ||705705+ ns.conn == C_PAUSED_SYNC_S) &&706706+ ns.disk == D_OUTDATED)707707+ rv = SS_CONNECTED_OUTDATES;708708+709709+ else if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&710710+ (mdev->sync_conf.verify_alg[0] == 0))711711+ rv = SS_NO_VERIFY_ALG;712712+713713+ else if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&714714+ mdev->agreed_pro_version < 88)715715+ rv = SS_NOT_SUPPORTED;716716+717717+ return rv;718718+}719719+720720+/**721721+ * is_valid_state_transition() - Returns an SS_ error code if the state transition is not possible722722+ * @mdev: DRBD device.723723+ * @ns: new state.724724+ * @os: old state.725725+ */726726+static int is_valid_state_transition(struct drbd_conf *mdev,727727+ union drbd_state ns, union drbd_state os)728728+{729729+ int rv = SS_SUCCESS;730730+731731+ if ((ns.conn == C_STARTING_SYNC_T || ns.conn == C_STARTING_SYNC_S) &&732732+ os.conn > C_CONNECTED)733733+ rv = SS_RESYNC_RUNNING;734734+735735+ if (ns.conn == C_DISCONNECTING && os.conn == C_STANDALONE)736736+ rv = SS_ALREADY_STANDALONE;737737+738738+ if (ns.disk > D_ATTACHING && os.disk == D_DISKLESS)739739+ rv = SS_IS_DISKLESS;740740+741741+ if (ns.conn == C_WF_CONNECTION && os.conn < C_UNCONNECTED)742742+ rv = SS_NO_NET_CONFIG;743743+744744+ if (ns.disk == D_OUTDATED && os.disk < D_OUTDATED && os.disk != D_ATTACHING)745745+ rv = SS_LOWER_THAN_OUTDATED;746746+747747+ if (ns.conn == C_DISCONNECTING && os.conn == C_UNCONNECTED)748748+ rv = SS_IN_TRANSIENT_STATE;749749+750750+ if (ns.conn == os.conn 
&& ns.conn == C_WF_REPORT_PARAMS)750750+		rv = SS_IN_TRANSIENT_STATE;751751+752752+	if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) && os.conn < C_CONNECTED)753753+		rv = SS_NEED_CONNECTION;754754+755755+	if ((ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T) &&756756+	    ns.conn != os.conn && os.conn > C_CONNECTED)757757+		rv = SS_RESYNC_RUNNING;758758+759759+	if ((ns.conn == C_STARTING_SYNC_S || ns.conn == C_STARTING_SYNC_T) &&760760+	    os.conn < C_CONNECTED)761761+		rv = SS_NEED_CONNECTION;762762+763763+	return rv;764764+}765765+766766+/**767767+ * sanitize_state() - Resolves implicitly necessary additional changes to a state transition768768+ * @mdev:	DRBD device.769769+ * @os:		old state.770770+ * @ns:		new state.771771+ * @warn_sync_abort:772772+ *773773+ * When we lose the connection, we have to set the state of the peer's disk (pdsk)774774+ * to D_UNKNOWN. This rule and many more along those lines are in this function.775775+ */776776+static union drbd_state sanitize_state(struct drbd_conf *mdev, union drbd_state os,777777+				       union drbd_state ns, int *warn_sync_abort)778778+{779779+	enum drbd_fencing_p fp;780780+781781+	fp = FP_DONT_CARE;782782+	if (get_ldev(mdev)) {783783+		fp = mdev->ldev->dc.fencing;784784+		put_ldev(mdev);785785+	}786786+787787+	/* Disallow network errors to configure a device's network part */788788+	if ((ns.conn >= C_TIMEOUT && ns.conn <= C_TEAR_DOWN) &&789789+	    os.conn <= C_DISCONNECTING)790790+		ns.conn = os.conn;791791+792792+	/* After a network error (+C_TEAR_DOWN) only C_UNCONNECTED or C_DISCONNECTING can follow */793793+	if (os.conn >= C_TIMEOUT && os.conn <= C_TEAR_DOWN &&794794+	    ns.conn != C_UNCONNECTED && ns.conn != C_DISCONNECTING)795795+		ns.conn = os.conn;796796+797797+	/* After C_DISCONNECTING only C_STANDALONE may follow */798798+	if (os.conn == C_DISCONNECTING && ns.conn != C_STANDALONE)799799+		ns.conn = os.conn;800800+801801+	if (ns.conn < C_CONNECTED) {802802+		ns.peer_isp = 0;803803+		ns.peer = R_UNKNOWN;804804+		if (ns.pdsk > D_UNKNOWN
|| ns.pdsk < D_INCONSISTENT)806806+ ns.pdsk = D_UNKNOWN;807807+ }808808+809809+ /* Clear the aftr_isp when becoming unconfigured */810810+ if (ns.conn == C_STANDALONE && ns.disk == D_DISKLESS && ns.role == R_SECONDARY)811811+ ns.aftr_isp = 0;812812+813813+ if (ns.conn <= C_DISCONNECTING && ns.disk == D_DISKLESS)814814+ ns.pdsk = D_UNKNOWN;815815+816816+ /* Abort resync if a disk fails/detaches */817817+ if (os.conn > C_CONNECTED && ns.conn > C_CONNECTED &&818818+ (ns.disk <= D_FAILED || ns.pdsk <= D_FAILED)) {819819+ if (warn_sync_abort)820820+ *warn_sync_abort = 1;821821+ ns.conn = C_CONNECTED;822822+ }823823+824824+ if (ns.conn >= C_CONNECTED &&825825+ ((ns.disk == D_CONSISTENT || ns.disk == D_OUTDATED) ||826826+ (ns.disk == D_NEGOTIATING && ns.conn == C_WF_BITMAP_T))) {827827+ switch (ns.conn) {828828+ case C_WF_BITMAP_T:829829+ case C_PAUSED_SYNC_T:830830+ ns.disk = D_OUTDATED;831831+ break;832832+ case C_CONNECTED:833833+ case C_WF_BITMAP_S:834834+ case C_SYNC_SOURCE:835835+ case C_PAUSED_SYNC_S:836836+ ns.disk = D_UP_TO_DATE;837837+ break;838838+ case C_SYNC_TARGET:839839+ ns.disk = D_INCONSISTENT;840840+ dev_warn(DEV, "Implicitly set disk state Inconsistent!\n");841841+ break;842842+ }843843+ if (os.disk == D_OUTDATED && ns.disk == D_UP_TO_DATE)844844+ dev_warn(DEV, "Implicitly set disk from Outdated to UpToDate\n");845845+ }846846+847847+ if (ns.conn >= C_CONNECTED &&848848+ (ns.pdsk == D_CONSISTENT || ns.pdsk == D_OUTDATED)) {849849+ switch (ns.conn) {850850+ case C_CONNECTED:851851+ case C_WF_BITMAP_T:852852+ case C_PAUSED_SYNC_T:853853+ case C_SYNC_TARGET:854854+ ns.pdsk = D_UP_TO_DATE;855855+ break;856856+ case C_WF_BITMAP_S:857857+ case C_PAUSED_SYNC_S:858858+ ns.pdsk = D_OUTDATED;859859+ break;860860+ case C_SYNC_SOURCE:861861+ ns.pdsk = D_INCONSISTENT;862862+ dev_warn(DEV, "Implicitly set pdsk Inconsistent!\n");863863+ break;864864+ }865865+ if (os.pdsk == D_OUTDATED && ns.pdsk == D_UP_TO_DATE)866866+ dev_warn(DEV, "Implicitly set pdsk from Outdated 
to UpToDate\n");867867+ }868868+869869+ /* Connection breaks down before we finished "Negotiating" */870870+ if (ns.conn < C_CONNECTED && ns.disk == D_NEGOTIATING &&871871+ get_ldev_if_state(mdev, D_NEGOTIATING)) {872872+ if (mdev->ed_uuid == mdev->ldev->md.uuid[UI_CURRENT]) {873873+ ns.disk = mdev->new_state_tmp.disk;874874+ ns.pdsk = mdev->new_state_tmp.pdsk;875875+ } else {876876+ dev_alert(DEV, "Connection lost while negotiating, no data!\n");877877+ ns.disk = D_DISKLESS;878878+ ns.pdsk = D_UNKNOWN;879879+ }880880+ put_ldev(mdev);881881+ }882882+883883+ if (fp == FP_STONITH &&884884+ (ns.role == R_PRIMARY &&885885+ ns.conn < C_CONNECTED &&886886+ ns.pdsk > D_OUTDATED))887887+ ns.susp = 1;888888+889889+ if (ns.aftr_isp || ns.peer_isp || ns.user_isp) {890890+ if (ns.conn == C_SYNC_SOURCE)891891+ ns.conn = C_PAUSED_SYNC_S;892892+ if (ns.conn == C_SYNC_TARGET)893893+ ns.conn = C_PAUSED_SYNC_T;894894+ } else {895895+ if (ns.conn == C_PAUSED_SYNC_S)896896+ ns.conn = C_SYNC_SOURCE;897897+ if (ns.conn == C_PAUSED_SYNC_T)898898+ ns.conn = C_SYNC_TARGET;899899+ }900900+901901+ return ns;902902+}903903+904904+/* helper for __drbd_set_state */905905+static void set_ov_position(struct drbd_conf *mdev, enum drbd_conns cs)906906+{907907+ if (cs == C_VERIFY_T) {908908+ /* starting online verify from an arbitrary position909909+ * does not fit well into the existing protocol.910910+ * on C_VERIFY_T, we initialize ov_left and friends911911+ * implicitly in receive_DataRequest once the912912+ * first P_OV_REQUEST is received */913913+ mdev->ov_start_sector = ~(sector_t)0;914914+ } else {915915+ unsigned long bit = BM_SECT_TO_BIT(mdev->ov_start_sector);916916+ if (bit >= mdev->rs_total)917917+ mdev->ov_start_sector =918918+ BM_BIT_TO_SECT(mdev->rs_total - 1);919919+ mdev->ov_position = mdev->ov_start_sector;920920+ }921921+}922922+923923+/**924924+ * __drbd_set_state() - Set a new DRBD state925925+ * @mdev: DRBD device.926926+ * @ns: new state.927927+ * @flags: Flags928928+ * 
@done:	Optional completion that will be completed after after_state_ch() has finished929929+ *930930+ * Caller needs to hold req_lock, and global_state_lock. Do not call directly.931931+ */932932+int __drbd_set_state(struct drbd_conf *mdev,933933+		     union drbd_state ns, enum chg_state_flags flags,934934+		     struct completion *done)935935+{936936+	union drbd_state os;937937+	int rv = SS_SUCCESS;938938+	int warn_sync_abort = 0;939939+	struct after_state_chg_work *ascw;940940+941941+	os = mdev->state;942942+943943+	ns = sanitize_state(mdev, os, ns, &warn_sync_abort);944944+945945+	if (ns.i == os.i)946946+		return SS_NOTHING_TO_DO;947947+948948+	if (!(flags & CS_HARD)) {949949+		/* pre-state-change checks; only look at ns */950950+		/* See drbd_state_sw_errors in drbd_strings.c */951951+952952+		rv = is_valid_state(mdev, ns);953953+		if (rv < SS_SUCCESS) {954954+			/* If the old state was illegal as well, then let955955+			   this happen...*/956956+957957+			if (is_valid_state(mdev, os) == rv) {958958+				dev_err(DEV, "Considering state change from bad state.
"959959+ "Error would be: '%s'\n",960960+ drbd_set_st_err_str(rv));961961+ print_st(mdev, "old", os);962962+ print_st(mdev, "new", ns);963963+ rv = is_valid_state_transition(mdev, ns, os);964964+ }965965+ } else966966+ rv = is_valid_state_transition(mdev, ns, os);967967+ }968968+969969+ if (rv < SS_SUCCESS) {970970+ if (flags & CS_VERBOSE)971971+ print_st_err(mdev, os, ns, rv);972972+ return rv;973973+ }974974+975975+ if (warn_sync_abort)976976+ dev_warn(DEV, "Resync aborted.\n");977977+978978+ {979979+ char *pbp, pb[300];980980+ pbp = pb;981981+ *pbp = 0;982982+ PSC(role);983983+ PSC(peer);984984+ PSC(conn);985985+ PSC(disk);986986+ PSC(pdsk);987987+ PSC(susp);988988+ PSC(aftr_isp);989989+ PSC(peer_isp);990990+ PSC(user_isp);991991+ dev_info(DEV, "%s\n", pb);992992+ }993993+994994+ /* solve the race between becoming unconfigured,995995+ * worker doing the cleanup, and996996+ * admin reconfiguring us:997997+ * on (re)configure, first set CONFIG_PENDING,998998+ * then wait for a potentially exiting worker,999999+ * start the worker, and schedule one no_op.10001000+ * then proceed with configuration.10011001+ */10021002+ if (ns.disk == D_DISKLESS &&10031003+ ns.conn == C_STANDALONE &&10041004+ ns.role == R_SECONDARY &&10051005+ !test_and_set_bit(CONFIG_PENDING, &mdev->flags))10061006+ set_bit(DEVICE_DYING, &mdev->flags);10071007+10081008+ mdev->state.i = ns.i;10091009+ wake_up(&mdev->misc_wait);10101010+ wake_up(&mdev->state_wait);10111011+10121012+ /* post-state-change actions */10131013+ if (os.conn >= C_SYNC_SOURCE && ns.conn <= C_CONNECTED) {10141014+ set_bit(STOP_SYNC_TIMER, &mdev->flags);10151015+ mod_timer(&mdev->resync_timer, jiffies);10161016+ }10171017+10181018+ /* aborted verify run. 
log the last position */10191019+	if ((os.conn == C_VERIFY_S || os.conn == C_VERIFY_T) &&10201020+	    ns.conn < C_CONNECTED) {10211021+		mdev->ov_start_sector =10221022+			BM_BIT_TO_SECT(mdev->rs_total - mdev->ov_left);10231023+		dev_info(DEV, "Online Verify reached sector %llu\n",10241024+			(unsigned long long)mdev->ov_start_sector);10251025+	}10261026+10271027+	if ((os.conn == C_PAUSED_SYNC_T || os.conn == C_PAUSED_SYNC_S) &&10281028+	    (ns.conn == C_SYNC_TARGET  || ns.conn == C_SYNC_SOURCE)) {10291029+		dev_info(DEV, "Syncer continues.\n");10301030+		mdev->rs_paused += (long)jiffies-(long)mdev->rs_mark_time;10311031+		if (ns.conn == C_SYNC_TARGET) {10321032+			if (!test_and_clear_bit(STOP_SYNC_TIMER, &mdev->flags))10331033+				mod_timer(&mdev->resync_timer, jiffies);10341034+			/* This if (!test_bit) is only needed for the case10351035+			   that a device that has ceased to use its timer,10361036+			   i.e. it is already in drbd_resync_finished(), gets10371037+			   paused and resumed. */10381038+		}10391039+	}10401040+10411041+	if ((os.conn == C_SYNC_TARGET  || os.conn == C_SYNC_SOURCE) &&10421042+	    (ns.conn == C_PAUSED_SYNC_T || ns.conn == C_PAUSED_SYNC_S)) {10431043+		dev_info(DEV, "Resync suspended\n");10441044+		mdev->rs_mark_time = jiffies;10451045+		if (ns.conn == C_PAUSED_SYNC_T)10461046+			set_bit(STOP_SYNC_TIMER, &mdev->flags);10471047+	}10481048+10491049+	if (os.conn == C_CONNECTED &&10501050+	    (ns.conn == C_VERIFY_S || ns.conn == C_VERIFY_T)) {10511051+		mdev->ov_position = 0;10521052+		mdev->rs_total =10531053+		mdev->rs_mark_left = drbd_bm_bits(mdev);10541054+		if (mdev->agreed_pro_version >= 90)10551055+			set_ov_position(mdev, ns.conn);10561056+		else10571057+			mdev->ov_start_sector = 0;10581058+		mdev->ov_left = mdev->rs_total10591059+			      - BM_SECT_TO_BIT(mdev->ov_position);10601060+		mdev->rs_start     =10611061+		mdev->rs_mark_time = jiffies;10621062+		mdev->ov_last_oos_size = 0;10631063+		mdev->ov_last_oos_start = 0;10641064+10651065+		if (ns.conn == C_VERIFY_S) {10661066+			dev_info(DEV, "Starting Online
Verify from sector %llu\n",10671067+					(unsigned long long)mdev->ov_position);10681068+			mod_timer(&mdev->resync_timer, jiffies);10691069+		}10701070+	}10711071+10721072+	if (get_ldev(mdev)) {10731073+		u32 mdf = mdev->ldev->md.flags & ~(MDF_CONSISTENT|MDF_PRIMARY_IND|10741074+						 MDF_CONNECTED_IND|MDF_WAS_UP_TO_DATE|10751075+						 MDF_PEER_OUT_DATED|MDF_CRASHED_PRIMARY);10761076+10771077+		if (test_bit(CRASHED_PRIMARY, &mdev->flags))10781078+			mdf |= MDF_CRASHED_PRIMARY;10791079+		if (mdev->state.role == R_PRIMARY ||10801080+		    (mdev->state.pdsk < D_INCONSISTENT && mdev->state.peer == R_PRIMARY))10811081+			mdf |= MDF_PRIMARY_IND;10821082+		if (mdev->state.conn > C_WF_REPORT_PARAMS)10831083+			mdf |= MDF_CONNECTED_IND;10841084+		if (mdev->state.disk > D_INCONSISTENT)10851085+			mdf |= MDF_CONSISTENT;10861086+		if (mdev->state.disk > D_OUTDATED)10871087+			mdf |= MDF_WAS_UP_TO_DATE;10881088+		if (mdev->state.pdsk <= D_OUTDATED && mdev->state.pdsk >= D_INCONSISTENT)10891089+			mdf |= MDF_PEER_OUT_DATED;10901090+		if (mdf != mdev->ldev->md.flags) {10911091+			mdev->ldev->md.flags = mdf;10921092+			drbd_md_mark_dirty(mdev);10931093+		}10941094+		if (os.disk < D_CONSISTENT && ns.disk >= D_CONSISTENT)10951095+			drbd_set_ed_uuid(mdev, mdev->ldev->md.uuid[UI_CURRENT]);10961096+		put_ldev(mdev);10971097+	}10981098+10991099+	/* Peer was forced D_UP_TO_DATE & R_PRIMARY, consider resyncing */11001100+	if (os.disk == D_INCONSISTENT && os.pdsk == D_INCONSISTENT &&11011101+	    os.peer == R_SECONDARY && ns.peer == R_PRIMARY)11021102+		set_bit(CONSIDER_RESYNC, &mdev->flags);11031103+11041104+	/* Receiver should clean up after itself */11051105+	if (os.conn != C_DISCONNECTING && ns.conn == C_DISCONNECTING)11061106+		drbd_thread_stop_nowait(&mdev->receiver);11071107+11081108+	/* Now the receiver has finished cleaning up after itself, it should die */11091109+	if (os.conn != C_STANDALONE && ns.conn == C_STANDALONE)11101110+		drbd_thread_stop_nowait(&mdev->receiver);11111111+11121112+	/* Upon network failure, we need to restart the receiver.
 */11131113+	if (os.conn > C_TEAR_DOWN &&11141114+	    ns.conn <= C_TEAR_DOWN && ns.conn >= C_TIMEOUT)11151115+		drbd_thread_restart_nowait(&mdev->receiver);11161116+11171117+	ascw = kmalloc(sizeof(*ascw), GFP_ATOMIC);11181118+	if (ascw) {11191119+		ascw->os = os;11201120+		ascw->ns = ns;11211121+		ascw->flags = flags;11221122+		ascw->w.cb = w_after_state_ch;11231123+		ascw->done = done;11241124+		drbd_queue_work(&mdev->data.work, &ascw->w);11251125+	} else {11261126+		dev_warn(DEV, "Could not kmalloc an ascw\n");11271127+	}11281128+11291129+	return rv;11301130+}11311131+11321132+static int w_after_state_ch(struct drbd_conf *mdev, struct drbd_work *w, int unused)11331133+{11341134+	struct after_state_chg_work *ascw =11351135+		container_of(w, struct after_state_chg_work, w);11361136+	after_state_ch(mdev, ascw->os, ascw->ns, ascw->flags);11371137+	if (ascw->flags & CS_WAIT_COMPLETE) {11381138+		D_ASSERT(ascw->done != NULL);11391139+		complete(ascw->done);11401140+	}11411141+	kfree(ascw);11421142+11431143+	return 1;11441144+}11451145+11461146+static void abw_start_sync(struct drbd_conf *mdev, int rv)11471147+{11481148+	if (rv) {11491149+		dev_err(DEV, "Writing the bitmap failed, not starting resync.\n");11501150+		_drbd_request_state(mdev, NS(conn, C_CONNECTED), CS_VERBOSE);11511151+		return;11521152+	}11531153+11541154+	switch (mdev->state.conn) {11551155+	case C_STARTING_SYNC_T:11561156+		_drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE);11571157+		break;11581158+	case C_STARTING_SYNC_S:11591159+		drbd_start_resync(mdev, C_SYNC_SOURCE);11601160+		break;11611161+	}11621162+}11631163+11641164+/**11651165+ * after_state_ch() - Perform after state change actions that may sleep11661166+ * @mdev:	DRBD device.11671167+ * @os:		old state.11681168+ * @ns:		new state.11691169+ * @flags:	Flags11701170+ */11711171+static void after_state_ch(struct drbd_conf *mdev, union drbd_state os,11721172+			   union drbd_state ns, enum chg_state_flags flags)11731173+{11741174+	enum drbd_fencing_p
fp;11751175+11761176+ if (os.conn != C_CONNECTED && ns.conn == C_CONNECTED) {11771177+ clear_bit(CRASHED_PRIMARY, &mdev->flags);11781178+ if (mdev->p_uuid)11791179+ mdev->p_uuid[UI_FLAGS] &= ~((u64)2);11801180+ }11811181+11821182+ fp = FP_DONT_CARE;11831183+ if (get_ldev(mdev)) {11841184+ fp = mdev->ldev->dc.fencing;11851185+ put_ldev(mdev);11861186+ }11871187+11881188+ /* Inform userspace about the change... */11891189+ drbd_bcast_state(mdev, ns);11901190+11911191+ if (!(os.role == R_PRIMARY && os.disk < D_UP_TO_DATE && os.pdsk < D_UP_TO_DATE) &&11921192+ (ns.role == R_PRIMARY && ns.disk < D_UP_TO_DATE && ns.pdsk < D_UP_TO_DATE))11931193+ drbd_khelper(mdev, "pri-on-incon-degr");11941194+11951195+ /* Here we have the actions that are performed after a11961196+ state change. This function might sleep */11971197+11981198+ if (fp == FP_STONITH && ns.susp) {11991199+ /* case1: The outdate peer handler is successful:12001200+ * case2: The connection was established again: */12011201+ if ((os.pdsk > D_OUTDATED && ns.pdsk <= D_OUTDATED) ||12021202+ (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED)) {12031203+ tl_clear(mdev);12041204+ spin_lock_irq(&mdev->req_lock);12051205+ _drbd_set_state(_NS(mdev, susp, 0), CS_VERBOSE, NULL);12061206+ spin_unlock_irq(&mdev->req_lock);12071207+ }12081208+ }12091209+ /* Do not change the order of the if above and the two below... 
*/12101210+ if (os.pdsk == D_DISKLESS && ns.pdsk > D_DISKLESS) { /* attach on the peer */12111211+ drbd_send_uuids(mdev);12121212+ drbd_send_state(mdev);12131213+ }12141214+ if (os.conn != C_WF_BITMAP_S && ns.conn == C_WF_BITMAP_S)12151215+ drbd_queue_bitmap_io(mdev, &drbd_send_bitmap, NULL, "send_bitmap (WFBitMapS)");12161216+12171217+ /* Lost contact to peer's copy of the data */12181218+ if ((os.pdsk >= D_INCONSISTENT &&12191219+ os.pdsk != D_UNKNOWN &&12201220+ os.pdsk != D_OUTDATED)12211221+ && (ns.pdsk < D_INCONSISTENT ||12221222+ ns.pdsk == D_UNKNOWN ||12231223+ ns.pdsk == D_OUTDATED)) {12241224+ kfree(mdev->p_uuid);12251225+ mdev->p_uuid = NULL;12261226+ if (get_ldev(mdev)) {12271227+ if ((ns.role == R_PRIMARY || ns.peer == R_PRIMARY) &&12281228+ mdev->ldev->md.uuid[UI_BITMAP] == 0 && ns.disk >= D_UP_TO_DATE) {12291229+ drbd_uuid_new_current(mdev);12301230+ drbd_send_uuids(mdev);12311231+ }12321232+ put_ldev(mdev);12331233+ }12341234+ }12351235+12361236+ if (ns.pdsk < D_INCONSISTENT && get_ldev(mdev)) {12371237+ if (ns.peer == R_PRIMARY && mdev->ldev->md.uuid[UI_BITMAP] == 0)12381238+ drbd_uuid_new_current(mdev);12391239+12401240+ /* D_DISKLESS Peer becomes secondary */12411241+ if (os.peer == R_PRIMARY && ns.peer == R_SECONDARY)12421242+ drbd_al_to_on_disk_bm(mdev);12431243+ put_ldev(mdev);12441244+ }12451245+12461246+ /* Last part of the attaching process ... */12471247+ if (ns.conn >= C_CONNECTED &&12481248+ os.disk == D_ATTACHING && ns.disk == D_NEGOTIATING) {12491249+ kfree(mdev->p_uuid); /* We expect to receive up-to-date UUIDs soon. */12501250+ mdev->p_uuid = NULL; /* ...to not use the old ones in the mean time */12511251+ drbd_send_sizes(mdev, 0); /* to start sync... */12521252+ drbd_send_uuids(mdev);12531253+ drbd_send_state(mdev);12541254+ }12551255+12561256+ /* We want to pause/continue resync, tell peer. 
 */12571257+	if (ns.conn >= C_CONNECTED &&12581258+	    ((os.aftr_isp != ns.aftr_isp) ||12591259+	     (os.user_isp != ns.user_isp)))12601260+		drbd_send_state(mdev);12611261+12621262+	/* In case one of the isp bits got set, suspend other devices. */12631263+	if ((!os.aftr_isp && !os.peer_isp && !os.user_isp) &&12641264+	    (ns.aftr_isp || ns.peer_isp || ns.user_isp))12651265+		suspend_other_sg(mdev);12661266+12671267+	/* Make sure the peer gets informed about possible state12681268+	   changes (ISP bits) while we were in WFReportParams. */12691269+	if (os.conn == C_WF_REPORT_PARAMS && ns.conn >= C_CONNECTED)12701270+		drbd_send_state(mdev);12711271+12721272+	/* We are in the process of starting a full sync... */12731273+	if ((os.conn != C_STARTING_SYNC_T && ns.conn == C_STARTING_SYNC_T) ||12741274+	    (os.conn != C_STARTING_SYNC_S && ns.conn == C_STARTING_SYNC_S))12751275+		drbd_queue_bitmap_io(mdev, &drbd_bmio_set_n_write, &abw_start_sync, "set_n_write from StartingSync");12761276+12771277+	/* We are invalidating ourselves... */12781278+	if (os.conn < C_CONNECTED && ns.conn < C_CONNECTED &&12791279+	    os.disk > D_INCONSISTENT && ns.disk == D_INCONSISTENT)12801280+		drbd_queue_bitmap_io(mdev, &drbd_bmio_set_n_write, NULL, "set_n_write from invalidate");12811281+12821282+	if (os.disk > D_FAILED && ns.disk == D_FAILED) {12831283+		enum drbd_io_error_p eh;12841284+12851285+		eh = EP_PASS_ON;12861286+		if (get_ldev_if_state(mdev, D_FAILED)) {12871287+			eh = mdev->ldev->dc.on_io_error;12881288+			put_ldev(mdev);12891289+		}12901290+12911291+		drbd_rs_cancel_all(mdev);12921292+		/* since get_ldev() only works as long as disk>=D_INCONSISTENT,12931293+		   and it is D_DISKLESS here, local_cnt can only go down, it can12941294+		   not increase...
It will reach zero */12951295+ wait_event(mdev->misc_wait, !atomic_read(&mdev->local_cnt));12961296+ mdev->rs_total = 0;12971297+ mdev->rs_failed = 0;12981298+ atomic_set(&mdev->rs_pending_cnt, 0);12991299+13001300+ spin_lock_irq(&mdev->req_lock);13011301+ _drbd_set_state(_NS(mdev, disk, D_DISKLESS), CS_HARD, NULL);13021302+ spin_unlock_irq(&mdev->req_lock);13031303+13041304+ if (eh == EP_CALL_HELPER)13051305+ drbd_khelper(mdev, "local-io-error");13061306+ }13071307+13081308+ if (os.disk > D_DISKLESS && ns.disk == D_DISKLESS) {13091309+13101310+ if (os.disk == D_FAILED) /* && ns.disk == D_DISKLESS*/ {13111311+ if (drbd_send_state(mdev))13121312+ dev_warn(DEV, "Notified peer that my disk is broken.\n");13131313+ else13141314+ dev_err(DEV, "Sending state in drbd_io_error() failed\n");13151315+ }13161316+13171317+ lc_destroy(mdev->resync);13181318+ mdev->resync = NULL;13191319+ lc_destroy(mdev->act_log);13201320+ mdev->act_log = NULL;13211321+ __no_warn(local,13221322+ drbd_free_bc(mdev->ldev);13231323+ mdev->ldev = NULL;);13241324+13251325+ if (mdev->md_io_tmpp)13261326+ __free_page(mdev->md_io_tmpp);13271327+ }13281328+13291329+ /* Disks got bigger while they were detached */13301330+ if (ns.disk > D_NEGOTIATING && ns.pdsk > D_NEGOTIATING &&13311331+ test_and_clear_bit(RESYNC_AFTER_NEG, &mdev->flags)) {13321332+ if (ns.conn == C_CONNECTED)13331333+ resync_after_online_grow(mdev);13341334+ }13351335+13361336+ /* A resync finished or aborted, wake paused devices... */13371337+ if ((os.conn > C_CONNECTED && ns.conn <= C_CONNECTED) ||13381338+ (os.peer_isp && !ns.peer_isp) ||13391339+ (os.user_isp && !ns.user_isp))13401340+ resume_next_sg(mdev);13411341+13421342+ /* Upon network connection, we need to start the receiver */13431343+ if (os.conn == C_STANDALONE && ns.conn == C_UNCONNECTED)13441344+ drbd_thread_start(&mdev->receiver);13451345+13461346+ /* Terminate worker thread if we are unconfigured - it will be13471347+ restarted as needed... 
 */13481348+	if (ns.disk == D_DISKLESS &&13491349+	    ns.conn == C_STANDALONE &&13501350+	    ns.role == R_SECONDARY) {13511351+		if (os.aftr_isp != ns.aftr_isp)13521352+			resume_next_sg(mdev);13531353+		/* set in __drbd_set_state, unless CONFIG_PENDING was set */13541354+		if (test_bit(DEVICE_DYING, &mdev->flags))13551355+			drbd_thread_stop_nowait(&mdev->worker);13561356+	}13571357+13581358+	drbd_md_sync(mdev);13591359+}13601360+13611361+13621362+static int drbd_thread_setup(void *arg)13631363+{13641364+	struct drbd_thread *thi = (struct drbd_thread *) arg;13651365+	struct drbd_conf *mdev = thi->mdev;13661366+	unsigned long flags;13671367+	int retval;13681368+13691369+restart:13701370+	retval = thi->function(thi);13711371+13721372+	spin_lock_irqsave(&thi->t_lock, flags);13731373+13741374+	/* if the receiver has been "Exiting", the last thing it did13751375+	 * was set the conn state to "StandAlone",13761376+	 * if now a re-connect request comes in, conn state goes C_UNCONNECTED,13771377+	 * and receiver thread will be "started".13781378+	 * drbd_thread_start needs to set "Restarting" in that case.13791379+	 * t_state check and assignment needs to be within the same spinlock,13801380+	 * so either thread_start sees Exiting, and can remap to Restarting,13811381+	 * or thread_start sees None, and can proceed as normal.13821382+	 */13831383+13841384+	if (thi->t_state == Restarting) {13851385+		dev_info(DEV, "Restarting %s\n", current->comm);13861386+		thi->t_state = Running;13871387+		spin_unlock_irqrestore(&thi->t_lock, flags);13881388+		goto restart;13891389+	}13901390+13911391+	thi->task = NULL;13921392+	thi->t_state = None;13931393+	smp_mb();13941394+	complete(&thi->stop);13951395+	spin_unlock_irqrestore(&thi->t_lock, flags);13961396+13971397+	dev_info(DEV, "Terminating %s\n", current->comm);13981398+13991399+	/* Release mod reference taken when thread was started */14001400+	module_put(THIS_MODULE);14011401+	return retval;14021402+}14031403+14041404+static void drbd_thread_init(struct
	drbd_conf *mdev, struct drbd_thread *thi,
		      int (*func) (struct drbd_thread *))
{
	spin_lock_init(&thi->t_lock);
	thi->task    = NULL;
	thi->t_state = None;
	thi->function = func;
	thi->mdev = mdev;
}

int drbd_thread_start(struct drbd_thread *thi)
{
	struct drbd_conf *mdev = thi->mdev;
	struct task_struct *nt;
	unsigned long flags;

	const char *me =
		thi == &mdev->receiver ? "receiver" :
		thi == &mdev->asender  ? "asender"  :
		thi == &mdev->worker   ? "worker"   : "NONSENSE";

	/* is used from state engine doing drbd_thread_stop_nowait,
	 * while holding the req lock irqsave */
	spin_lock_irqsave(&thi->t_lock, flags);

	switch (thi->t_state) {
	case None:
		dev_info(DEV, "Starting %s thread (from %s [%d])\n",
			 me, current->comm, current->pid);

		/* Get ref on module for thread - this is released when thread exits */
		if (!try_module_get(THIS_MODULE)) {
			dev_err(DEV, "Failed to get module reference in drbd_thread_start\n");
			spin_unlock_irqrestore(&thi->t_lock, flags);
			return FALSE;
		}

		init_completion(&thi->stop);
		D_ASSERT(thi->task == NULL);
		thi->reset_cpu_mask = 1;
		thi->t_state = Running;
		spin_unlock_irqrestore(&thi->t_lock, flags);
		flush_signals(current); /* otherw.
					   may get -ERESTARTNOINTR */

		nt = kthread_create(drbd_thread_setup, (void *) thi,
				    "drbd%d_%s", mdev_to_minor(mdev), me);

		if (IS_ERR(nt)) {
			dev_err(DEV, "Couldn't start thread\n");

			module_put(THIS_MODULE);
			return FALSE;
		}
		spin_lock_irqsave(&thi->t_lock, flags);
		thi->task = nt;
		thi->t_state = Running;
		spin_unlock_irqrestore(&thi->t_lock, flags);
		wake_up_process(nt);
		break;
	case Exiting:
		thi->t_state = Restarting;
		dev_info(DEV, "Restarting %s thread (from %s [%d])\n",
			 me, current->comm, current->pid);
		/* fall through */
	case Running:
	case Restarting:
	default:
		spin_unlock_irqrestore(&thi->t_lock, flags);
		break;
	}

	return TRUE;
}


void _drbd_thread_stop(struct drbd_thread *thi, int restart, int wait)
{
	unsigned long flags;

	enum drbd_thread_state ns = restart ?
		Restarting : Exiting;

	/* may be called from state engine, holding the req lock irqsave */
	spin_lock_irqsave(&thi->t_lock, flags);

	if (thi->t_state == None) {
		spin_unlock_irqrestore(&thi->t_lock, flags);
		if (restart)
			drbd_thread_start(thi);
		return;
	}

	if (thi->t_state != ns) {
		if (thi->task == NULL) {
			spin_unlock_irqrestore(&thi->t_lock, flags);
			return;
		}

		thi->t_state = ns;
		smp_mb();
		init_completion(&thi->stop);
		if (thi->task != current)
			force_sig(DRBD_SIGKILL, thi->task);

	}

	spin_unlock_irqrestore(&thi->t_lock, flags);

	if (wait)
		wait_for_completion(&thi->stop);
}

#ifdef CONFIG_SMP
/**
 * drbd_calc_cpu_mask() - Generate CPU masks, spread over all CPUs
 * @mdev:	DRBD device.
 *
 * Forces all threads of a device onto the same CPU. This is beneficial for
 * DRBD's performance. May be overwritten by user's configuration.
 */
void drbd_calc_cpu_mask(struct drbd_conf *mdev)
{
	int ord, cpu;

	/* user override.
	 */
	if (cpumask_weight(mdev->cpu_mask))
		return;

	ord = mdev_to_minor(mdev) % cpumask_weight(cpu_online_mask);
	for_each_online_cpu(cpu) {
		if (ord-- == 0) {
			cpumask_set_cpu(cpu, mdev->cpu_mask);
			return;
		}
	}
	/* should not be reached */
	cpumask_setall(mdev->cpu_mask);
}

/**
 * drbd_thread_current_set_cpu() - modifies the cpu mask of the _current_ thread
 * @mdev:	DRBD device.
 *
 * call in the "main loop" of _all_ threads, no need for any mutex, current won't die
 * prematurely.
 */
void drbd_thread_current_set_cpu(struct drbd_conf *mdev)
{
	struct task_struct *p = current;
	struct drbd_thread *thi =
		p == mdev->asender.task  ? &mdev->asender  :
		p == mdev->receiver.task ? &mdev->receiver :
		p == mdev->worker.task   ? &mdev->worker   :
		NULL;
	ERR_IF(thi == NULL)
		return;
	if (!thi->reset_cpu_mask)
		return;
	thi->reset_cpu_mask = 0;
	set_cpus_allowed_ptr(p, mdev->cpu_mask);
}
#endif

/* the appropriate socket mutex must be held already */
int _drbd_send_cmd(struct drbd_conf *mdev, struct socket *sock,
		   enum drbd_packets cmd, struct p_header *h,
		   size_t size, unsigned msg_flags)
{
	int sent, ok;

	ERR_IF(!h) return FALSE;
	ERR_IF(!size) return FALSE;

	h->magic   = BE_DRBD_MAGIC;
	h->command = cpu_to_be16(cmd);
	h->length  = cpu_to_be16(size-sizeof(struct p_header));

	trace_drbd_packet(mdev, sock, 0, (void *)h, __FILE__, __LINE__);
	sent = drbd_send(mdev, sock, h, size, msg_flags);

	ok = (sent == size);
	if (!ok)
		dev_err(DEV, "short sent %s size=%d sent=%d\n",
			cmdname(cmd), (int)size, sent);
	return ok;
}

/* don't pass the socket. we may only look at it
 * when we hold the appropriate socket mutex.
 */
int drbd_send_cmd(struct drbd_conf *mdev, int use_data_socket,
		  enum drbd_packets cmd, struct p_header *h, size_t size)
{
	int ok = 0;
	struct socket *sock;

	if (use_data_socket) {
		mutex_lock(&mdev->data.mutex);
		sock = mdev->data.socket;
	} else {
		mutex_lock(&mdev->meta.mutex);
		sock = mdev->meta.socket;
	}

	/* drbd_disconnect() could have called drbd_free_sock()
	 * while we were waiting in down()... */
	if (likely(sock != NULL))
		ok = _drbd_send_cmd(mdev, sock, cmd, h, size, 0);

	if (use_data_socket)
		mutex_unlock(&mdev->data.mutex);
	else
		mutex_unlock(&mdev->meta.mutex);
	return ok;
}

int drbd_send_cmd2(struct drbd_conf *mdev, enum drbd_packets cmd, char *data,
		   size_t size)
{
	struct p_header h;
	int ok;

	h.magic   = BE_DRBD_MAGIC;
	h.command = cpu_to_be16(cmd);
	h.length  = cpu_to_be16(size);

	if (!drbd_get_data_sock(mdev))
		return 0;

	trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&h, __FILE__, __LINE__);

	ok = (sizeof(h) ==
		drbd_send(mdev, mdev->data.socket, &h, sizeof(h), 0));
	ok = ok && (size ==
		drbd_send(mdev, mdev->data.socket, data, size, 0));

	drbd_put_data_sock(mdev);

	return ok;
}

int drbd_send_sync_param(struct drbd_conf *mdev, struct syncer_conf *sc)
{
	struct p_rs_param_89 *p;
	struct socket *sock;
	int size, rv;
	const int apv = mdev->agreed_pro_version;

	size = apv <= 87 ?
		sizeof(struct p_rs_param)
		: apv == 88 ? sizeof(struct p_rs_param)
			+ strlen(mdev->sync_conf.verify_alg) + 1
		: /* 89 */ sizeof(struct p_rs_param_89);

	/* used from admin command context and receiver/worker context.
	 * to avoid kmalloc, grab the socket right here,
	 * then use the pre-allocated sbuf there */
	mutex_lock(&mdev->data.mutex);
	sock = mdev->data.socket;

	if (likely(sock != NULL)) {
		enum drbd_packets cmd = apv >= 89 ? P_SYNC_PARAM89 : P_SYNC_PARAM;

		p = &mdev->data.sbuf.rs_param_89;

		/* initialize verify_alg and csums_alg */
		memset(p->verify_alg, 0, 2 * SHARED_SECRET_MAX);

		p->rate = cpu_to_be32(sc->rate);

		if (apv >= 88)
			strcpy(p->verify_alg, mdev->sync_conf.verify_alg);
		if (apv >= 89)
			strcpy(p->csums_alg, mdev->sync_conf.csums_alg);

		rv = _drbd_send_cmd(mdev, sock, cmd, &p->head, size, 0);
	} else
		rv = 0; /* not ok */

	mutex_unlock(&mdev->data.mutex);

	return rv;
}

int drbd_send_protocol(struct drbd_conf *mdev)
{
	struct p_protocol *p;
	int size, rv;

	size = sizeof(struct p_protocol);

	if (mdev->agreed_pro_version >= 87)
		size += strlen(mdev->net_conf->integrity_alg) + 1;

	/* we must not recurse into our own queue,
	 * as that is blocked during handshake */
	p = kmalloc(size, GFP_NOIO);
	if (p == NULL)
		return 0;

	p->protocol      = cpu_to_be32(mdev->net_conf->wire_protocol);
	p->after_sb_0p   = cpu_to_be32(mdev->net_conf->after_sb_0p);
	p->after_sb_1p   = cpu_to_be32(mdev->net_conf->after_sb_1p);
	p->after_sb_2p   = cpu_to_be32(mdev->net_conf->after_sb_2p);
	p->want_lose     =
		cpu_to_be32(mdev->net_conf->want_lose);
	p->two_primaries = cpu_to_be32(mdev->net_conf->two_primaries);

	if (mdev->agreed_pro_version >= 87)
		strcpy(p->integrity_alg, mdev->net_conf->integrity_alg);

	rv = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_PROTOCOL,
			   (struct p_header *)p, size);
	kfree(p);
	return rv;
}

int _drbd_send_uuids(struct drbd_conf *mdev, u64 uuid_flags)
{
	struct p_uuids p;
	int i;

	if (!get_ldev_if_state(mdev, D_NEGOTIATING))
		return 1;

	for (i = UI_CURRENT; i < UI_SIZE; i++)
		p.uuid[i] = mdev->ldev ? cpu_to_be64(mdev->ldev->md.uuid[i]) : 0;

	mdev->comm_bm_set = drbd_bm_total_weight(mdev);
	p.uuid[UI_SIZE] = cpu_to_be64(mdev->comm_bm_set);
	uuid_flags |= mdev->net_conf->want_lose ? 1 : 0;
	uuid_flags |= test_bit(CRASHED_PRIMARY, &mdev->flags) ? 2 : 0;
	uuid_flags |= mdev->new_state_tmp.disk == D_INCONSISTENT ?
		4 : 0;
	p.uuid[UI_FLAGS] = cpu_to_be64(uuid_flags);

	put_ldev(mdev);

	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_UUIDS,
			     (struct p_header *)&p, sizeof(p));
}

int drbd_send_uuids(struct drbd_conf *mdev)
{
	return _drbd_send_uuids(mdev, 0);
}

int drbd_send_uuids_skip_initial_sync(struct drbd_conf *mdev)
{
	return _drbd_send_uuids(mdev, 8);
}


int drbd_send_sync_uuid(struct drbd_conf *mdev, u64 val)
{
	struct p_rs_uuid p;

	p.uuid = cpu_to_be64(val);

	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_SYNC_UUID,
			     (struct p_header *)&p, sizeof(p));
}

int drbd_send_sizes(struct drbd_conf *mdev, int trigger_reply)
{
	struct p_sizes p;
	sector_t d_size, u_size;
	int q_order_type;
	int ok;

	if (get_ldev_if_state(mdev, D_NEGOTIATING)) {
		D_ASSERT(mdev->ldev->backing_bdev);
		d_size = drbd_get_max_capacity(mdev->ldev);
		u_size = mdev->ldev->dc.disk_size;
		q_order_type = drbd_queue_order_type(mdev);
		p.queue_order_type = cpu_to_be32(drbd_queue_order_type(mdev));
		put_ldev(mdev);
	} else {
		d_size = 0;
		u_size = 0;
		q_order_type = QUEUE_ORDERED_NONE;
	}

	p.d_size = cpu_to_be64(d_size);
	p.u_size = cpu_to_be64(u_size);
	p.c_size = cpu_to_be64(trigger_reply ?
		0 : drbd_get_capacity(mdev->this_bdev));
	p.max_segment_size = cpu_to_be32(queue_max_segment_size(mdev->rq_queue));
	p.queue_order_type = cpu_to_be32(q_order_type);

	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_SIZES,
			   (struct p_header *)&p, sizeof(p));
	return ok;
}

/**
 * drbd_send_state() - Sends the drbd state to the peer
 * @mdev:	DRBD device.
 */
int drbd_send_state(struct drbd_conf *mdev)
{
	struct socket *sock;
	struct p_state p;
	int ok = 0;

	/* Grab state lock so we won't send state if we're in the middle
	 * of a cluster wide state change on another thread */
	drbd_state_lock(mdev);

	mutex_lock(&mdev->data.mutex);

	p.state = cpu_to_be32(mdev->state.i); /* Within the send mutex */
	sock = mdev->data.socket;

	if (likely(sock != NULL)) {
		ok = _drbd_send_cmd(mdev, sock, P_STATE,
				    (struct p_header *)&p, sizeof(p), 0);
	}

	mutex_unlock(&mdev->data.mutex);

	drbd_state_unlock(mdev);
	return ok;
}

int drbd_send_state_req(struct drbd_conf *mdev,
			union drbd_state mask, union drbd_state val)
{
	struct p_req_state p;

	p.mask    = cpu_to_be32(mask.i);
	p.val     = cpu_to_be32(val.i);

	return drbd_send_cmd(mdev, USE_DATA_SOCKET, P_STATE_CHG_REQ,
			     (struct p_header *)&p, sizeof(p));
}

int drbd_send_sr_reply(struct drbd_conf *mdev, int retcode)
{
	struct p_req_state_reply p;

	p.retcode = cpu_to_be32(retcode);

	return drbd_send_cmd(mdev, USE_META_SOCKET, P_STATE_CHG_REPLY,
			     (struct p_header *)&p, sizeof(p));
}

int fill_bitmap_rle_bits(struct drbd_conf *mdev,
			 struct
	p_compressed_bm *p,
			 struct bm_xfer_ctx *c)
{
	struct bitstream bs;
	unsigned long plain_bits;
	unsigned long tmp;
	unsigned long rl;
	unsigned len;
	unsigned toggle;
	int bits;

	/* may we use this feature? */
	if ((mdev->sync_conf.use_rle == 0) ||
		(mdev->agreed_pro_version < 90))
			return 0;

	if (c->bit_offset >= c->bm_bits)
		return 0; /* nothing to do. */

	/* use at most thus many bytes */
	bitstream_init(&bs, p->code, BM_PACKET_VLI_BYTES_MAX, 0);
	memset(p->code, 0, BM_PACKET_VLI_BYTES_MAX);
	/* plain bits covered in this code string */
	plain_bits = 0;

	/* p->encoding & 0x80 stores whether the first run length is set.
	 * bit offset is implicit.
	 * start with toggle == 2 to be able to tell the first iteration */
	toggle = 2;

	/* see how many plain bits we can stuff into one packet
	 * using RLE and VLI. */
	do {
		tmp = (toggle == 0) ? _drbd_bm_find_next_zero(mdev, c->bit_offset)
				    : _drbd_bm_find_next(mdev, c->bit_offset);
		if (tmp == -1UL)
			tmp = c->bm_bits;
		rl = tmp - c->bit_offset;

		if (toggle == 2) { /* first iteration */
			if (rl == 0) {
				/* the first checked bit was set,
				 * store start value, */
				DCBP_set_start(p, 1);
				/* but skip encoding of zero run length */
				toggle = !toggle;
				continue;
			}
			DCBP_set_start(p, 0);
		}

		/* paranoia: catch zero runlength.
		 * can only happen if bitmap is modified while we scan it.
		 */
		if (rl == 0) {
			dev_err(DEV, "unexpected zero runlength while encoding bitmap "
			    "t:%u bo:%lu\n", toggle, c->bit_offset);
			return -1;
		}

		bits = vli_encode_bits(&bs, rl);
		if (bits == -ENOBUFS) /* buffer full */
			break;
		if (bits <= 0) {
			dev_err(DEV, "error while encoding bitmap: %d\n", bits);
			return 0;
		}

		toggle = !toggle;
		plain_bits += rl;
		c->bit_offset = tmp;
	} while (c->bit_offset < c->bm_bits);

	len = bs.cur.b - p->code + !!bs.cur.bit;

	if (plain_bits < (len << 3)) {
		/* incompressible with this method.
		 * we need to rewind both word and bit position. */
		c->bit_offset -= plain_bits;
		bm_xfer_ctx_bit_to_word_offset(c);
		c->bit_offset = c->word_offset * BITS_PER_LONG;
		return 0;
	}

	/* RLE + VLI was able to compress it just fine.
	 * update c->word_offset.
	 */
	bm_xfer_ctx_bit_to_word_offset(c);

	/* store pad_bits */
	DCBP_set_pad_bits(p, (8 - bs.cur.bit) & 0x7);

	return len;
}

enum { OK, FAILED, DONE }
send_bitmap_rle_or_plain(struct drbd_conf *mdev,
	struct p_header *h, struct bm_xfer_ctx *c)
{
	struct p_compressed_bm *p = (void*)h;
	unsigned long num_words;
	int len;
	int ok;

	len = fill_bitmap_rle_bits(mdev, p, c);

	if (len < 0)
		return FAILED;

	if (len) {
		DCBP_set_code(p, RLE_VLI_Bits);
		ok = _drbd_send_cmd(mdev, mdev->data.socket, P_COMPRESSED_BITMAP, h,
			sizeof(*p) + len, 0);

		c->packets[0]++;
		c->bytes[0] += sizeof(*p) + len;

		if (c->bit_offset >= c->bm_bits)
			len = 0; /* DONE */
	} else {
		/* was not compressible.
		 * send a buffer full of plain text bits instead. */
		num_words = min_t(size_t, BM_PACKET_WORDS, c->bm_words - c->word_offset);
		len = num_words * sizeof(long);
		if (len)
			drbd_bm_get_lel(mdev, c->word_offset, num_words, (unsigned long*)h->payload);
		ok = _drbd_send_cmd(mdev, mdev->data.socket, P_BITMAP,
				   h, sizeof(struct p_header) + len, 0);
		c->word_offset += num_words;
		c->bit_offset = c->word_offset * BITS_PER_LONG;

		c->packets[1]++;
		c->bytes[1] += sizeof(struct p_header) + len;

		if (c->bit_offset > c->bm_bits)
			c->bit_offset = c->bm_bits;
	}
	ok = ok ? ((len == 0) ?
		DONE : OK) : FAILED;

	if (ok == DONE)
		INFO_bm_xfer_stats(mdev, "send", c);
	return ok;
}

/* See the comment at receive_bitmap() */
int _drbd_send_bitmap(struct drbd_conf *mdev)
{
	struct bm_xfer_ctx c;
	struct p_header *p;
	int ret;

	ERR_IF(!mdev->bitmap) return FALSE;

	/* maybe we should use some per thread scratch page,
	 * and allocate that during initial device creation? */
	p = (struct p_header *) __get_free_page(GFP_NOIO);
	if (!p) {
		dev_err(DEV, "failed to allocate one page buffer in %s\n", __func__);
		return FALSE;
	}

	if (get_ldev(mdev)) {
		if (drbd_md_test_flag(mdev->ldev, MDF_FULL_SYNC)) {
			dev_info(DEV, "Writing the whole bitmap, MDF_FullSync was set.\n");
			drbd_bm_set_all(mdev);
			if (drbd_bm_write(mdev)) {
				/* write_bm did fail! Leave full sync flag set in Meta P_DATA
				 * but otherwise process as per normal - need to tell other
				 * side that a full resync is required!
				 */
				dev_err(DEV, "Failed to write bitmap to disk!\n");
			} else {
				drbd_md_clear_flag(mdev, MDF_FULL_SYNC);
				drbd_md_sync(mdev);
			}
		}
		put_ldev(mdev);
	}

	c = (struct bm_xfer_ctx) {
		.bm_bits = drbd_bm_bits(mdev),
		.bm_words = drbd_bm_words(mdev),
	};

	do {
		ret = send_bitmap_rle_or_plain(mdev, p, &c);
	} while (ret == OK);

	free_page((unsigned long) p);
	return (ret == DONE);
}

int drbd_send_bitmap(struct drbd_conf *mdev)
{
	int err;

	if (!drbd_get_data_sock(mdev))
		return -1;
	err = !_drbd_send_bitmap(mdev);
	drbd_put_data_sock(mdev);
	return err;
}

int drbd_send_b_ack(struct drbd_conf *mdev, u32 barrier_nr, u32 set_size)
{
	int ok;
	struct p_barrier_ack p;

	p.barrier  = barrier_nr;
	p.set_size = cpu_to_be32(set_size);

	if (mdev->state.conn < C_CONNECTED)
		return FALSE;
	ok = drbd_send_cmd(mdev, USE_META_SOCKET, P_BARRIER_ACK,
			(struct p_header *)&p, sizeof(p));
	return ok;
}

/**
 * _drbd_send_ack() - Sends an ack packet
 * @mdev:	DRBD device.
 * @cmd:	Packet command code.
 * @sector:	sector, needs to be in big endian byte order
 * @blksize:	size in byte, needs to be in big endian byte order
 * @block_id:	Id, big endian byte order
 */
static int _drbd_send_ack(struct drbd_conf *mdev, enum drbd_packets cmd,
			  u64 sector,
			  u32 blksize,
			  u64 block_id)
{
	int ok;
	struct p_block_ack p;

	p.sector   = sector;
	p.block_id = block_id;
	p.blksize  = blksize;
	p.seq_num  = cpu_to_be32(atomic_add_return(1,
		&mdev->packet_seq));

	if (!mdev->meta.socket || mdev->state.conn < C_CONNECTED)
		return FALSE;
	ok = drbd_send_cmd(mdev, USE_META_SOCKET, cmd,
				(struct p_header *)&p, sizeof(p));
	return ok;
}

int drbd_send_ack_dp(struct drbd_conf *mdev, enum drbd_packets cmd,
		     struct p_data *dp)
{
	const int header_size = sizeof(struct p_data)
			      - sizeof(struct p_header);
	int data_size  = ((struct p_header *)dp)->length - header_size;

	return _drbd_send_ack(mdev, cmd, dp->sector, cpu_to_be32(data_size),
			      dp->block_id);
}

int drbd_send_ack_rp(struct drbd_conf *mdev, enum drbd_packets cmd,
		     struct p_block_req *rp)
{
	return _drbd_send_ack(mdev, cmd, rp->sector, rp->blksize, rp->block_id);
}

/**
 * drbd_send_ack() - Sends an ack packet
 * @mdev:	DRBD device.
 * @cmd:	Packet command code.
 * @e:		Epoch entry.
 */
int drbd_send_ack(struct drbd_conf *mdev,
	enum drbd_packets cmd, struct drbd_epoch_entry *e)
{
	return _drbd_send_ack(mdev, cmd,
			      cpu_to_be64(e->sector),
			      cpu_to_be32(e->size),
			      e->block_id);
}

/* This function misuses the block_id field to signal if the blocks
 * are in sync or not.
 */
int drbd_send_ack_ex(struct drbd_conf *mdev, enum drbd_packets cmd,
		     sector_t sector, int blksize, u64 block_id)
{
	return _drbd_send_ack(mdev, cmd,
			      cpu_to_be64(sector),
			      cpu_to_be32(blksize),
			      cpu_to_be64(block_id));
}

int drbd_send_drequest(struct drbd_conf *mdev, int cmd,
		       sector_t sector, int size, u64 block_id)
{
	int ok;
	struct p_block_req p;

	p.sector   = cpu_to_be64(sector);
	p.block_id = block_id;
	p.blksize  = cpu_to_be32(size);

	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, cmd,
				(struct p_header *)&p, sizeof(p));
	return ok;
}

int drbd_send_drequest_csum(struct drbd_conf *mdev,
			    sector_t sector, int size,
			    void *digest, int digest_size,
			    enum drbd_packets cmd)
{
	int ok;
	struct p_block_req p;

	p.sector   = cpu_to_be64(sector);
	p.block_id = BE_DRBD_MAGIC + 0xbeef;
	p.blksize  = cpu_to_be32(size);

	p.head.magic   = BE_DRBD_MAGIC;
	p.head.command = cpu_to_be16(cmd);
	p.head.length  = cpu_to_be16(sizeof(p) - sizeof(struct p_header) + digest_size);

	mutex_lock(&mdev->data.mutex);

	ok = (sizeof(p) == drbd_send(mdev, mdev->data.socket, &p, sizeof(p), 0));
	ok = ok && (digest_size == drbd_send(mdev, mdev->data.socket, digest, digest_size, 0));

	mutex_unlock(&mdev->data.mutex);

	return ok;
}

int drbd_send_ov_request(struct drbd_conf *mdev, sector_t sector, int size)
{
	int ok;
	struct p_block_req p;

	p.sector   = cpu_to_be64(sector);
	p.block_id = BE_DRBD_MAGIC + 0xbabe;
	p.blksize  = cpu_to_be32(size);

	ok = drbd_send_cmd(mdev, USE_DATA_SOCKET, P_OV_REQUEST,
			   (struct p_header
	*)&p, sizeof(p));
	return ok;
}

/* called on sndtimeo
 * returns FALSE if we should retry,
 * TRUE if we think connection is dead
 */
static int we_should_drop_the_connection(struct drbd_conf *mdev, struct socket *sock)
{
	int drop_it;
	/* long elapsed = (long)(jiffies - mdev->last_received); */

	drop_it =   mdev->meta.socket == sock
		|| !mdev->asender.task
		|| get_t_state(&mdev->asender) != Running
		|| mdev->state.conn < C_CONNECTED;

	if (drop_it)
		return TRUE;

	drop_it = !--mdev->ko_count;
	if (!drop_it) {
		dev_err(DEV, "[%s/%d] sock_sendmsg time expired, ko = %u\n",
		       current->comm, current->pid, mdev->ko_count);
		request_ping(mdev);
	}

	return drop_it; /* && (mdev->state == R_PRIMARY) */;
}

/* The idea of sendpage seems to be to put some kind of reference
 * to the page into the skb, and to hand it over to the NIC. In
 * this process get_page() gets called.
 *
 * As soon as the page was really sent over the network put_page()
 * gets called by some part of the network layer. [ NIC driver? ]
 *
 * [ get_page() / put_page() increment/decrement the count. If count
 *   reaches 0 the page will be freed.
 ]
 *
 * This works nicely with pages from FSs.
 * But this means that in protocol A we might signal IO completion too early!
 *
 * In order not to corrupt data during a resync we must make sure
 * that we do not reuse our own buffer pages (EEs) too early, therefore
 * we have the net_ee list.
 *
 * XFS seems to have problems, still, it submits pages with page_count == 0!
 * As a workaround, we disable sendpage on pages
 * with page_count == 0 or PageSlab.
 */
static int _drbd_no_send_page(struct drbd_conf *mdev, struct page *page,
		   int offset, size_t size)
{
	int sent = drbd_send(mdev, mdev->data.socket, kmap(page) + offset, size, 0);
	kunmap(page);
	if (sent == size)
		mdev->send_cnt += size>>9;
	return sent == size;
}

static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
		    int offset, size_t size)
{
	mm_segment_t oldfs = get_fs();
	int sent, ok;
	int len = size;

	/* e.g. XFS meta- & log-data is in slab pages, which have a
	 * page_count of 0 and/or have PageSlab() set.
	 * we cannot use send_page for those, as that does get_page();
	 * put_page(); and would cause either a VM_BUG directly, or
	 * __page_cache_release a page that would actually still be referenced
	 * by someone, leading to some obscure delayed Oops somewhere else.
	 */
	if (disable_sendpage || (page_count(page) < 1) || PageSlab(page))
		return _drbd_no_send_page(mdev, page, offset, size);

	drbd_update_congested(mdev);
	set_fs(KERNEL_DS);
	do {
		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
							offset, len,
							MSG_NOSIGNAL);
		if (sent == -EAGAIN) {
			if (we_should_drop_the_connection(mdev,
							  mdev->data.socket))
				break;
			else
				continue;
		}
		if (sent <= 0) {
			dev_warn(DEV, "%s: size=%d len=%d sent=%d\n",
			     __func__, (int)size, len, sent);
			break;
		}
		len    -= sent;
		offset += sent;
	} while (len > 0 /* THINK && mdev->cstate >= C_CONNECTED*/);
	set_fs(oldfs);
	clear_bit(NET_CONGESTED, &mdev->flags);

	ok = (len == 0);
	if (likely(ok))
		mdev->send_cnt += size>>9;
	return ok;
}

static int _drbd_send_bio(struct drbd_conf *mdev, struct bio *bio)
{
	struct bio_vec *bvec;
	int i;
	__bio_for_each_segment(bvec, bio, i, 0) {
		if (!_drbd_no_send_page(mdev, bvec->bv_page,
				     bvec->bv_offset, bvec->bv_len))
			return 0;
	}
	return 1;
}

static int _drbd_send_zc_bio(struct drbd_conf *mdev, struct bio *bio)
{
	struct bio_vec *bvec;
	int i;
	__bio_for_each_segment(bvec, bio, i, 0) {
		if (!_drbd_send_page(mdev, bvec->bv_page,
				     bvec->bv_offset, bvec->bv_len))
			return 0;
	}

	return 1;
}

/* Used to send write requests
 * R_PRIMARY -> Peer	(P_DATA)
 */
int drbd_send_dblock(struct drbd_conf *mdev, struct drbd_request *req)
{
	int ok = 1;
	struct p_data p;
	unsigned int dp_flags = 0;
	void *dgb;
	int
	dgs;

	if (!drbd_get_data_sock(mdev))
		return 0;

	dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_w_tfm) ?
		crypto_hash_digestsize(mdev->integrity_w_tfm) : 0;

	p.head.magic   = BE_DRBD_MAGIC;
	p.head.command = cpu_to_be16(P_DATA);
	p.head.length  =
		cpu_to_be16(sizeof(p) - sizeof(struct p_header) + dgs + req->size);

	p.sector   = cpu_to_be64(req->sector);
	p.block_id = (unsigned long)req;
	p.seq_num  = cpu_to_be32(req->seq_num =
				 atomic_add_return(1, &mdev->packet_seq));
	dp_flags = 0;

	/* NOTE: no need to check if barriers supported here as we would
	 *       not pass the test in make_request_common in that case
	 */
	if (bio_rw_flagged(req->master_bio, BIO_RW_BARRIER)) {
		dev_err(DEV, "ASSERT FAILED would have set DP_HARDBARRIER\n");
		/* dp_flags |= DP_HARDBARRIER; */
	}
	if (bio_rw_flagged(req->master_bio, BIO_RW_SYNCIO))
		dp_flags |= DP_RW_SYNC;
	/* for now handle SYNCIO and UNPLUG
	 * as if they still were one and the same flag */
	if (bio_rw_flagged(req->master_bio, BIO_RW_UNPLUG))
		dp_flags |= DP_RW_SYNC;
	if (mdev->state.conn >= C_SYNC_SOURCE &&
	    mdev->state.conn <= C_PAUSED_SYNC_T)
		dp_flags |= DP_MAY_SET_IN_SYNC;

	p.dp_flags = cpu_to_be32(dp_flags);
	trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&p, __FILE__, __LINE__);
	set_bit(UNPLUG_REMOTE, &mdev->flags);
	ok = (sizeof(p) ==
		drbd_send(mdev, mdev->data.socket, &p, sizeof(p), MSG_MORE));
	if (ok && dgs) {
		dgb = mdev->int_dig_out;
		drbd_csum(mdev, mdev->integrity_w_tfm, req->master_bio, dgb);
		ok = drbd_send(mdev, mdev->data.socket, dgb, dgs, MSG_MORE);
	}
	if (ok) {
		if (mdev->net_conf->wire_protocol == DRBD_PROT_A)
ok = _drbd_send_bio(mdev, req->master_bio);23752375+ else23762376+ ok = _drbd_send_zc_bio(mdev, req->master_bio);23772377+ }23782378+23792379+ drbd_put_data_sock(mdev);23802380+ return ok;23812381+}23822382+23832383+/* answer packet, used to send data back for read requests:23842384+ * Peer -> (diskless) R_PRIMARY (P_DATA_REPLY)23852385+ * C_SYNC_SOURCE -> C_SYNC_TARGET (P_RS_DATA_REPLY)23862386+ */23872387+int drbd_send_block(struct drbd_conf *mdev, enum drbd_packets cmd,23882388+ struct drbd_epoch_entry *e)23892389+{23902390+ int ok;23912391+ struct p_data p;23922392+ void *dgb;23932393+ int dgs;23942394+23952395+ dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_w_tfm) ?23962396+ crypto_hash_digestsize(mdev->integrity_w_tfm) : 0;23972397+23982398+ p.head.magic = BE_DRBD_MAGIC;23992399+ p.head.command = cpu_to_be16(cmd);24002400+ p.head.length =24012401+ cpu_to_be16(sizeof(p) - sizeof(struct p_header) + dgs + e->size);24022402+24032403+ p.sector = cpu_to_be64(e->sector);24042404+ p.block_id = e->block_id;24052405+ /* p.seq_num = 0; No sequence numbers here.. 
*/24062406+24072407+ /* Only called by our kernel thread.24082408+ * This one may be interrupted by DRBD_SIG and/or DRBD_SIGKILL24092409+ * in response to admin command or module unload.24102410+ */24112411+ if (!drbd_get_data_sock(mdev))24122412+ return 0;24132413+24142414+ trace_drbd_packet(mdev, mdev->data.socket, 0, (void *)&p, __FILE__, __LINE__);24152415+ ok = sizeof(p) == drbd_send(mdev, mdev->data.socket, &p,24162416+ sizeof(p), MSG_MORE);24172417+ if (ok && dgs) {24182418+ dgb = mdev->int_dig_out;24192419+ drbd_csum(mdev, mdev->integrity_w_tfm, e->private_bio, dgb);24202420+ ok = drbd_send(mdev, mdev->data.socket, dgb, dgs, MSG_MORE);24212421+ }24222422+ if (ok)24232423+ ok = _drbd_send_zc_bio(mdev, e->private_bio);24242424+24252425+ drbd_put_data_sock(mdev);24262426+ return ok;24272427+}24282428+24292429+/*24302430+ drbd_send distinguishes two cases:24312431+24322432+ Packets sent via the data socket "sock"24332433+ and packets sent via the meta data socket "msock"24342434+24352435+ sock msock24362436+ -----------------+-------------------------+------------------------------24372437+ timeout conf.timeout / 2 conf.timeout / 224382438+ timeout action send a ping via msock Abort communication24392439+ and close all sockets24402440+*/24412441+24422442+/*24432443+ * you must have down()ed the appropriate [m]sock_mutex elsewhere!24442444+ */24452445+int drbd_send(struct drbd_conf *mdev, struct socket *sock,24462446+ void *buf, size_t size, unsigned msg_flags)24472447+{24482448+ struct kvec iov;24492449+ struct msghdr msg;24502450+ int rv, sent = 0;24512451+24522452+ if (!sock)24532453+ return -1000;24542454+24552455+ /* THINK if (signal_pending) return ... ? 
*/24562456+24572457+ iov.iov_base = buf;24582458+ iov.iov_len = size;24592459+24602460+ msg.msg_name = NULL;24612461+ msg.msg_namelen = 0;24622462+ msg.msg_control = NULL;24632463+ msg.msg_controllen = 0;24642464+ msg.msg_flags = msg_flags | MSG_NOSIGNAL;24652465+24662466+ if (sock == mdev->data.socket) {24672467+ mdev->ko_count = mdev->net_conf->ko_count;24682468+ drbd_update_congested(mdev);24692469+ }24702470+ do {24712471+ /* STRANGE24722472+ * tcp_sendmsg does _not_ use its size parameter at all ?24732473+ *24742474+ * -EAGAIN on timeout, -EINTR on signal.24752475+ */24762476+/* THINK24772477+ * do we need to block DRBD_SIG if sock == &meta.socket ??24782478+ * otherwise wake_asender() might interrupt some send_*Ack !24792479+ */24802480+ rv = kernel_sendmsg(sock, &msg, &iov, 1, size);24812481+ if (rv == -EAGAIN) {24822482+ if (we_should_drop_the_connection(mdev, sock))24832483+ break;24842484+ else24852485+ continue;24862486+ }24872487+ D_ASSERT(rv != 0);24882488+ if (rv == -EINTR) {24892489+ flush_signals(current);24902490+ rv = 0;24912491+ }24922492+ if (rv < 0)24932493+ break;24942494+ sent += rv;24952495+ iov.iov_base += rv;24962496+ iov.iov_len -= rv;24972497+ } while (sent < size);24982498+24992499+ if (sock == mdev->data.socket)25002500+ clear_bit(NET_CONGESTED, &mdev->flags);25012501+25022502+ if (rv <= 0) {25032503+ if (rv != -EAGAIN) {25042504+ dev_err(DEV, "%s_sendmsg returned %d\n",25052505+ sock == mdev->meta.socket ? 
"msock" : "sock",25062506+ rv);25072507+ drbd_force_state(mdev, NS(conn, C_BROKEN_PIPE));25082508+ } else25092509+ drbd_force_state(mdev, NS(conn, C_TIMEOUT));25102510+ }25112511+25122512+ return sent;25132513+}25142514+25152515+static int drbd_open(struct block_device *bdev, fmode_t mode)25162516+{25172517+ struct drbd_conf *mdev = bdev->bd_disk->private_data;25182518+ unsigned long flags;25192519+ int rv = 0;25202520+25212521+ spin_lock_irqsave(&mdev->req_lock, flags);25222522+ /* to have a stable mdev->state.role25232523+ * and no race with updating open_cnt */25242524+25252525+ if (mdev->state.role != R_PRIMARY) {25262526+ if (mode & FMODE_WRITE)25272527+ rv = -EROFS;25282528+ else if (!allow_oos)25292529+ rv = -EMEDIUMTYPE;25302530+ }25312531+25322532+ if (!rv)25332533+ mdev->open_cnt++;25342534+ spin_unlock_irqrestore(&mdev->req_lock, flags);25352535+25362536+ return rv;25372537+}25382538+25392539+static int drbd_release(struct gendisk *gd, fmode_t mode)25402540+{25412541+ struct drbd_conf *mdev = gd->private_data;25422542+ mdev->open_cnt--;25432543+ return 0;25442544+}25452545+25462546+static void drbd_unplug_fn(struct request_queue *q)25472547+{25482548+ struct drbd_conf *mdev = q->queuedata;25492549+25502550+ trace_drbd_unplug(mdev, "got unplugged");25512551+25522552+ /* unplug FIRST */25532553+ spin_lock_irq(q->queue_lock);25542554+ blk_remove_plug(q);25552555+ spin_unlock_irq(q->queue_lock);25562556+25572557+ /* only if connected */25582558+ spin_lock_irq(&mdev->req_lock);25592559+ if (mdev->state.pdsk >= D_INCONSISTENT && mdev->state.conn >= C_CONNECTED) {25602560+ D_ASSERT(mdev->state.role == R_PRIMARY);25612561+ if (test_and_clear_bit(UNPLUG_REMOTE, &mdev->flags)) {25622562+ /* add to the data.work queue,25632563+ * unless already queued.25642564+ * XXX this might be a good addition to drbd_queue_work25652565+ * anyways, to detect "double queuing" ... 
*/25662566+ if (list_empty(&mdev->unplug_work.list))25672567+ drbd_queue_work(&mdev->data.work,25682568+ &mdev->unplug_work);25692569+ }25702570+ }25712571+ spin_unlock_irq(&mdev->req_lock);25722572+25732573+ if (mdev->state.disk >= D_INCONSISTENT)25742574+ drbd_kick_lo(mdev);25752575+}25762576+25772577+static void drbd_set_defaults(struct drbd_conf *mdev)25782578+{25792579+ mdev->sync_conf.after = DRBD_AFTER_DEF;25802580+ mdev->sync_conf.rate = DRBD_RATE_DEF;25812581+ mdev->sync_conf.al_extents = DRBD_AL_EXTENTS_DEF;25822582+ mdev->state = (union drbd_state) {25832583+ { .role = R_SECONDARY,25842584+ .peer = R_UNKNOWN,25852585+ .conn = C_STANDALONE,25862586+ .disk = D_DISKLESS,25872587+ .pdsk = D_UNKNOWN,25882588+ .susp = 025892589+ } };25902590+}25912591+25922592+void drbd_init_set_defaults(struct drbd_conf *mdev)25932593+{25942594+ /* the memset(,0,) did most of this.25952595+ * note: only assignments, no allocation in here */25962596+25972597+ drbd_set_defaults(mdev);25982598+25992599+ /* for now, we do NOT yet support it,26002600+ * even though we start some framework26012601+ * to eventually support barriers */26022602+ set_bit(NO_BARRIER_SUPP, &mdev->flags);26032603+26042604+ atomic_set(&mdev->ap_bio_cnt, 0);26052605+ atomic_set(&mdev->ap_pending_cnt, 0);26062606+ atomic_set(&mdev->rs_pending_cnt, 0);26072607+ atomic_set(&mdev->unacked_cnt, 0);26082608+ atomic_set(&mdev->local_cnt, 0);26092609+ atomic_set(&mdev->net_cnt, 0);26102610+ atomic_set(&mdev->packet_seq, 0);26112611+ atomic_set(&mdev->pp_in_use, 0);26122612+26132613+ mutex_init(&mdev->md_io_mutex);26142614+ mutex_init(&mdev->data.mutex);26152615+ mutex_init(&mdev->meta.mutex);26162616+ sema_init(&mdev->data.work.s, 0);26172617+ sema_init(&mdev->meta.work.s, 0);26182618+ mutex_init(&mdev->state_mutex);26192619+26202620+ spin_lock_init(&mdev->data.work.q_lock);26212621+ spin_lock_init(&mdev->meta.work.q_lock);26222622+26232623+ spin_lock_init(&mdev->al_lock);26242624+ 
spin_lock_init(&mdev->req_lock);26252625+ spin_lock_init(&mdev->peer_seq_lock);26262626+ spin_lock_init(&mdev->epoch_lock);26272627+26282628+ INIT_LIST_HEAD(&mdev->active_ee);26292629+ INIT_LIST_HEAD(&mdev->sync_ee);26302630+ INIT_LIST_HEAD(&mdev->done_ee);26312631+ INIT_LIST_HEAD(&mdev->read_ee);26322632+ INIT_LIST_HEAD(&mdev->net_ee);26332633+ INIT_LIST_HEAD(&mdev->resync_reads);26342634+ INIT_LIST_HEAD(&mdev->data.work.q);26352635+ INIT_LIST_HEAD(&mdev->meta.work.q);26362636+ INIT_LIST_HEAD(&mdev->resync_work.list);26372637+ INIT_LIST_HEAD(&mdev->unplug_work.list);26382638+ INIT_LIST_HEAD(&mdev->md_sync_work.list);26392639+ INIT_LIST_HEAD(&mdev->bm_io_work.w.list);26402640+ mdev->resync_work.cb = w_resync_inactive;26412641+ mdev->unplug_work.cb = w_send_write_hint;26422642+ mdev->md_sync_work.cb = w_md_sync;26432643+ mdev->bm_io_work.w.cb = w_bitmap_io;26442644+ init_timer(&mdev->resync_timer);26452645+ init_timer(&mdev->md_sync_timer);26462646+ mdev->resync_timer.function = resync_timer_fn;26472647+ mdev->resync_timer.data = (unsigned long) mdev;26482648+ mdev->md_sync_timer.function = md_sync_timer_fn;26492649+ mdev->md_sync_timer.data = (unsigned long) mdev;26502650+26512651+ init_waitqueue_head(&mdev->misc_wait);26522652+ init_waitqueue_head(&mdev->state_wait);26532653+ init_waitqueue_head(&mdev->ee_wait);26542654+ init_waitqueue_head(&mdev->al_wait);26552655+ init_waitqueue_head(&mdev->seq_wait);26562656+26572657+ drbd_thread_init(mdev, &mdev->receiver, drbdd_init);26582658+ drbd_thread_init(mdev, &mdev->worker, drbd_worker);26592659+ drbd_thread_init(mdev, &mdev->asender, drbd_asender);26602660+26612661+ mdev->agreed_pro_version = PRO_VERSION_MAX;26622662+ mdev->write_ordering = WO_bio_barrier;26632663+ mdev->resync_wenr = LC_FREE;26642664+}26652665+26662666+void drbd_mdev_cleanup(struct drbd_conf *mdev)26672667+{26682668+ if (mdev->receiver.t_state != None)26692669+ dev_err(DEV, "ASSERT FAILED: receiver t_state == %d expected 0.\n",26702670+ 
mdev->receiver.t_state);26712671+26722672+ /* no need to lock it, I'm the only thread alive */26732673+ if (atomic_read(&mdev->current_epoch->epoch_size) != 0)26742674+ dev_err(DEV, "epoch_size:%d\n", atomic_read(&mdev->current_epoch->epoch_size));26752675+ mdev->al_writ_cnt =26762676+ mdev->bm_writ_cnt =26772677+ mdev->read_cnt =26782678+ mdev->recv_cnt =26792679+ mdev->send_cnt =26802680+ mdev->writ_cnt =26812681+ mdev->p_size =26822682+ mdev->rs_start =26832683+ mdev->rs_total =26842684+ mdev->rs_failed =26852685+ mdev->rs_mark_left =26862686+ mdev->rs_mark_time = 0;26872687+ D_ASSERT(mdev->net_conf == NULL);26882688+26892689+ drbd_set_my_capacity(mdev, 0);26902690+ if (mdev->bitmap) {26912691+ /* maybe never allocated. */26922692+ drbd_bm_resize(mdev, 0);26932693+ drbd_bm_cleanup(mdev);26942694+ }26952695+26962696+ drbd_free_resources(mdev);26972697+26982698+ /*26992699+ * currently we drbd_init_ee only on module load, so27002700+ * we may do drbd_release_ee only on module unload!27012701+ */27022702+ D_ASSERT(list_empty(&mdev->active_ee));27032703+ D_ASSERT(list_empty(&mdev->sync_ee));27042704+ D_ASSERT(list_empty(&mdev->done_ee));27052705+ D_ASSERT(list_empty(&mdev->read_ee));27062706+ D_ASSERT(list_empty(&mdev->net_ee));27072707+ D_ASSERT(list_empty(&mdev->resync_reads));27082708+ D_ASSERT(list_empty(&mdev->data.work.q));27092709+ D_ASSERT(list_empty(&mdev->meta.work.q));27102710+ D_ASSERT(list_empty(&mdev->resync_work.list));27112711+ D_ASSERT(list_empty(&mdev->unplug_work.list));27122712+27132713+}27142714+27152715+27162716+static void drbd_destroy_mempools(void)27172717+{27182718+ struct page *page;27192719+27202720+ while (drbd_pp_pool) {27212721+ page = drbd_pp_pool;27222722+ drbd_pp_pool = (struct page *)page_private(page);27232723+ __free_page(page);27242724+ drbd_pp_vacant--;27252725+ }27262726+27272727+ /* D_ASSERT(atomic_read(&drbd_pp_vacant)==0); */27282728+27292729+ if (drbd_ee_mempool)27302730+ mempool_destroy(drbd_ee_mempool);27312731+ if 
(drbd_request_mempool)27322732+ mempool_destroy(drbd_request_mempool);27332733+ if (drbd_ee_cache)27342734+ kmem_cache_destroy(drbd_ee_cache);27352735+ if (drbd_request_cache)27362736+ kmem_cache_destroy(drbd_request_cache);27372737+ if (drbd_bm_ext_cache)27382738+ kmem_cache_destroy(drbd_bm_ext_cache);27392739+ if (drbd_al_ext_cache)27402740+ kmem_cache_destroy(drbd_al_ext_cache);27412741+27422742+ drbd_ee_mempool = NULL;27432743+ drbd_request_mempool = NULL;27442744+ drbd_ee_cache = NULL;27452745+ drbd_request_cache = NULL;27462746+ drbd_bm_ext_cache = NULL;27472747+ drbd_al_ext_cache = NULL;27482748+27492749+ return;27502750+}27512751+27522752+static int drbd_create_mempools(void)27532753+{27542754+ struct page *page;27552755+ const int number = (DRBD_MAX_SEGMENT_SIZE/PAGE_SIZE) * minor_count;27562756+ int i;27572757+27582758+ /* prepare our caches and mempools */27592759+ drbd_request_mempool = NULL;27602760+ drbd_ee_cache = NULL;27612761+ drbd_request_cache = NULL;27622762+ drbd_bm_ext_cache = NULL;27632763+ drbd_al_ext_cache = NULL;27642764+ drbd_pp_pool = NULL;27652765+27662766+ /* caches */27672767+ drbd_request_cache = kmem_cache_create(27682768+ "drbd_req", sizeof(struct drbd_request), 0, 0, NULL);27692769+ if (drbd_request_cache == NULL)27702770+ goto Enomem;27712771+27722772+ drbd_ee_cache = kmem_cache_create(27732773+ "drbd_ee", sizeof(struct drbd_epoch_entry), 0, 0, NULL);27742774+ if (drbd_ee_cache == NULL)27752775+ goto Enomem;27762776+27772777+ drbd_bm_ext_cache = kmem_cache_create(27782778+ "drbd_bm", sizeof(struct bm_extent), 0, 0, NULL);27792779+ if (drbd_bm_ext_cache == NULL)27802780+ goto Enomem;27812781+27822782+ drbd_al_ext_cache = kmem_cache_create(27832783+ "drbd_al", sizeof(struct lc_element), 0, 0, NULL);27842784+ if (drbd_al_ext_cache == NULL)27852785+ goto Enomem;27862786+27872787+ /* mempools */27882788+ drbd_request_mempool = mempool_create(number,27892789+ mempool_alloc_slab, mempool_free_slab, drbd_request_cache);27902790+ if 
(drbd_request_mempool == NULL)27912791+ goto Enomem;27922792+27932793+ drbd_ee_mempool = mempool_create(number,27942794+ mempool_alloc_slab, mempool_free_slab, drbd_ee_cache);27952795+ if (drbd_request_mempool == NULL)27962796+ goto Enomem;27972797+27982798+ /* drbd's page pool */27992799+ spin_lock_init(&drbd_pp_lock);28002800+28012801+ for (i = 0; i < number; i++) {28022802+ page = alloc_page(GFP_HIGHUSER);28032803+ if (!page)28042804+ goto Enomem;28052805+ set_page_private(page, (unsigned long)drbd_pp_pool);28062806+ drbd_pp_pool = page;28072807+ }28082808+ drbd_pp_vacant = number;28092809+28102810+ return 0;28112811+28122812+Enomem:28132813+ drbd_destroy_mempools(); /* in case we allocated some */28142814+ return -ENOMEM;28152815+}28162816+28172817+static int drbd_notify_sys(struct notifier_block *this, unsigned long code,28182818+ void *unused)28192819+{28202820+ /* just so we have it. you never know what interesting things we28212821+ * might want to do here some day...28222822+ */28232823+28242824+ return NOTIFY_DONE;28252825+}28262826+28272827+static struct notifier_block drbd_notifier = {28282828+ .notifier_call = drbd_notify_sys,28292829+};28302830+28312831+static void drbd_release_ee_lists(struct drbd_conf *mdev)28322832+{28332833+ int rr;28342834+28352835+ rr = drbd_release_ee(mdev, &mdev->active_ee);28362836+ if (rr)28372837+ dev_err(DEV, "%d EEs in active list found!\n", rr);28382838+28392839+ rr = drbd_release_ee(mdev, &mdev->sync_ee);28402840+ if (rr)28412841+ dev_err(DEV, "%d EEs in sync list found!\n", rr);28422842+28432843+ rr = drbd_release_ee(mdev, &mdev->read_ee);28442844+ if (rr)28452845+ dev_err(DEV, "%d EEs in read list found!\n", rr);28462846+28472847+ rr = drbd_release_ee(mdev, &mdev->done_ee);28482848+ if (rr)28492849+ dev_err(DEV, "%d EEs in done list found!\n", rr);28502850+28512851+ rr = drbd_release_ee(mdev, &mdev->net_ee);28522852+ if (rr)28532853+ dev_err(DEV, "%d EEs in net list found!\n", rr);28542854+}28552855+28562856+/* 
caution. no locking.28572857+ * currently only used from module cleanup code. */28582858+static void drbd_delete_device(unsigned int minor)28592859+{28602860+ struct drbd_conf *mdev = minor_to_mdev(minor);28612861+28622862+ if (!mdev)28632863+ return;28642864+28652865+ /* paranoia asserts */28662866+ if (mdev->open_cnt != 0)28672867+ dev_err(DEV, "open_cnt = %d in %s:%u", mdev->open_cnt,28682868+ __FILE__ , __LINE__);28692869+28702870+ ERR_IF (!list_empty(&mdev->data.work.q)) {28712871+ struct list_head *lp;28722872+ list_for_each(lp, &mdev->data.work.q) {28732873+ dev_err(DEV, "lp = %p\n", lp);28742874+ }28752875+ };28762876+ /* end paranoia asserts */28772877+28782878+ del_gendisk(mdev->vdisk);28792879+28802880+ /* cleanup stuff that may have been allocated during28812881+ * device (re-)configuration or state changes */28822882+28832883+ if (mdev->this_bdev)28842884+ bdput(mdev->this_bdev);28852885+28862886+ drbd_free_resources(mdev);28872887+28882888+ drbd_release_ee_lists(mdev);28892889+28902890+ /* should be free'd on disconnect? 
*/28912891+ kfree(mdev->ee_hash);28922892+ /*28932893+ mdev->ee_hash_s = 0;28942894+ mdev->ee_hash = NULL;28952895+ */28962896+28972897+ lc_destroy(mdev->act_log);28982898+ lc_destroy(mdev->resync);28992899+29002900+ kfree(mdev->p_uuid);29012901+ /* mdev->p_uuid = NULL; */29022902+29032903+ kfree(mdev->int_dig_out);29042904+ kfree(mdev->int_dig_in);29052905+ kfree(mdev->int_dig_vv);29062906+29072907+ /* cleanup the rest that has been29082908+ * allocated from drbd_new_device29092909+ * and actually free the mdev itself */29102910+ drbd_free_mdev(mdev);29112911+}29122912+29132913+static void drbd_cleanup(void)29142914+{29152915+ unsigned int i;29162916+29172917+ unregister_reboot_notifier(&drbd_notifier);29182918+29192919+ drbd_nl_cleanup();29202920+29212921+ if (minor_table) {29222922+ if (drbd_proc)29232923+ remove_proc_entry("drbd", NULL);29242924+ i = minor_count;29252925+ while (i--)29262926+ drbd_delete_device(i);29272927+ drbd_destroy_mempools();29282928+ }29292929+29302930+ kfree(minor_table);29312931+29322932+ unregister_blkdev(DRBD_MAJOR, "drbd");29332933+29342934+ printk(KERN_INFO "drbd: module cleanup done.\n");29352935+}29362936+29372937+/**29382938+ * drbd_congested() - Callback for pdflush29392939+ * @congested_data: User data29402940+ * @bdi_bits: Bits pdflush is currently interested in29412941+ *29422942+ * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.29432943+ */29442944+static int drbd_congested(void *congested_data, int bdi_bits)29452945+{29462946+ struct drbd_conf *mdev = congested_data;29472947+ struct request_queue *q;29482948+ char reason = '-';29492949+ int r = 0;29502950+29512951+ if (!__inc_ap_bio_cond(mdev)) {29522952+ /* DRBD has frozen IO */29532953+ r = bdi_bits;29542954+ reason = 'd';29552955+ goto out;29562956+ }29572957+29582958+ if (get_ldev(mdev)) {29592959+ q = bdev_get_queue(mdev->ldev->backing_bdev);29602960+ r = bdi_congested(&q->backing_dev_info, bdi_bits);29612961+ put_ldev(mdev);29622962+ 
if (r)29632963+ reason = 'b';29642964+ }29652965+29662966+ if (bdi_bits & (1 << BDI_async_congested) && test_bit(NET_CONGESTED, &mdev->flags)) {29672967+ r |= (1 << BDI_async_congested);29682968+ reason = reason == 'b' ? 'a' : 'n';29692969+ }29702970+29712971+out:29722972+ mdev->congestion_reason = reason;29732973+ return r;29742974+}29752975+29762976+struct drbd_conf *drbd_new_device(unsigned int minor)29772977+{29782978+ struct drbd_conf *mdev;29792979+ struct gendisk *disk;29802980+ struct request_queue *q;29812981+29822982+ /* GFP_KERNEL, we are outside of all write-out paths */29832983+ mdev = kzalloc(sizeof(struct drbd_conf), GFP_KERNEL);29842984+ if (!mdev)29852985+ return NULL;29862986+ if (!zalloc_cpumask_var(&mdev->cpu_mask, GFP_KERNEL))29872987+ goto out_no_cpumask;29882988+29892989+ mdev->minor = minor;29902990+29912991+ drbd_init_set_defaults(mdev);29922992+29932993+ q = blk_alloc_queue(GFP_KERNEL);29942994+ if (!q)29952995+ goto out_no_q;29962996+ mdev->rq_queue = q;29972997+ q->queuedata = mdev;29982998+ blk_queue_max_segment_size(q, DRBD_MAX_SEGMENT_SIZE);29992999+30003000+ disk = alloc_disk(1);30013001+ if (!disk)30023002+ goto out_no_disk;30033003+ mdev->vdisk = disk;30043004+30053005+ set_disk_ro(disk, TRUE);30063006+30073007+ disk->queue = q;30083008+ disk->major = DRBD_MAJOR;30093009+ disk->first_minor = minor;30103010+ disk->fops = &drbd_ops;30113011+ sprintf(disk->disk_name, "drbd%d", minor);30123012+ disk->private_data = mdev;30133013+30143014+ mdev->this_bdev = bdget(MKDEV(DRBD_MAJOR, minor));30153015+ /* we have no partitions. we contain only ourselves. 
*/30163016+ mdev->this_bdev->bd_contains = mdev->this_bdev;30173017+30183018+ q->backing_dev_info.congested_fn = drbd_congested;30193019+ q->backing_dev_info.congested_data = mdev;30203020+30213021+ blk_queue_make_request(q, drbd_make_request_26);30223022+ blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);30233023+ blk_queue_merge_bvec(q, drbd_merge_bvec);30243024+ q->queue_lock = &mdev->req_lock; /* needed since we use */30253025+ /* plugging on a queue, that actually has no requests! */30263026+ q->unplug_fn = drbd_unplug_fn;30273027+30283028+ mdev->md_io_page = alloc_page(GFP_KERNEL);30293029+ if (!mdev->md_io_page)30303030+ goto out_no_io_page;30313031+30323032+ if (drbd_bm_init(mdev))30333033+ goto out_no_bitmap;30343034+ /* no need to lock access, we are still initializing this minor device. */30353035+ if (!tl_init(mdev))30363036+ goto out_no_tl;30373037+30383038+ mdev->app_reads_hash = kzalloc(APP_R_HSIZE*sizeof(void *), GFP_KERNEL);30393039+ if (!mdev->app_reads_hash)30403040+ goto out_no_app_reads;30413041+30423042+ mdev->current_epoch = kzalloc(sizeof(struct drbd_epoch), GFP_KERNEL);30433043+ if (!mdev->current_epoch)30443044+ goto out_no_epoch;30453045+30463046+ INIT_LIST_HEAD(&mdev->current_epoch->list);30473047+ mdev->epochs = 1;30483048+30493049+ return mdev;30503050+30513051+/* out_whatever_else:30523052+ kfree(mdev->current_epoch); */30533053+out_no_epoch:30543054+ kfree(mdev->app_reads_hash);30553055+out_no_app_reads:30563056+ tl_cleanup(mdev);30573057+out_no_tl:30583058+ drbd_bm_cleanup(mdev);30593059+out_no_bitmap:30603060+ __free_page(mdev->md_io_page);30613061+out_no_io_page:30623062+ put_disk(disk);30633063+out_no_disk:30643064+ blk_cleanup_queue(q);30653065+out_no_q:30663066+ free_cpumask_var(mdev->cpu_mask);30673067+out_no_cpumask:30683068+ kfree(mdev);30693069+ return NULL;30703070+}30713071+30723072+/* counterpart of drbd_new_device.30733073+ * last part of drbd_delete_device. 
*/30743074+void drbd_free_mdev(struct drbd_conf *mdev)30753075+{30763076+ kfree(mdev->current_epoch);30773077+ kfree(mdev->app_reads_hash);30783078+ tl_cleanup(mdev);30793079+ if (mdev->bitmap) /* should no longer be there. */30803080+ drbd_bm_cleanup(mdev);30813081+ __free_page(mdev->md_io_page);30823082+ put_disk(mdev->vdisk);30833083+ blk_cleanup_queue(mdev->rq_queue);30843084+ free_cpumask_var(mdev->cpu_mask);30853085+ kfree(mdev);30863086+}30873087+30883088+30893089+int __init drbd_init(void)30903090+{30913091+ int err;30923092+30933093+ if (sizeof(struct p_handshake) != 80) {30943094+ printk(KERN_ERR30953095+ "drbd: never change the size or layout "30963096+ "of the HandShake packet.\n");30973097+ return -EINVAL;30983098+ }30993099+31003100+ if (1 > minor_count || minor_count > 255) {31013101+ printk(KERN_ERR31023102+ "drbd: invalid minor_count (%d)\n", minor_count);31033103+#ifdef MODULE31043104+ return -EINVAL;31053105+#else31063106+ minor_count = 8;31073107+#endif31083108+ }31093109+31103110+ err = drbd_nl_init();31113111+ if (err)31123112+ return err;31133113+31143114+ err = register_blkdev(DRBD_MAJOR, "drbd");31153115+ if (err) {31163116+ printk(KERN_ERR31173117+ "drbd: unable to register block device major %d\n",31183118+ DRBD_MAJOR);31193119+ return err;31203120+ }31213121+31223122+ register_reboot_notifier(&drbd_notifier);31233123+31243124+ /*31253125+ * allocate all necessary structs31263126+ */31273127+ err = -ENOMEM;31283128+31293129+ init_waitqueue_head(&drbd_pp_wait);31303130+31313131+ drbd_proc = NULL; /* play safe for drbd_cleanup */31323132+ minor_table = kzalloc(sizeof(struct drbd_conf *)*minor_count,31333133+ GFP_KERNEL);31343134+ if (!minor_table)31353135+ goto Enomem;31363136+31373137+ err = drbd_create_mempools();31383138+ if (err)31393139+ goto Enomem;31403140+31413141+ drbd_proc = proc_create("drbd", S_IFREG | S_IRUGO , NULL, &drbd_proc_fops);31423142+ if (!drbd_proc) {31433143+ printk(KERN_ERR "drbd: unable to register proc 
file\n");31443144+ goto Enomem;31453145+ }31463146+31473147+ rwlock_init(&global_state_lock);31483148+31493149+ printk(KERN_INFO "drbd: initialized. "31503150+ "Version: " REL_VERSION " (api:%d/proto:%d-%d)\n",31513151+ API_VERSION, PRO_VERSION_MIN, PRO_VERSION_MAX);31523152+ printk(KERN_INFO "drbd: %s\n", drbd_buildtag());31533153+ printk(KERN_INFO "drbd: registered as block device major %d\n",31543154+ DRBD_MAJOR);31553155+ printk(KERN_INFO "drbd: minor_table @ 0x%p\n", minor_table);31563156+31573157+ return 0; /* Success! */31583158+31593159+Enomem:31603160+ drbd_cleanup();31613161+ if (err == -ENOMEM)31623162+ /* currently always the case */31633163+ printk(KERN_ERR "drbd: ran out of memory\n");31643164+ else31653165+ printk(KERN_ERR "drbd: initialization failure\n");31663166+ return err;31673167+}31683168+31693169+void drbd_free_bc(struct drbd_backing_dev *ldev)31703170+{31713171+ if (ldev == NULL)31723172+ return;31733173+31743174+ bd_release(ldev->backing_bdev);31753175+ bd_release(ldev->md_bdev);31763176+31773177+ fput(ldev->lo_file);31783178+ fput(ldev->md_file);31793179+31803180+ kfree(ldev);31813181+}31823182+31833183+void drbd_free_sock(struct drbd_conf *mdev)31843184+{31853185+ if (mdev->data.socket) {31863186+ kernel_sock_shutdown(mdev->data.socket, SHUT_RDWR);31873187+ sock_release(mdev->data.socket);31883188+ mdev->data.socket = NULL;31893189+ }31903190+ if (mdev->meta.socket) {31913191+ kernel_sock_shutdown(mdev->meta.socket, SHUT_RDWR);31923192+ sock_release(mdev->meta.socket);31933193+ mdev->meta.socket = NULL;31943194+ }31953195+}31963196+31973197+31983198+void drbd_free_resources(struct drbd_conf *mdev)31993199+{32003200+ crypto_free_hash(mdev->csums_tfm);32013201+ mdev->csums_tfm = NULL;32023202+ crypto_free_hash(mdev->verify_tfm);32033203+ mdev->verify_tfm = NULL;32043204+ crypto_free_hash(mdev->cram_hmac_tfm);32053205+ mdev->cram_hmac_tfm = NULL;32063206+ crypto_free_hash(mdev->integrity_w_tfm);32073207+ mdev->integrity_w_tfm = 
NULL;32083208+ crypto_free_hash(mdev->integrity_r_tfm);32093209+ mdev->integrity_r_tfm = NULL;32103210+32113211+ drbd_free_sock(mdev);32123212+32133213+ __no_warn(local,32143214+ drbd_free_bc(mdev->ldev);32153215+ mdev->ldev = NULL;);32163216+}32173217+32183218+/* meta data management */32193219+32203220+struct meta_data_on_disk {32213221+ u64 la_size; /* last agreed size. */32223222+ u64 uuid[UI_SIZE]; /* UUIDs. */32233223+ u64 device_uuid;32243224+ u64 reserved_u64_1;32253225+ u32 flags; /* MDF */32263226+ u32 magic;32273227+ u32 md_size_sect;32283228+ u32 al_offset; /* offset to this block */32293229+ u32 al_nr_extents; /* important for restoring the AL */32303230+ /* `-- act_log->nr_elements <-- sync_conf.al_extents */32313231+ u32 bm_offset; /* offset to the bitmap, from here */32323232+ u32 bm_bytes_per_bit; /* BM_BLOCK_SIZE */32333233+ u32 reserved_u32[4];32343234+32353235+} __packed;32363236+32373237+/**32383238+ * drbd_md_sync() - Writes the meta data super block if the MD_DIRTY flag bit is set32393239+ * @mdev: DRBD device.32403240+ */32413241+void drbd_md_sync(struct drbd_conf *mdev)32423242+{32433243+ struct meta_data_on_disk *buffer;32443244+ sector_t sector;32453245+ int i;32463246+32473247+ if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))32483248+ return;32493249+ del_timer(&mdev->md_sync_timer);32503250+32513251+ /* We use here D_FAILED and not D_ATTACHING because we try to write32523252+ * metadata even if we detach due to a disk failure! 
*/32533253+ if (!get_ldev_if_state(mdev, D_FAILED))32543254+ return;32553255+32563256+ trace_drbd_md_io(mdev, WRITE, mdev->ldev);32573257+32583258+ mutex_lock(&mdev->md_io_mutex);32593259+ buffer = (struct meta_data_on_disk *)page_address(mdev->md_io_page);32603260+ memset(buffer, 0, 512);32613261+32623262+ buffer->la_size = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));32633263+ for (i = UI_CURRENT; i < UI_SIZE; i++)32643264+ buffer->uuid[i] = cpu_to_be64(mdev->ldev->md.uuid[i]);32653265+ buffer->flags = cpu_to_be32(mdev->ldev->md.flags);32663266+ buffer->magic = cpu_to_be32(DRBD_MD_MAGIC);32673267+32683268+ buffer->md_size_sect = cpu_to_be32(mdev->ldev->md.md_size_sect);32693269+ buffer->al_offset = cpu_to_be32(mdev->ldev->md.al_offset);32703270+ buffer->al_nr_extents = cpu_to_be32(mdev->act_log->nr_elements);32713271+ buffer->bm_bytes_per_bit = cpu_to_be32(BM_BLOCK_SIZE);32723272+ buffer->device_uuid = cpu_to_be64(mdev->ldev->md.device_uuid);32733273+32743274+ buffer->bm_offset = cpu_to_be32(mdev->ldev->md.bm_offset);32753275+32763276+ D_ASSERT(drbd_md_ss__(mdev, mdev->ldev) == mdev->ldev->md.md_offset);32773277+ sector = mdev->ldev->md.md_offset;32783278+32793279+ if (drbd_md_sync_page_io(mdev, mdev->ldev, sector, WRITE)) {32803280+ clear_bit(MD_DIRTY, &mdev->flags);32813281+ } else {32823282+ /* this was a try anyways ... */32833283+ dev_err(DEV, "meta data update failed!\n");32843284+32853285+ drbd_chk_io_error(mdev, 1, TRUE);32863286+ }32873287+32883288+ /* Update mdev->ldev->md.la_size_sect,32893289+ * since we updated it on metadata. 
 */
+	mdev->ldev->md.la_size_sect = drbd_get_capacity(mdev->this_bdev);
+
+	mutex_unlock(&mdev->md_io_mutex);
+	put_ldev(mdev);
+}
+
+/**
+ * drbd_md_read() - Reads in the meta data super block
+ * @mdev:	DRBD device.
+ * @bdev:	Device from which the meta data should be read in.
+ *
+ * Return 0 (NO_ERROR) on success, and an enum drbd_ret_codes in case
+ * something goes wrong.  Currently only: ERR_IO_MD_DISK, ERR_MD_INVALID.
+ */
+int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)
+{
+	struct meta_data_on_disk *buffer;
+	int i, rv = NO_ERROR;
+
+	if (!get_ldev_if_state(mdev, D_ATTACHING))
+		return ERR_IO_MD_DISK;
+
+	trace_drbd_md_io(mdev, READ, bdev);
+
+	mutex_lock(&mdev->md_io_mutex);
+	buffer = (struct meta_data_on_disk *)page_address(mdev->md_io_page);
+
+	if (!drbd_md_sync_page_io(mdev, bdev, bdev->md.md_offset, READ)) {
+		/* NOTE: can't do normal error processing here as this is
+		   called BEFORE disk is attached */
+		dev_err(DEV, "Error while reading metadata.\n");
+		rv = ERR_IO_MD_DISK;
+		goto err;
+	}
+
+	if (be32_to_cpu(buffer->magic) != DRBD_MD_MAGIC) {
+		dev_err(DEV, "Error while reading metadata, magic not found.\n");
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->al_offset) != bdev->md.al_offset) {
+		dev_err(DEV, "unexpected al_offset: %d (expected %d)\n",
+		    be32_to_cpu(buffer->al_offset), bdev->md.al_offset);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->bm_offset) != bdev->md.bm_offset) {
+		dev_err(DEV, "unexpected bm_offset: %d (expected %d)\n",
+		    be32_to_cpu(buffer->bm_offset), bdev->md.bm_offset);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+	if (be32_to_cpu(buffer->md_size_sect) != bdev->md.md_size_sect) {
+		dev_err(DEV, "unexpected md_size: %u (expected %u)\n",
+		    be32_to_cpu(buffer->md_size_sect), bdev->md.md_size_sect);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+
+	if (be32_to_cpu(buffer->bm_bytes_per_bit) != BM_BLOCK_SIZE) {
+		dev_err(DEV, "unexpected bm_bytes_per_bit: %u (expected %u)\n",
+		    be32_to_cpu(buffer->bm_bytes_per_bit), BM_BLOCK_SIZE);
+		rv = ERR_MD_INVALID;
+		goto err;
+	}
+
+	bdev->md.la_size_sect = be64_to_cpu(buffer->la_size);
+	for (i = UI_CURRENT; i < UI_SIZE; i++)
+		bdev->md.uuid[i] = be64_to_cpu(buffer->uuid[i]);
+	bdev->md.flags = be32_to_cpu(buffer->flags);
+	mdev->sync_conf.al_extents = be32_to_cpu(buffer->al_nr_extents);
+	bdev->md.device_uuid = be64_to_cpu(buffer->device_uuid);
+
+	if (mdev->sync_conf.al_extents < 7)
+		mdev->sync_conf.al_extents = 127;
+
+ err:
+	mutex_unlock(&mdev->md_io_mutex);
+	put_ldev(mdev);
+
+	return rv;
+}
+
+/**
+ * drbd_md_mark_dirty() - Mark meta data super block as dirty
+ * @mdev:	DRBD device.
+ *
+ * Call this function if you change anything that should be written to
+ * the meta-data super block.  This function sets MD_DIRTY, and starts a
+ * timer that ensures that within five seconds you have to call drbd_md_sync().
+ */
+void drbd_md_mark_dirty(struct drbd_conf *mdev)
+{
+	set_bit(MD_DIRTY, &mdev->flags);
+	mod_timer(&mdev->md_sync_timer, jiffies + 5*HZ);
+}
+
+
+static void drbd_uuid_move_history(struct drbd_conf *mdev) __must_hold(local)
+{
+	int i;
+
+	for (i = UI_HISTORY_START; i < UI_HISTORY_END; i++) {
+		mdev->ldev->md.uuid[i+1] = mdev->ldev->md.uuid[i];
+
+		trace_drbd_uuid(mdev, i+1);
+	}
+}
+
+void _drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local)
+{
+	if (idx == UI_CURRENT) {
+		if (mdev->state.role == R_PRIMARY)
+			val |= 1;
+		else
+			val &= ~((u64)1);
+
+		drbd_set_ed_uuid(mdev, val);
+	}
+
+	mdev->ldev->md.uuid[idx] = val;
+	trace_drbd_uuid(mdev, idx);
+	drbd_md_mark_dirty(mdev);
+}
+
+
+void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local)
+{
+	if (mdev->ldev->md.uuid[idx]) {
+		drbd_uuid_move_history(mdev);
+		mdev->ldev->md.uuid[UI_HISTORY_START] = mdev->ldev->md.uuid[idx];
+		trace_drbd_uuid(mdev, UI_HISTORY_START);
+	}
+	_drbd_uuid_set(mdev, idx, val);
+}
+
+/**
+ * drbd_uuid_new_current() - Creates a new current UUID
+ * @mdev:	DRBD device.
+ *
+ * Creates a new current UUID, and rotates the old current UUID into
+ * the bitmap slot.  Causes an incremental resync upon next connect.
+ */
+void drbd_uuid_new_current(struct drbd_conf *mdev) __must_hold(local)
+{
+	u64 val;
+
+	dev_info(DEV, "Creating new current UUID\n");
+	D_ASSERT(mdev->ldev->md.uuid[UI_BITMAP] == 0);
+	mdev->ldev->md.uuid[UI_BITMAP] = mdev->ldev->md.uuid[UI_CURRENT];
+	trace_drbd_uuid(mdev, UI_BITMAP);
+
+	get_random_bytes(&val, sizeof(u64));
+	_drbd_uuid_set(mdev, UI_CURRENT, val);
+}
+
+void drbd_uuid_set_bm(struct drbd_conf *mdev, u64 val) __must_hold(local)
+{
+	if (mdev->ldev->md.uuid[UI_BITMAP] == 0 && val == 0)
+		return;
+
+	if (val == 0) {
+		drbd_uuid_move_history(mdev);
+		mdev->ldev->md.uuid[UI_HISTORY_START] = mdev->ldev->md.uuid[UI_BITMAP];
+		mdev->ldev->md.uuid[UI_BITMAP] = 0;
+		trace_drbd_uuid(mdev, UI_HISTORY_START);
+		trace_drbd_uuid(mdev, UI_BITMAP);
+	} else {
+		if (mdev->ldev->md.uuid[UI_BITMAP])
+			dev_warn(DEV, "bm UUID already set");
+
+		mdev->ldev->md.uuid[UI_BITMAP] = val;
+		mdev->ldev->md.uuid[UI_BITMAP] &= ~((u64)1);
+
+		trace_drbd_uuid(mdev, UI_BITMAP);
+	}
+	drbd_md_mark_dirty(mdev);
+}
+
+/**
+ * drbd_bmio_set_n_write() - io_fn for drbd_queue_bitmap_io() or drbd_bitmap_io()
+ * @mdev:	DRBD device.
+ *
+ * Sets all bits in the bitmap and writes the whole bitmap to stable storage.
+ */
+int drbd_bmio_set_n_write(struct drbd_conf *mdev)
+{
+	int rv = -EIO;
+
+	if (get_ldev_if_state(mdev, D_ATTACHING)) {
+		drbd_md_set_flag(mdev, MDF_FULL_SYNC);
+		drbd_md_sync(mdev);
+		drbd_bm_set_all(mdev);
+
+		rv = drbd_bm_write(mdev);
+
+		if (!rv) {
+			drbd_md_clear_flag(mdev, MDF_FULL_SYNC);
+			drbd_md_sync(mdev);
+		}
+
+		put_ldev(mdev);
+	}
+
+	return rv;
+}
+
+/**
+ * drbd_bmio_clear_n_write() - io_fn for drbd_queue_bitmap_io() or drbd_bitmap_io()
+ * @mdev:	DRBD device.
+ *
+ * Clears all bits in the bitmap and writes the whole bitmap to stable storage.
+ */
+int drbd_bmio_clear_n_write(struct drbd_conf *mdev)
+{
+	int rv = -EIO;
+
+	if (get_ldev_if_state(mdev, D_ATTACHING)) {
+		drbd_bm_clear_all(mdev);
+		rv = drbd_bm_write(mdev);
+		put_ldev(mdev);
+	}
+
+	return rv;
+}
+
+static int w_bitmap_io(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	struct bm_io_work *work = container_of(w, struct bm_io_work, w);
+	int rv;
+
+	D_ASSERT(atomic_read(&mdev->ap_bio_cnt) == 0);
+
+	drbd_bm_lock(mdev, work->why);
+	rv = work->io_fn(mdev);
+	drbd_bm_unlock(mdev);
+
+	clear_bit(BITMAP_IO, &mdev->flags);
+	wake_up(&mdev->misc_wait);
+
+	if (work->done)
+		work->done(mdev, rv);
+
+	clear_bit(BITMAP_IO_QUEUED, &mdev->flags);
+	work->why = NULL;
+
+	return 1;
+}
+
+/**
+ * drbd_queue_bitmap_io() - Queues an IO operation on the whole bitmap
+ * @mdev:	DRBD device.
+ * @io_fn:	IO callback to be called when bitmap IO is possible
+ * @done:	callback to be called after the bitmap IO was performed
+ * @why:	Descriptive text of the reason for doing the IO
+ *
+ * While IO on the bitmap happens we freeze application IO thus we ensure
+ * that drbd_set_out_of_sync() can not be called.  This function MAY ONLY be
+ * called from worker context.  It MUST NOT be used while a previous such
+ * work is still pending!
+ */
+void drbd_queue_bitmap_io(struct drbd_conf *mdev,
+			  int (*io_fn)(struct drbd_conf *),
+			  void (*done)(struct drbd_conf *, int),
+			  char *why)
+{
+	D_ASSERT(current == mdev->worker.task);
+
+	D_ASSERT(!test_bit(BITMAP_IO_QUEUED, &mdev->flags));
+	D_ASSERT(!test_bit(BITMAP_IO, &mdev->flags));
+	D_ASSERT(list_empty(&mdev->bm_io_work.w.list));
+	if (mdev->bm_io_work.why)
+		dev_err(DEV, "FIXME going to queue '%s' but '%s' still pending?\n",
+			why, mdev->bm_io_work.why);
+
+	mdev->bm_io_work.io_fn = io_fn;
+	mdev->bm_io_work.done = done;
+	mdev->bm_io_work.why = why;
+
+	set_bit(BITMAP_IO, &mdev->flags);
+	if (atomic_read(&mdev->ap_bio_cnt) == 0) {
+		if (list_empty(&mdev->bm_io_work.w.list)) {
+			set_bit(BITMAP_IO_QUEUED, &mdev->flags);
+			drbd_queue_work(&mdev->data.work, &mdev->bm_io_work.w);
+		} else
+			dev_err(DEV, "FIXME avoided double queuing bm_io_work\n");
+	}
+}
+
+/**
+ * drbd_bitmap_io() - Does an IO operation on the whole bitmap
+ * @mdev:	DRBD device.
+ * @io_fn:	IO callback to be called when bitmap IO is possible
+ * @why:	Descriptive text of the reason for doing the IO
+ *
+ * Freezes application IO while the actual IO operation runs.  This
+ * function MUST NOT be called from worker context.
+ */
+int drbd_bitmap_io(struct drbd_conf *mdev, int (*io_fn)(struct drbd_conf *), char *why)
+{
+	int rv;
+
+	D_ASSERT(current != mdev->worker.task);
+
+	drbd_suspend_io(mdev);
+
+	drbd_bm_lock(mdev, why);
+	rv = io_fn(mdev);
+	drbd_bm_unlock(mdev);
+
+	drbd_resume_io(mdev);
+
+	return rv;
+}
+
+void drbd_md_set_flag(struct drbd_conf *mdev, int flag) __must_hold(local)
+{
+	if ((mdev->ldev->md.flags & flag) != flag) {
+		drbd_md_mark_dirty(mdev);
+		mdev->ldev->md.flags |= flag;
+	}
+}
+
+void drbd_md_clear_flag(struct drbd_conf *mdev, int flag) __must_hold(local)
+{
+	if ((mdev->ldev->md.flags & flag) != 0) {
+		drbd_md_mark_dirty(mdev);
+		mdev->ldev->md.flags &= ~flag;
+	}
+}
+int drbd_md_test_flag(struct drbd_backing_dev *bdev, int flag)
+{
+	return (bdev->md.flags & flag) != 0;
+}
+
+static void md_sync_timer_fn(unsigned long data)
+{
+	struct drbd_conf *mdev = (struct drbd_conf *) data;
+
+	drbd_queue_work_front(&mdev->data.work, &mdev->md_sync_work);
+}
+
+static int w_md_sync(struct drbd_conf *mdev, struct drbd_work *w, int unused)
+{
+	dev_warn(DEV, "md_sync_timer expired! Worker calls drbd_md_sync().\n");
+	drbd_md_sync(mdev);
+
+	return 1;
+}
+
+#ifdef CONFIG_DRBD_FAULT_INJECTION
+/* Fault insertion support including random number generator shamelessly
+ * stolen from kernel/rcutorture.c */
+struct fault_random_state {
+	unsigned long state;
+	unsigned long count;
+};
+
+#define FAULT_RANDOM_MULT 39916801  /* prime */
+#define FAULT_RANDOM_ADD 479001701 /* prime */
+#define FAULT_RANDOM_REFRESH 10000
+
+/*
+ * Crude but fast random-number generator.  Uses a linear congruential
+ * generator, with occasional help from get_random_bytes().
+ */
+static unsigned long
+_drbd_fault_random(struct fault_random_state *rsp)
+{
+	long refresh;
+
+	if (--rsp->count < 0) {
+		get_random_bytes(&refresh, sizeof(refresh));
+		rsp->state += refresh;
+		rsp->count = FAULT_RANDOM_REFRESH;
+	}
+	rsp->state = rsp->state * FAULT_RANDOM_MULT + FAULT_RANDOM_ADD;
+	return swahw32(rsp->state);
+}
+
+static char *
+_drbd_fault_str(unsigned int type) {
+	static char *_faults[] = {
+		[DRBD_FAULT_MD_WR] = "Meta-data write",
+		[DRBD_FAULT_MD_RD] = "Meta-data read",
+		[DRBD_FAULT_RS_WR] = "Resync write",
+		[DRBD_FAULT_RS_RD] = "Resync read",
+		[DRBD_FAULT_DT_WR] = "Data write",
+		[DRBD_FAULT_DT_RD] = "Data read",
+		[DRBD_FAULT_DT_RA] = "Data read ahead",
+		[DRBD_FAULT_BM_ALLOC] = "BM allocation",
+		[DRBD_FAULT_AL_EE] = "EE allocation"
+	};
+
+	return (type < DRBD_FAULT_MAX) ? _faults[type] : "**Unknown**";
+}
+
+unsigned int
+_drbd_insert_fault(struct drbd_conf *mdev, unsigned int type)
+{
+	static struct fault_random_state rrs = {0, 0};
+
+	unsigned int ret = (
+		(fault_devs == 0 ||
+			((1 << mdev_to_minor(mdev)) & fault_devs) != 0) &&
+		(((_drbd_fault_random(&rrs) % 100) + 1) <= fault_rate));
+
+	if (ret) {
+		fault_count++;
+
+		if (printk_ratelimit())
+			dev_warn(DEV, "***Simulating %s failure\n",
+				_drbd_fault_str(type));
+	}
+
+	return ret;
+}
+#endif
+
+const char *drbd_buildtag(void)
+{
+	/* DRBD built from external sources has here a reference to the
+	   git hash of the source code. */
+
+	static char buildtag[38] = "\0uilt-in";
+
+	if (buildtag[0] == 0) {
+#ifdef CONFIG_MODULES
+		if (THIS_MODULE != NULL)
+			sprintf(buildtag, "srcversion: %-24s", THIS_MODULE->srcversion);
+		else
+#endif
+			buildtag[0] = 'b';
+	}
+
+	return buildtag;
+}
+
+module_init(drbd_init)
+module_exit(drbd_cleanup)
+
+/* For drbd_tracing: */
+EXPORT_SYMBOL(drbd_conn_str);
+EXPORT_SYMBOL(drbd_role_str);
+EXPORT_SYMBOL(drbd_disk_str);
+EXPORT_SYMBOL(drbd_set_st_err_str);
+2365
drivers/block/drbd/drbd_nl.c
···11+/*22+ drbd_nl.c33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.88+ Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.99+1010+ drbd is free software; you can redistribute it and/or modify1111+ it under the terms of the GNU General Public License as published by1212+ the Free Software Foundation; either version 2, or (at your option)1313+ any later version.1414+1515+ drbd is distributed in the hope that it will be useful,1616+ but WITHOUT ANY WARRANTY; without even the implied warranty of1717+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the1818+ GNU General Public License for more details.1919+2020+ You should have received a copy of the GNU General Public License2121+ along with drbd; see the file COPYING. If not, write to2222+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2323+2424+ */2525+2626+#include <linux/autoconf.h>2727+#include <linux/module.h>2828+#include <linux/drbd.h>2929+#include <linux/in.h>3030+#include <linux/fs.h>3131+#include <linux/file.h>3232+#include <linux/slab.h>3333+#include <linux/connector.h>3434+#include <linux/blkpg.h>3535+#include <linux/cpumask.h>3636+#include "drbd_int.h"3737+#include "drbd_tracing.h"3838+#include "drbd_wrappers.h"3939+#include <asm/unaligned.h>4040+#include <linux/drbd_tag_magic.h>4141+#include <linux/drbd_limits.h>4242+4343+static unsigned short *tl_add_blob(unsigned short *, enum drbd_tags, const void *, int);4444+static unsigned short *tl_add_str(unsigned short *, enum drbd_tags, const char *);4545+static unsigned short *tl_add_int(unsigned short *, enum drbd_tags, const void *);4646+4747+/* see get_sb_bdev and bd_claim */4848+static char *drbd_m_holder = "Hands off! 
this is DRBD's meta data device.";4949+5050+/* Generate the tag_list to struct functions */5151+#define NL_PACKET(name, number, fields) \5252+static int name ## _from_tags(struct drbd_conf *mdev, \5353+ unsigned short *tags, struct name *arg) __attribute__ ((unused)); \5454+static int name ## _from_tags(struct drbd_conf *mdev, \5555+ unsigned short *tags, struct name *arg) \5656+{ \5757+ int tag; \5858+ int dlen; \5959+ \6060+ while ((tag = get_unaligned(tags++)) != TT_END) { \6161+ dlen = get_unaligned(tags++); \6262+ switch (tag_number(tag)) { \6363+ fields \6464+ default: \6565+ if (tag & T_MANDATORY) { \6666+ dev_err(DEV, "Unknown tag: %d\n", tag_number(tag)); \6767+ return 0; \6868+ } \6969+ } \7070+ tags = (unsigned short *)((char *)tags + dlen); \7171+ } \7272+ return 1; \7373+}7474+#define NL_INTEGER(pn, pr, member) \7575+ case pn: /* D_ASSERT( tag_type(tag) == TT_INTEGER ); */ \7676+ arg->member = get_unaligned((int *)(tags)); \7777+ break;7878+#define NL_INT64(pn, pr, member) \7979+ case pn: /* D_ASSERT( tag_type(tag) == TT_INT64 ); */ \8080+ arg->member = get_unaligned((u64 *)(tags)); \8181+ break;8282+#define NL_BIT(pn, pr, member) \8383+ case pn: /* D_ASSERT( tag_type(tag) == TT_BIT ); */ \8484+ arg->member = *(char *)(tags) ? 
1 : 0; \8585+ break;8686+#define NL_STRING(pn, pr, member, len) \8787+ case pn: /* D_ASSERT( tag_type(tag) == TT_STRING ); */ \8888+ if (dlen > len) { \8989+ dev_err(DEV, "arg too long: %s (%u wanted, max len: %u bytes)\n", \9090+ #member, dlen, (unsigned int)len); \9191+ return 0; \9292+ } \9393+ arg->member ## _len = dlen; \9494+ memcpy(arg->member, tags, min_t(size_t, dlen, len)); \9595+ break;9696+#include "linux/drbd_nl.h"9797+9898+/* Generate the struct to tag_list functions */9999+#define NL_PACKET(name, number, fields) \100100+static unsigned short* \101101+name ## _to_tags(struct drbd_conf *mdev, \102102+ struct name *arg, unsigned short *tags) __attribute__ ((unused)); \103103+static unsigned short* \104104+name ## _to_tags(struct drbd_conf *mdev, \105105+ struct name *arg, unsigned short *tags) \106106+{ \107107+ fields \108108+ return tags; \109109+}110110+111111+#define NL_INTEGER(pn, pr, member) \112112+ put_unaligned(pn | pr | TT_INTEGER, tags++); \113113+ put_unaligned(sizeof(int), tags++); \114114+ put_unaligned(arg->member, (int *)tags); \115115+ tags = (unsigned short *)((char *)tags+sizeof(int));116116+#define NL_INT64(pn, pr, member) \117117+ put_unaligned(pn | pr | TT_INT64, tags++); \118118+ put_unaligned(sizeof(u64), tags++); \119119+ put_unaligned(arg->member, (u64 *)tags); \120120+ tags = (unsigned short *)((char *)tags+sizeof(u64));121121+#define NL_BIT(pn, pr, member) \122122+ put_unaligned(pn | pr | TT_BIT, tags++); \123123+ put_unaligned(sizeof(char), tags++); \124124+ *(char *)tags = arg->member; \125125+ tags = (unsigned short *)((char *)tags+sizeof(char));126126+#define NL_STRING(pn, pr, member, len) \127127+ put_unaligned(pn | pr | TT_STRING, tags++); \128128+ put_unaligned(arg->member ## _len, tags++); \129129+ memcpy(tags, arg->member, arg->member ## _len); \130130+ tags = (unsigned short *)((char *)tags + arg->member ## _len);131131+#include "linux/drbd_nl.h"132132+133133+void drbd_bcast_ev_helper(struct drbd_conf *mdev, char 
*helper_name);134134+void drbd_nl_send_reply(struct cn_msg *, int);135135+136136+int drbd_khelper(struct drbd_conf *mdev, char *cmd)137137+{138138+ char *envp[] = { "HOME=/",139139+ "TERM=linux",140140+ "PATH=/sbin:/usr/sbin:/bin:/usr/bin",141141+ NULL, /* Will be set to address family */142142+ NULL, /* Will be set to address */143143+ NULL };144144+145145+ char mb[12], af[20], ad[60], *afs;146146+ char *argv[] = {usermode_helper, cmd, mb, NULL };147147+ int ret;148148+149149+ snprintf(mb, 12, "minor-%d", mdev_to_minor(mdev));150150+151151+ if (get_net_conf(mdev)) {152152+ switch (((struct sockaddr *)mdev->net_conf->peer_addr)->sa_family) {153153+ case AF_INET6:154154+ afs = "ipv6";155155+ snprintf(ad, 60, "DRBD_PEER_ADDRESS=%pI6",156156+ &((struct sockaddr_in6 *)mdev->net_conf->peer_addr)->sin6_addr);157157+ break;158158+ case AF_INET:159159+ afs = "ipv4";160160+ snprintf(ad, 60, "DRBD_PEER_ADDRESS=%pI4",161161+ &((struct sockaddr_in *)mdev->net_conf->peer_addr)->sin_addr);162162+ break;163163+ default:164164+ afs = "ssocks";165165+ snprintf(ad, 60, "DRBD_PEER_ADDRESS=%pI4",166166+ &((struct sockaddr_in *)mdev->net_conf->peer_addr)->sin_addr);167167+ }168168+ snprintf(af, 20, "DRBD_PEER_AF=%s", afs);169169+ envp[3]=af;170170+ envp[4]=ad;171171+ put_net_conf(mdev);172172+ }173173+174174+ dev_info(DEV, "helper command: %s %s %s\n", usermode_helper, cmd, mb);175175+176176+ drbd_bcast_ev_helper(mdev, cmd);177177+ ret = call_usermodehelper(usermode_helper, argv, envp, 1);178178+ if (ret)179179+ dev_warn(DEV, "helper command: %s %s %s exit code %u (0x%x)\n",180180+ usermode_helper, cmd, mb,181181+ (ret >> 8) & 0xff, ret);182182+ else183183+ dev_info(DEV, "helper command: %s %s %s exit code %u (0x%x)\n",184184+ usermode_helper, cmd, mb,185185+ (ret >> 8) & 0xff, ret);186186+187187+ if (ret < 0) /* Ignore any ERRNOs we got. 
*/188188+ ret = 0;189189+190190+ return ret;191191+}192192+193193+enum drbd_disk_state drbd_try_outdate_peer(struct drbd_conf *mdev)194194+{195195+ char *ex_to_string;196196+ int r;197197+ enum drbd_disk_state nps;198198+ enum drbd_fencing_p fp;199199+200200+ D_ASSERT(mdev->state.pdsk == D_UNKNOWN);201201+202202+ if (get_ldev_if_state(mdev, D_CONSISTENT)) {203203+ fp = mdev->ldev->dc.fencing;204204+ put_ldev(mdev);205205+ } else {206206+ dev_warn(DEV, "Not fencing peer, I'm not even Consistent myself.\n");207207+ return mdev->state.pdsk;208208+ }209209+210210+ if (fp == FP_STONITH)211211+ _drbd_request_state(mdev, NS(susp, 1), CS_WAIT_COMPLETE);212212+213213+ r = drbd_khelper(mdev, "fence-peer");214214+215215+ switch ((r>>8) & 0xff) {216216+ case 3: /* peer is inconsistent */217217+ ex_to_string = "peer is inconsistent or worse";218218+ nps = D_INCONSISTENT;219219+ break;220220+ case 4: /* peer got outdated, or was already outdated */221221+ ex_to_string = "peer was fenced";222222+ nps = D_OUTDATED;223223+ break;224224+ case 5: /* peer was down */225225+ if (mdev->state.disk == D_UP_TO_DATE) {226226+ /* we will(have) create(d) a new UUID anyways... */227227+ ex_to_string = "peer is unreachable, assumed to be dead";228228+ nps = D_OUTDATED;229229+ } else {230230+ ex_to_string = "peer unreachable, doing nothing since disk != UpToDate";231231+ nps = mdev->state.pdsk;232232+ }233233+ break;234234+ case 6: /* Peer is primary, voluntarily outdate myself.235235+ * This is useful when an unconnected R_SECONDARY is asked to236236+ * become R_PRIMARY, but finds the other peer being active. 
*/237237+ ex_to_string = "peer is active";238238+ dev_warn(DEV, "Peer is primary, outdating myself.\n");239239+ nps = D_UNKNOWN;240240+ _drbd_request_state(mdev, NS(disk, D_OUTDATED), CS_WAIT_COMPLETE);241241+ break;242242+ case 7:243243+ if (fp != FP_STONITH)244244+ dev_err(DEV, "fence-peer() = 7 && fencing != Stonith !!!\n");245245+ ex_to_string = "peer was stonithed";246246+ nps = D_OUTDATED;247247+ break;248248+ default:249249+ /* The script is broken ... */250250+ nps = D_UNKNOWN;251251+ dev_err(DEV, "fence-peer helper broken, returned %d\n", (r>>8)&0xff);252252+ return nps;253253+ }254254+255255+ dev_info(DEV, "fence-peer helper returned %d (%s)\n",256256+ (r>>8) & 0xff, ex_to_string);257257+ return nps;258258+}259259+260260+261261+int drbd_set_role(struct drbd_conf *mdev, enum drbd_role new_role, int force)262262+{263263+ const int max_tries = 4;264264+ int r = 0;265265+ int try = 0;266266+ int forced = 0;267267+ union drbd_state mask, val;268268+ enum drbd_disk_state nps;269269+270270+ if (new_role == R_PRIMARY)271271+ request_ping(mdev); /* Detect a dead peer ASAP */272272+273273+ mutex_lock(&mdev->state_mutex);274274+275275+ mask.i = 0; mask.role = R_MASK;276276+ val.i = 0; val.role = new_role;277277+278278+ while (try++ < max_tries) {279279+ r = _drbd_request_state(mdev, mask, val, CS_WAIT_COMPLETE);280280+281281+ /* in case we first succeeded to outdate,282282+ * but now suddenly could establish a connection */283283+ if (r == SS_CW_FAILED_BY_PEER && mask.pdsk != 0) {284284+ val.pdsk = 0;285285+ mask.pdsk = 0;286286+ continue;287287+ }288288+289289+ if (r == SS_NO_UP_TO_DATE_DISK && force &&290290+ (mdev->state.disk == D_INCONSISTENT ||291291+ mdev->state.disk == D_OUTDATED)) {292292+ mask.disk = D_MASK;293293+ val.disk = D_UP_TO_DATE;294294+ forced = 1;295295+ continue;296296+ }297297+298298+ if (r == SS_NO_UP_TO_DATE_DISK &&299299+ mdev->state.disk == D_CONSISTENT && mask.pdsk == 0) {300300+ D_ASSERT(mdev->state.pdsk == D_UNKNOWN);301301+ nps = 
drbd_try_outdate_peer(mdev);302302+303303+ if (nps == D_OUTDATED || nps == D_INCONSISTENT) {304304+ val.disk = D_UP_TO_DATE;305305+ mask.disk = D_MASK;306306+ }307307+308308+ val.pdsk = nps;309309+ mask.pdsk = D_MASK;310310+311311+ continue;312312+ }313313+314314+ if (r == SS_NOTHING_TO_DO)315315+ goto fail;316316+ if (r == SS_PRIMARY_NOP && mask.pdsk == 0) {317317+ nps = drbd_try_outdate_peer(mdev);318318+319319+ if (force && nps > D_OUTDATED) {320320+ dev_warn(DEV, "Forced into split brain situation!\n");321321+ nps = D_OUTDATED;322322+ }323323+324324+ mask.pdsk = D_MASK;325325+ val.pdsk = nps;326326+327327+ continue;328328+ }329329+ if (r == SS_TWO_PRIMARIES) {330330+ /* Maybe the peer is detected as dead very soon...331331+ retry at most once more in this case. */332332+ __set_current_state(TASK_INTERRUPTIBLE);333333+ schedule_timeout((mdev->net_conf->ping_timeo+1)*HZ/10);334334+ if (try < max_tries)335335+ try = max_tries - 1;336336+ continue;337337+ }338338+ if (r < SS_SUCCESS) {339339+ r = _drbd_request_state(mdev, mask, val,340340+ CS_VERBOSE + CS_WAIT_COMPLETE);341341+ if (r < SS_SUCCESS)342342+ goto fail;343343+ }344344+ break;345345+ }346346+347347+ if (r < SS_SUCCESS)348348+ goto fail;349349+350350+ if (forced)351351+ dev_warn(DEV, "Forced to consider local data as UpToDate!\n");352352+353353+ /* Wait until nothing is on the fly :) */354354+ wait_event(mdev->misc_wait, atomic_read(&mdev->ap_pending_cnt) == 0);355355+356356+ if (new_role == R_SECONDARY) {357357+ set_disk_ro(mdev->vdisk, TRUE);358358+ if (get_ldev(mdev)) {359359+ mdev->ldev->md.uuid[UI_CURRENT] &= ~(u64)1;360360+ put_ldev(mdev);361361+ }362362+ } else {363363+ if (get_net_conf(mdev)) {364364+ mdev->net_conf->want_lose = 0;365365+ put_net_conf(mdev);366366+ }367367+ set_disk_ro(mdev->vdisk, FALSE);368368+ if (get_ldev(mdev)) {369369+ if (((mdev->state.conn < C_CONNECTED ||370370+ mdev->state.pdsk <= D_FAILED)371371+ && mdev->ldev->md.uuid[UI_BITMAP] == 0) || forced)372372+ 
drbd_uuid_new_current(mdev);373373+374374+ mdev->ldev->md.uuid[UI_CURRENT] |= (u64)1;375375+ put_ldev(mdev);376376+ }377377+ }378378+379379+ if ((new_role == R_SECONDARY) && get_ldev(mdev)) {380380+ drbd_al_to_on_disk_bm(mdev);381381+ put_ldev(mdev);382382+ }383383+384384+ if (mdev->state.conn >= C_WF_REPORT_PARAMS) {385385+ /* if this was forced, we should consider sync */386386+ if (forced)387387+ drbd_send_uuids(mdev);388388+ drbd_send_state(mdev);389389+ }390390+391391+ drbd_md_sync(mdev);392392+393393+ kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);394394+ fail:395395+ mutex_unlock(&mdev->state_mutex);396396+ return r;397397+}398398+399399+400400+static int drbd_nl_primary(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,401401+ struct drbd_nl_cfg_reply *reply)402402+{403403+ struct primary primary_args;404404+405405+ memset(&primary_args, 0, sizeof(struct primary));406406+ if (!primary_from_tags(mdev, nlp->tag_list, &primary_args)) {407407+ reply->ret_code = ERR_MANDATORY_TAG;408408+ return 0;409409+ }410410+411411+ reply->ret_code =412412+ drbd_set_role(mdev, R_PRIMARY, primary_args.overwrite_peer);413413+414414+ return 0;415415+}416416+417417+static int drbd_nl_secondary(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,418418+ struct drbd_nl_cfg_reply *reply)419419+{420420+ reply->ret_code = drbd_set_role(mdev, R_SECONDARY, 0);421421+422422+ return 0;423423+}424424+425425+/* initializes the md.*_offset members, so we are able to find426426+ * the on disk meta data */427427+static void drbd_md_set_sector_offsets(struct drbd_conf *mdev,428428+ struct drbd_backing_dev *bdev)429429+{430430+ sector_t md_size_sect = 0;431431+ switch (bdev->dc.meta_dev_idx) {432432+ default:433433+ /* v07 style fixed size indexed meta data */434434+ bdev->md.md_size_sect = MD_RESERVED_SECT;435435+ bdev->md.md_offset = drbd_md_ss__(mdev, bdev);436436+ bdev->md.al_offset = MD_AL_OFFSET;437437+ bdev->md.bm_offset = MD_BM_OFFSET;438438+ break;439439+ case 
DRBD_MD_INDEX_FLEX_EXT:440440+ /* just occupy the full device; unit: sectors */441441+ bdev->md.md_size_sect = drbd_get_capacity(bdev->md_bdev);442442+ bdev->md.md_offset = 0;443443+ bdev->md.al_offset = MD_AL_OFFSET;444444+ bdev->md.bm_offset = MD_BM_OFFSET;445445+ break;446446+ case DRBD_MD_INDEX_INTERNAL:447447+ case DRBD_MD_INDEX_FLEX_INT:448448+ bdev->md.md_offset = drbd_md_ss__(mdev, bdev);449449+ /* al size is still fixed */450450+ bdev->md.al_offset = -MD_AL_MAX_SIZE;451451+ /* we need (slightly less than) ~ this much bitmap sectors: */452452+ md_size_sect = drbd_get_capacity(bdev->backing_bdev);453453+ md_size_sect = ALIGN(md_size_sect, BM_SECT_PER_EXT);454454+ md_size_sect = BM_SECT_TO_EXT(md_size_sect);455455+ md_size_sect = ALIGN(md_size_sect, 8);456456+457457+ /* plus the "drbd meta data super block",458458+ * and the activity log; */459459+ md_size_sect += MD_BM_OFFSET;460460+461461+ bdev->md.md_size_sect = md_size_sect;462462+ /* bitmap offset is adjusted by 'super' block size */463463+ bdev->md.bm_offset = -md_size_sect + MD_AL_OFFSET;464464+ break;465465+ }466466+}467467+468468+char *ppsize(char *buf, unsigned long long size)469469+{470470+ /* Needs 9 bytes at max. */471471+ static char units[] = { 'K', 'M', 'G', 'T', 'P', 'E' };472472+ int base = 0;473473+ while (size >= 10000) {474474+ /* shift + round */475475+ size = (size >> 10) + !!(size & (1<<9));476476+ base++;477477+ }478478+ sprintf(buf, "%lu %cB", (long)size, units[base]);479479+480480+ return buf;481481+}482482+483483+/* there is still a theoretical deadlock when called from receiver484484+ * on an D_INCONSISTENT R_PRIMARY:485485+ * remote READ does inc_ap_bio, receiver would need to receive answer486486+ * packet from remote to dec_ap_bio again.487487+ * receiver receive_sizes(), comes here,488488+ * waits for ap_bio_cnt == 0. 
-> deadlock.489489+ * but this cannot happen, actually, because:490490+ * R_PRIMARY D_INCONSISTENT, and peer's disk is unreachable491491+ * (not connected, or bad/no disk on peer):492492+ * see drbd_fail_request_early, ap_bio_cnt is zero.493493+ * R_PRIMARY D_INCONSISTENT, and C_SYNC_TARGET:494494+ * peer may not initiate a resize.495495+ */496496+void drbd_suspend_io(struct drbd_conf *mdev)497497+{498498+ set_bit(SUSPEND_IO, &mdev->flags);499499+ wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_bio_cnt));500500+}501501+502502+void drbd_resume_io(struct drbd_conf *mdev)503503+{504504+ clear_bit(SUSPEND_IO, &mdev->flags);505505+ wake_up(&mdev->misc_wait);506506+}507507+508508+/**509509+ * drbd_determine_dev_size() - Sets the right device size obeying all constraints510510+ * @mdev: DRBD device.511511+ *512512+ * Returns 0 on success, negative return values indicate errors.513513+ * You should call drbd_md_sync() after calling this function.514514+ */515515+enum determine_dev_size drbd_determin_dev_size(struct drbd_conf *mdev) __must_hold(local)516516+{517517+ sector_t prev_first_sect, prev_size; /* previous meta location */518518+ sector_t la_size;519519+ sector_t size;520520+ char ppb[10];521521+522522+ int md_moved, la_size_changed;523523+ enum determine_dev_size rv = unchanged;524524+525525+ /* race:526526+ * application request passes inc_ap_bio,527527+ * but then cannot get an AL-reference.528528+ * this function later may wait on ap_bio_cnt == 0. 
-> deadlock.529529+ *530530+ * to avoid that:531531+ * Suspend IO right here.532532+ * still lock the act_log to not trigger ASSERTs there.533533+ */534534+ drbd_suspend_io(mdev);535535+536536+ /* no wait necessary anymore, actually we could assert that */537537+ wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));538538+539539+ prev_first_sect = drbd_md_first_sector(mdev->ldev);540540+ prev_size = mdev->ldev->md.md_size_sect;541541+ la_size = mdev->ldev->md.la_size_sect;542542+543543+ /* TODO: should only be some assert here, not (re)init... */544544+ drbd_md_set_sector_offsets(mdev, mdev->ldev);545545+546546+ size = drbd_new_dev_size(mdev, mdev->ldev);547547+548548+ if (drbd_get_capacity(mdev->this_bdev) != size ||549549+ drbd_bm_capacity(mdev) != size) {550550+ int err;551551+ err = drbd_bm_resize(mdev, size);552552+ if (unlikely(err)) {553553+ /* currently there is only one error: ENOMEM! */554554+ size = drbd_bm_capacity(mdev)>>1;555555+ if (size == 0) {556556+ dev_err(DEV, "OUT OF MEMORY! "557557+ "Could not allocate bitmap!\n");558558+ } else {559559+ dev_err(DEV, "BM resizing failed. "560560+ "Leaving size unchanged at size = %lu KB\n",561561+ (unsigned long)size);562562+ }563563+ rv = dev_size_error;564564+ }565565+ /* racy, see comments above. */566566+ drbd_set_my_capacity(mdev, size);567567+ mdev->ldev->md.la_size_sect = size;568568+ dev_info(DEV, "size = %s (%llu KB)\n", ppsize(ppb, size>>1),569569+ (unsigned long long)size>>1);570570+ }571571+ if (rv == dev_size_error)572572+ goto out;573573+574574+ la_size_changed = (la_size != mdev->ldev->md.la_size_sect);575575+576576+ md_moved = prev_first_sect != drbd_md_first_sector(mdev->ldev)577577+ || prev_size != mdev->ldev->md.md_size_sect;578578+579579+ if (la_size_changed || md_moved) {580580+ drbd_al_shrink(mdev); /* All extents inactive. */581581+ dev_info(DEV, "Writing the whole bitmap, %s\n",582582+ la_size_changed && md_moved ? "size changed and md moved" :583583+ la_size_changed ? 
"size changed" : "md moved");584584+ rv = drbd_bitmap_io(mdev, &drbd_bm_write, "size changed"); /* does drbd_resume_io() ! */585585+ drbd_md_mark_dirty(mdev);586586+ }587587+588588+ if (size > la_size)589589+ rv = grew;590590+ if (size < la_size)591591+ rv = shrunk;592592+out:593593+ lc_unlock(mdev->act_log);594594+ wake_up(&mdev->al_wait);595595+ drbd_resume_io(mdev);596596+597597+ return rv;598598+}599599+600600+sector_t601601+drbd_new_dev_size(struct drbd_conf *mdev, struct drbd_backing_dev *bdev)602602+{603603+ sector_t p_size = mdev->p_size; /* partner's disk size. */604604+ sector_t la_size = bdev->md.la_size_sect; /* last agreed size. */605605+ sector_t m_size; /* my size */606606+ sector_t u_size = bdev->dc.disk_size; /* size requested by user. */607607+ sector_t size = 0;608608+609609+ m_size = drbd_get_max_capacity(bdev);610610+611611+ if (p_size && m_size) {612612+ size = min_t(sector_t, p_size, m_size);613613+ } else {614614+ if (la_size) {615615+ size = la_size;616616+ if (m_size && m_size < size)617617+ size = m_size;618618+ if (p_size && p_size < size)619619+ size = p_size;620620+ } else {621621+ if (m_size)622622+ size = m_size;623623+ if (p_size)624624+ size = p_size;625625+ }626626+ }627627+628628+ if (size == 0)629629+ dev_err(DEV, "Both nodes diskless!\n");630630+631631+ if (u_size) {632632+ if (u_size > size)633633+ dev_err(DEV, "Requested disk size is too big (%lu > %lu)\n",634634+ (unsigned long)u_size>>1, (unsigned long)size>>1);635635+ else636636+ size = u_size;637637+ }638638+639639+ return size;640640+}641641+642642+/**643643+ * drbd_check_al_size() - Ensures that the AL is of the right size644644+ * @mdev: DRBD device.645645+ *646646+ * Returns -EBUSY if current al lru is still used, -ENOMEM when allocation647647+ * failed, and 0 on success. 
You should call drbd_md_sync() after calling648648+ * this function.649649+ */650650+static int drbd_check_al_size(struct drbd_conf *mdev)651651+{652652+ struct lru_cache *n, *t;653653+ struct lc_element *e;654654+ unsigned int in_use;655655+ int i;656656+657657+ ERR_IF(mdev->sync_conf.al_extents < 7)658658+ mdev->sync_conf.al_extents = 127;659659+660660+ if (mdev->act_log &&661661+ mdev->act_log->nr_elements == mdev->sync_conf.al_extents)662662+ return 0;663663+664664+ in_use = 0;665665+ t = mdev->act_log;666666+ n = lc_create("act_log", drbd_al_ext_cache,667667+ mdev->sync_conf.al_extents, sizeof(struct lc_element), 0);668668+669669+ if (n == NULL) {670670+ dev_err(DEV, "Cannot allocate act_log lru!\n");671671+ return -ENOMEM;672672+ }673673+ spin_lock_irq(&mdev->al_lock);674674+ if (t) {675675+ for (i = 0; i < t->nr_elements; i++) {676676+ e = lc_element_by_index(t, i);677677+ if (e->refcnt)678678+ dev_err(DEV, "refcnt(%d)==%d\n",679679+ e->lc_number, e->refcnt);680680+ in_use += e->refcnt;681681+ }682682+ }683683+ if (!in_use)684684+ mdev->act_log = n;685685+ spin_unlock_irq(&mdev->al_lock);686686+ if (in_use) {687687+ dev_err(DEV, "Activity log still in use!\n");688688+ lc_destroy(n);689689+ return -EBUSY;690690+ } else {691691+ if (t)692692+ lc_destroy(t);693693+ }694694+ drbd_md_mark_dirty(mdev); /* we changed mdev->act_log->nr_elements */695695+ return 0;696696+}697697+698698+void drbd_setup_queue_param(struct drbd_conf *mdev, unsigned int max_seg_s) __must_hold(local)699699+{700700+ struct request_queue * const q = mdev->rq_queue;701701+ struct request_queue * const b = mdev->ldev->backing_bdev->bd_disk->queue;702702+ int max_segments = mdev->ldev->dc.max_bio_bvecs;703703+704704+ if (b->merge_bvec_fn && !mdev->ldev->dc.use_bmbv)705705+ max_seg_s = PAGE_SIZE;706706+707707+ max_seg_s = min(queue_max_sectors(b) * queue_logical_block_size(b), max_seg_s);708708+709709+ blk_queue_max_sectors(q, max_seg_s >> 9);710710+ blk_queue_max_phys_segments(q,
max_segments ? max_segments : MAX_PHYS_SEGMENTS);711711+ blk_queue_max_hw_segments(q, max_segments ? max_segments : MAX_HW_SEGMENTS);712712+ blk_queue_max_segment_size(q, max_seg_s);713713+ blk_queue_logical_block_size(q, 512);714714+ blk_queue_segment_boundary(q, PAGE_SIZE-1);715715+ blk_stack_limits(&q->limits, &b->limits, 0);716716+717717+ if (b->merge_bvec_fn)718718+ dev_warn(DEV, "Backing device's merge_bvec_fn() = %p\n",719719+ b->merge_bvec_fn);720720+ dev_info(DEV, "max_segment_size ( = BIO size ) = %u\n", queue_max_segment_size(q));721721+722722+ if (q->backing_dev_info.ra_pages != b->backing_dev_info.ra_pages) {723723+ dev_info(DEV, "Adjusting my ra_pages to backing device's (%lu -> %lu)\n",724724+ q->backing_dev_info.ra_pages,725725+ b->backing_dev_info.ra_pages);726726+ q->backing_dev_info.ra_pages = b->backing_dev_info.ra_pages;727727+ }728728+}729729+730730+/* serialize deconfig (worker exiting, doing cleanup)731731+ * and reconfig (drbdsetup disk, drbdsetup net)732732+ *733733+ * wait for a potentially exiting worker, then restart it,734734+ * or start a new one.735735+ */736736+static void drbd_reconfig_start(struct drbd_conf *mdev)737737+{738738+ wait_event(mdev->state_wait, test_and_set_bit(CONFIG_PENDING, &mdev->flags));739739+ wait_event(mdev->state_wait, !test_bit(DEVICE_DYING, &mdev->flags));740740+ drbd_thread_start(&mdev->worker);741741+}742742+743743+/* if still unconfigured, stops worker again.744744+ * if configured now, clears CONFIG_PENDING.745745+ * wakes potential waiters */746746+static void drbd_reconfig_done(struct drbd_conf *mdev)747747+{748748+ spin_lock_irq(&mdev->req_lock);749749+ if (mdev->state.disk == D_DISKLESS &&750750+ mdev->state.conn == C_STANDALONE &&751751+ mdev->state.role == R_SECONDARY) {752752+ set_bit(DEVICE_DYING, &mdev->flags);753753+ drbd_thread_stop_nowait(&mdev->worker);754754+ } else755755+ clear_bit(CONFIG_PENDING, &mdev->flags);756756+ spin_unlock_irq(&mdev->req_lock);757757+ 
wake_up(&mdev->state_wait);758758+}759759+760760+/* always returns 0;761761+ * interesting return code is in reply->ret_code */762762+static int drbd_nl_disk_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,763763+ struct drbd_nl_cfg_reply *reply)764764+{765765+ enum drbd_ret_codes retcode;766766+ enum determine_dev_size dd;767767+ sector_t max_possible_sectors;768768+ sector_t min_md_device_sectors;769769+ struct drbd_backing_dev *nbc = NULL; /* new_backing_conf */770770+ struct inode *inode, *inode2;771771+ struct lru_cache *resync_lru = NULL;772772+ union drbd_state ns, os;773773+ int rv;774774+ int cp_discovered = 0;775775+ int logical_block_size;776776+777777+ drbd_reconfig_start(mdev);778778+779779+ /* if you want to reconfigure, please tear down first */780780+ if (mdev->state.disk > D_DISKLESS) {781781+ retcode = ERR_DISK_CONFIGURED;782782+ goto fail;783783+ }784784+785785+ /* allocation not in the IO path, cqueue thread context */786786+ nbc = kzalloc(sizeof(struct drbd_backing_dev), GFP_KERNEL);787787+ if (!nbc) {788788+ retcode = ERR_NOMEM;789789+ goto fail;790790+ }791791+792792+ nbc->dc.disk_size = DRBD_DISK_SIZE_SECT_DEF;793793+ nbc->dc.on_io_error = DRBD_ON_IO_ERROR_DEF;794794+ nbc->dc.fencing = DRBD_FENCING_DEF;795795+ nbc->dc.max_bio_bvecs = DRBD_MAX_BIO_BVECS_DEF;796796+797797+ if (!disk_conf_from_tags(mdev, nlp->tag_list, &nbc->dc)) {798798+ retcode = ERR_MANDATORY_TAG;799799+ goto fail;800800+ }801801+802802+ if (nbc->dc.meta_dev_idx < DRBD_MD_INDEX_FLEX_INT) {803803+ retcode = ERR_MD_IDX_INVALID;804804+ goto fail;805805+ }806806+807807+ nbc->lo_file = filp_open(nbc->dc.backing_dev, O_RDWR, 0);808808+ if (IS_ERR(nbc->lo_file)) {809809+ dev_err(DEV, "open(\"%s\") failed with %ld\n", nbc->dc.backing_dev,810810+ PTR_ERR(nbc->lo_file));811811+ nbc->lo_file = NULL;812812+ retcode = ERR_OPEN_DISK;813813+ goto fail;814814+ }815815+816816+ inode = nbc->lo_file->f_dentry->d_inode;817817+818818+ if (!S_ISBLK(inode->i_mode)) {819819+ retcode =
ERR_DISK_NOT_BDEV;820820+ goto fail;821821+ }822822+823823+ nbc->md_file = filp_open(nbc->dc.meta_dev, O_RDWR, 0);824824+ if (IS_ERR(nbc->md_file)) {825825+ dev_err(DEV, "open(\"%s\") failed with %ld\n", nbc->dc.meta_dev,826826+ PTR_ERR(nbc->md_file));827827+ nbc->md_file = NULL;828828+ retcode = ERR_OPEN_MD_DISK;829829+ goto fail;830830+ }831831+832832+ inode2 = nbc->md_file->f_dentry->d_inode;833833+834834+ if (!S_ISBLK(inode2->i_mode)) {835835+ retcode = ERR_MD_NOT_BDEV;836836+ goto fail;837837+ }838838+839839+ nbc->backing_bdev = inode->i_bdev;840840+ if (bd_claim(nbc->backing_bdev, mdev)) {841841+ printk(KERN_ERR "drbd: bd_claim(%p,%p); failed [%p;%p;%u]\n",842842+ nbc->backing_bdev, mdev,843843+ nbc->backing_bdev->bd_holder,844844+ nbc->backing_bdev->bd_contains->bd_holder,845845+ nbc->backing_bdev->bd_holders);846846+ retcode = ERR_BDCLAIM_DISK;847847+ goto fail;848848+ }849849+850850+ resync_lru = lc_create("resync", drbd_bm_ext_cache,851851+ 61, sizeof(struct bm_extent),852852+ offsetof(struct bm_extent, lce));853853+ if (!resync_lru) {854854+ retcode = ERR_NOMEM;855855+ goto release_bdev_fail;856856+ }857857+858858+ /* meta_dev_idx >= 0: external fixed size,859859+ * possibly multiple drbd sharing one meta device.860860+ * TODO in that case, paranoia check that [md_bdev, meta_dev_idx] is861861+ * not yet used by some other drbd minor!862862+ * (if you use drbd.conf + drbdadm,863863+ * that should check it for you already; but if you don't, or someone864864+ * fooled it, we need to double check here) */865865+ nbc->md_bdev = inode2->i_bdev;866866+ if (bd_claim(nbc->md_bdev, (nbc->dc.meta_dev_idx < 0) ? 
(void *)mdev867867+ : (void *) drbd_m_holder)) {868868+ retcode = ERR_BDCLAIM_MD_DISK;869869+ goto release_bdev_fail;870870+ }871871+872872+ if ((nbc->backing_bdev == nbc->md_bdev) !=873873+ (nbc->dc.meta_dev_idx == DRBD_MD_INDEX_INTERNAL ||874874+ nbc->dc.meta_dev_idx == DRBD_MD_INDEX_FLEX_INT)) {875875+ retcode = ERR_MD_IDX_INVALID;876876+ goto release_bdev2_fail;877877+ }878878+879879+ /* RT - for drbd_get_max_capacity() DRBD_MD_INDEX_FLEX_INT */880880+ drbd_md_set_sector_offsets(mdev, nbc);881881+882882+ if (drbd_get_max_capacity(nbc) < nbc->dc.disk_size) {883883+ dev_err(DEV, "max capacity %llu smaller than disk size %llu\n",884884+ (unsigned long long) drbd_get_max_capacity(nbc),885885+ (unsigned long long) nbc->dc.disk_size);886886+ retcode = ERR_DISK_TO_SMALL;887887+ goto release_bdev2_fail;888888+ }889889+890890+ if (nbc->dc.meta_dev_idx < 0) {891891+ max_possible_sectors = DRBD_MAX_SECTORS_FLEX;892892+ /* at least one MB, otherwise it does not make sense */893893+ min_md_device_sectors = (2<<10);894894+ } else {895895+ max_possible_sectors = DRBD_MAX_SECTORS;896896+ min_md_device_sectors = MD_RESERVED_SECT * (nbc->dc.meta_dev_idx + 1);897897+ }898898+899899+ if (drbd_get_capacity(nbc->md_bdev) > max_possible_sectors)900900+ dev_warn(DEV, "truncating very big lower level device "901901+ "to currently maximum possible %llu sectors\n",902902+ (unsigned long long) max_possible_sectors);903903+904904+ if (drbd_get_capacity(nbc->md_bdev) < min_md_device_sectors) {905905+ retcode = ERR_MD_DISK_TO_SMALL;906906+ dev_warn(DEV, "refusing attach: md-device too small, "907907+ "at least %llu sectors needed for this meta-disk type\n",908908+ (unsigned long long) min_md_device_sectors);909909+ goto release_bdev2_fail;910910+ }911911+912912+ /* Make sure the new disk is big enough913913+ * (we may currently be R_PRIMARY with no local disk...) 
*/914914+ if (drbd_get_max_capacity(nbc) <915915+ drbd_get_capacity(mdev->this_bdev)) {916916+ retcode = ERR_DISK_TO_SMALL;917917+ goto release_bdev2_fail;918918+ }919919+920920+ nbc->known_size = drbd_get_capacity(nbc->backing_bdev);921921+922922+ drbd_suspend_io(mdev);923923+ /* also wait for the last barrier ack. */924924+ wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_pending_cnt));925925+ /* and for any other previously queued work */926926+ drbd_flush_workqueue(mdev);927927+928928+ retcode = _drbd_request_state(mdev, NS(disk, D_ATTACHING), CS_VERBOSE);929929+ drbd_resume_io(mdev);930930+ if (retcode < SS_SUCCESS)931931+ goto release_bdev2_fail;932932+933933+ if (!get_ldev_if_state(mdev, D_ATTACHING))934934+ goto force_diskless;935935+936936+ drbd_md_set_sector_offsets(mdev, nbc);937937+938938+ if (!mdev->bitmap) {939939+ if (drbd_bm_init(mdev)) {940940+ retcode = ERR_NOMEM;941941+ goto force_diskless_dec;942942+ }943943+ }944944+945945+ retcode = drbd_md_read(mdev, nbc);946946+ if (retcode != NO_ERROR)947947+ goto force_diskless_dec;948948+949949+ if (mdev->state.conn < C_CONNECTED &&950950+ mdev->state.role == R_PRIMARY &&951951+ (mdev->ed_uuid & ~((u64)1)) != (nbc->md.uuid[UI_CURRENT] & ~((u64)1))) {952952+ dev_err(DEV, "Can only attach to data with current UUID=%016llX\n",953953+ (unsigned long long)mdev->ed_uuid);954954+ retcode = ERR_DATA_NOT_CURRENT;955955+ goto force_diskless_dec;956956+ }957957+958958+ /* Since we are diskless, fix the activity log first... */959959+ if (drbd_check_al_size(mdev)) {960960+ retcode = ERR_NOMEM;961961+ goto force_diskless_dec;962962+ }963963+964964+ /* Prevent shrinking of consistent devices ! 
*/965965+ if (drbd_md_test_flag(nbc, MDF_CONSISTENT) &&966966+ drbd_new_dev_size(mdev, nbc) < nbc->md.la_size_sect) {967967+ dev_warn(DEV, "refusing to truncate a consistent device\n");968968+ retcode = ERR_DISK_TO_SMALL;969969+ goto force_diskless_dec;970970+ }971971+972972+ if (!drbd_al_read_log(mdev, nbc)) {973973+ retcode = ERR_IO_MD_DISK;974974+ goto force_diskless_dec;975975+ }976976+977977+ /* allocate a second IO page if logical_block_size != 512 */978978+ logical_block_size = bdev_logical_block_size(nbc->md_bdev);979979+ if (logical_block_size == 0)980980+ logical_block_size = MD_SECTOR_SIZE;981981+982982+ if (logical_block_size != MD_SECTOR_SIZE) {983983+ if (!mdev->md_io_tmpp) {984984+ struct page *page = alloc_page(GFP_NOIO);985985+ if (!page)986986+ goto force_diskless_dec;987987+988988+ dev_warn(DEV, "Meta data's bdev logical_block_size = %d != %d\n",989989+ logical_block_size, MD_SECTOR_SIZE);990990+ dev_warn(DEV, "Workaround engaged (has performance impact).\n");991991+992992+ mdev->md_io_tmpp = page;993993+ }994994+ }995995+996996+ /* Reset the "barriers don't work" bits here, then force meta data to997997+ * be written, to ensure we determine if barriers are supported. */998998+ if (nbc->dc.no_md_flush)999999+ set_bit(MD_NO_BARRIER, &mdev->flags);10001000+ else10011001+ clear_bit(MD_NO_BARRIER, &mdev->flags);10021002+10031003+ /* Point of no return reached.10041004+ * Devices and memory are no longer released by error cleanup below.10051005+ * now mdev takes over responsibility, and the state engine should10061006+ * clean it up somewhere. 
*/10071007+ D_ASSERT(mdev->ldev == NULL);10081008+ mdev->ldev = nbc;10091009+ mdev->resync = resync_lru;10101010+ nbc = NULL;10111011+ resync_lru = NULL;10121012+10131013+ mdev->write_ordering = WO_bio_barrier;10141014+ drbd_bump_write_ordering(mdev, WO_bio_barrier);10151015+10161016+ if (drbd_md_test_flag(mdev->ldev, MDF_CRASHED_PRIMARY))10171017+ set_bit(CRASHED_PRIMARY, &mdev->flags);10181018+ else10191019+ clear_bit(CRASHED_PRIMARY, &mdev->flags);10201020+10211021+ if (drbd_md_test_flag(mdev->ldev, MDF_PRIMARY_IND)) {10221022+ set_bit(CRASHED_PRIMARY, &mdev->flags);10231023+ cp_discovered = 1;10241024+ }10251025+10261026+ mdev->send_cnt = 0;10271027+ mdev->recv_cnt = 0;10281028+ mdev->read_cnt = 0;10291029+ mdev->writ_cnt = 0;10301030+10311031+ drbd_setup_queue_param(mdev, DRBD_MAX_SEGMENT_SIZE);10321032+10331033+ /* If I am currently not R_PRIMARY,10341034+ * but meta data primary indicator is set,10351035+ * I just now recover from a hard crash,10361036+ * and have been R_PRIMARY before that crash.10371037+ *10381038+ * Now, if I had no connection before that crash10391039+ * (have been degraded R_PRIMARY), chances are that10401040+ * I won't find my peer now either.10411041+ *10421042+ * In that case, and _only_ in that case,10431043+ * we use the degr-wfc-timeout instead of the default,10441044+ * so we can automatically recover from a crash of a10451045+ * degraded but active "cluster" after a certain timeout.10461046+ */10471047+ clear_bit(USE_DEGR_WFC_T, &mdev->flags);10481048+ if (mdev->state.role != R_PRIMARY &&10491049+ drbd_md_test_flag(mdev->ldev, MDF_PRIMARY_IND) &&10501050+ !drbd_md_test_flag(mdev->ldev, MDF_CONNECTED_IND))10511051+ set_bit(USE_DEGR_WFC_T, &mdev->flags);10521052+10531053+ dd = drbd_determin_dev_size(mdev);10541054+ if (dd == dev_size_error) {10551055+ retcode = ERR_NOMEM_BITMAP;10561056+ goto force_diskless_dec;10571057+ } else if (dd == grew)10581058+ set_bit(RESYNC_AFTER_NEG, &mdev->flags);10591059+10601060+ if 
(drbd_md_test_flag(mdev->ldev, MDF_FULL_SYNC)) {10611061+ dev_info(DEV, "Assuming that all blocks are out of sync "10621062+ "(aka FullSync)\n");10631063+ if (drbd_bitmap_io(mdev, &drbd_bmio_set_n_write, "set_n_write from attaching")) {10641064+ retcode = ERR_IO_MD_DISK;10651065+ goto force_diskless_dec;10661066+ }10671067+ } else {10681068+ if (drbd_bitmap_io(mdev, &drbd_bm_read, "read from attaching") < 0) {10691069+ retcode = ERR_IO_MD_DISK;10701070+ goto force_diskless_dec;10711071+ }10721072+ }10731073+10741074+ if (cp_discovered) {10751075+ drbd_al_apply_to_bm(mdev);10761076+ drbd_al_to_on_disk_bm(mdev);10771077+ }10781078+10791079+ spin_lock_irq(&mdev->req_lock);10801080+ os = mdev->state;10811081+ ns.i = os.i;10821082+ /* If MDF_CONSISTENT is not set go into inconsistent state,10831083+ otherwise investigate MDF_WasUpToDate...10841084+ If MDF_WAS_UP_TO_DATE is not set go into D_OUTDATED disk state,10851085+ otherwise into D_CONSISTENT state.10861086+ */10871087+ if (drbd_md_test_flag(mdev->ldev, MDF_CONSISTENT)) {10881088+ if (drbd_md_test_flag(mdev->ldev, MDF_WAS_UP_TO_DATE))10891089+ ns.disk = D_CONSISTENT;10901090+ else10911091+ ns.disk = D_OUTDATED;10921092+ } else {10931093+ ns.disk = D_INCONSISTENT;10941094+ }10951095+10961096+ if (drbd_md_test_flag(mdev->ldev, MDF_PEER_OUT_DATED))10971097+ ns.pdsk = D_OUTDATED;10981098+10991099+ if ( ns.disk == D_CONSISTENT &&11001100+ (ns.pdsk == D_OUTDATED || mdev->ldev->dc.fencing == FP_DONT_CARE))11011101+ ns.disk = D_UP_TO_DATE;11021102+11031103+ /* All tests on MDF_PRIMARY_IND, MDF_CONNECTED_IND,11041104+ MDF_CONSISTENT and MDF_WAS_UP_TO_DATE must happen before11051105+ this point, because drbd_request_state() modifies these11061106+ flags. */11071107+11081108+ /* In case we are C_CONNECTED postpone any decision on the new disk11091109+ state after the negotiation phase. 
*/11101110+ if (mdev->state.conn == C_CONNECTED) {11111111+ mdev->new_state_tmp.i = ns.i;11121112+ ns.i = os.i;11131113+ ns.disk = D_NEGOTIATING;11141114+ }11151115+11161116+ rv = _drbd_set_state(mdev, ns, CS_VERBOSE, NULL);11171117+ ns = mdev->state;11181118+ spin_unlock_irq(&mdev->req_lock);11191119+11201120+ if (rv < SS_SUCCESS)11211121+ goto force_diskless_dec;11221122+11231123+ if (mdev->state.role == R_PRIMARY)11241124+ mdev->ldev->md.uuid[UI_CURRENT] |= (u64)1;11251125+ else11261126+ mdev->ldev->md.uuid[UI_CURRENT] &= ~(u64)1;11271127+11281128+ drbd_md_mark_dirty(mdev);11291129+ drbd_md_sync(mdev);11301130+11311131+ kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);11321132+ put_ldev(mdev);11331133+ reply->ret_code = retcode;11341134+ drbd_reconfig_done(mdev);11351135+ return 0;11361136+11371137+ force_diskless_dec:11381138+ put_ldev(mdev);11391139+ force_diskless:11401140+ drbd_force_state(mdev, NS(disk, D_DISKLESS));11411141+ drbd_md_sync(mdev);11421142+ release_bdev2_fail:11431143+ if (nbc)11441144+ bd_release(nbc->md_bdev);11451145+ release_bdev_fail:11461146+ if (nbc)11471147+ bd_release(nbc->backing_bdev);11481148+ fail:11491149+ if (nbc) {11501150+ if (nbc->lo_file)11511151+ fput(nbc->lo_file);11521152+ if (nbc->md_file)11531153+ fput(nbc->md_file);11541154+ kfree(nbc);11551155+ }11561156+ lc_destroy(resync_lru);11571157+11581158+ reply->ret_code = retcode;11591159+ drbd_reconfig_done(mdev);11601160+ return 0;11611161+}11621162+11631163+static int drbd_nl_detach(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,11641164+ struct drbd_nl_cfg_reply *reply)11651165+{11661166+ reply->ret_code = drbd_request_state(mdev, NS(disk, D_DISKLESS));11671167+ return 0;11681168+}11691169+11701170+static int drbd_nl_net_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,11711171+ struct drbd_nl_cfg_reply *reply)11721172+{11731173+ int i, ns;11741174+ enum drbd_ret_codes retcode;11751175+ struct net_conf *new_conf = NULL;11761176+ struct crypto_hash 
*tfm = NULL;11771177+ struct crypto_hash *integrity_w_tfm = NULL;11781178+ struct crypto_hash *integrity_r_tfm = NULL;11791179+ struct hlist_head *new_tl_hash = NULL;11801180+ struct hlist_head *new_ee_hash = NULL;11811181+ struct drbd_conf *odev;11821182+ char hmac_name[CRYPTO_MAX_ALG_NAME];11831183+ void *int_dig_out = NULL;11841184+ void *int_dig_in = NULL;11851185+ void *int_dig_vv = NULL;11861186+ struct sockaddr *new_my_addr, *new_peer_addr, *taken_addr;11871187+11881188+ drbd_reconfig_start(mdev);11891189+11901190+ if (mdev->state.conn > C_STANDALONE) {11911191+ retcode = ERR_NET_CONFIGURED;11921192+ goto fail;11931193+ }11941194+11951195+ /* allocation not in the IO path, cqueue thread context */11961196+ new_conf = kmalloc(sizeof(struct net_conf), GFP_KERNEL);11971197+ if (!new_conf) {11981198+ retcode = ERR_NOMEM;11991199+ goto fail;12001200+ }12011201+12021202+ memset(new_conf, 0, sizeof(struct net_conf));12031203+ new_conf->timeout = DRBD_TIMEOUT_DEF;12041204+ new_conf->try_connect_int = DRBD_CONNECT_INT_DEF;12051205+ new_conf->ping_int = DRBD_PING_INT_DEF;12061206+ new_conf->max_epoch_size = DRBD_MAX_EPOCH_SIZE_DEF;12071207+ new_conf->max_buffers = DRBD_MAX_BUFFERS_DEF;12081208+ new_conf->unplug_watermark = DRBD_UNPLUG_WATERMARK_DEF;12091209+ new_conf->sndbuf_size = DRBD_SNDBUF_SIZE_DEF;12101210+ new_conf->rcvbuf_size = DRBD_RCVBUF_SIZE_DEF;12111211+ new_conf->ko_count = DRBD_KO_COUNT_DEF;12121212+ new_conf->after_sb_0p = DRBD_AFTER_SB_0P_DEF;12131213+ new_conf->after_sb_1p = DRBD_AFTER_SB_1P_DEF;12141214+ new_conf->after_sb_2p = DRBD_AFTER_SB_2P_DEF;12151215+ new_conf->want_lose = 0;12161216+ new_conf->two_primaries = 0;12171217+ new_conf->wire_protocol = DRBD_PROT_C;12181218+ new_conf->ping_timeo = DRBD_PING_TIMEO_DEF;12191219+ new_conf->rr_conflict = DRBD_RR_CONFLICT_DEF;12201220+12211221+ if (!net_conf_from_tags(mdev, nlp->tag_list, new_conf)) {12221222+ retcode = ERR_MANDATORY_TAG;12231223+ goto fail;12241224+ }12251225+12261226+ if 
(new_conf->two_primaries12271227+ && (new_conf->wire_protocol != DRBD_PROT_C)) {12281228+ retcode = ERR_NOT_PROTO_C;12291229+ goto fail;12301230+ };12311231+12321232+ if (mdev->state.role == R_PRIMARY && new_conf->want_lose) {12331233+ retcode = ERR_DISCARD;12341234+ goto fail;12351235+ }12361236+12371237+ retcode = NO_ERROR;12381238+12391239+ new_my_addr = (struct sockaddr *)&new_conf->my_addr;12401240+ new_peer_addr = (struct sockaddr *)&new_conf->peer_addr;12411241+ for (i = 0; i < minor_count; i++) {12421242+ odev = minor_to_mdev(i);12431243+ if (!odev || odev == mdev)12441244+ continue;12451245+ if (get_net_conf(odev)) {12461246+ taken_addr = (struct sockaddr *)&odev->net_conf->my_addr;12471247+ if (new_conf->my_addr_len == odev->net_conf->my_addr_len &&12481248+ !memcmp(new_my_addr, taken_addr, new_conf->my_addr_len))12491249+ retcode = ERR_LOCAL_ADDR;12501250+12511251+ taken_addr = (struct sockaddr *)&odev->net_conf->peer_addr;12521252+ if (new_conf->peer_addr_len == odev->net_conf->peer_addr_len &&12531253+ !memcmp(new_peer_addr, taken_addr, new_conf->peer_addr_len))12541254+ retcode = ERR_PEER_ADDR;12551255+12561256+ put_net_conf(odev);12571257+ if (retcode != NO_ERROR)12581258+ goto fail;12591259+ }12601260+ }12611261+12621262+ if (new_conf->cram_hmac_alg[0] != 0) {12631263+ snprintf(hmac_name, CRYPTO_MAX_ALG_NAME, "hmac(%s)",12641264+ new_conf->cram_hmac_alg);12651265+ tfm = crypto_alloc_hash(hmac_name, 0, CRYPTO_ALG_ASYNC);12661266+ if (IS_ERR(tfm)) {12671267+ tfm = NULL;12681268+ retcode = ERR_AUTH_ALG;12691269+ goto fail;12701270+ }12711271+12721272+ if (crypto_tfm_alg_type(crypto_hash_tfm(tfm))12731273+ != CRYPTO_ALG_TYPE_HASH) {12741274+ retcode = ERR_AUTH_ALG_ND;12751275+ goto fail;12761276+ }12771277+ }12781278+12791279+ if (new_conf->integrity_alg[0]) {12801280+ integrity_w_tfm = crypto_alloc_hash(new_conf->integrity_alg, 0, CRYPTO_ALG_ASYNC);12811281+ if (IS_ERR(integrity_w_tfm)) {12821282+ integrity_w_tfm = NULL;12831283+ 
retcode=ERR_INTEGRITY_ALG;12841284+ goto fail;12851285+ }12861286+12871287+ if (!drbd_crypto_is_hash(crypto_hash_tfm(integrity_w_tfm))) {12881288+ retcode=ERR_INTEGRITY_ALG_ND;12891289+ goto fail;12901290+ }12911291+12921292+ integrity_r_tfm = crypto_alloc_hash(new_conf->integrity_alg, 0, CRYPTO_ALG_ASYNC);12931293+ if (IS_ERR(integrity_r_tfm)) {12941294+ integrity_r_tfm = NULL;12951295+ retcode=ERR_INTEGRITY_ALG;12961296+ goto fail;12971297+ }12981298+ }12991299+13001300+ ns = new_conf->max_epoch_size/8;13011301+ if (mdev->tl_hash_s != ns) {13021302+ new_tl_hash = kzalloc(ns*sizeof(void *), GFP_KERNEL);13031303+ if (!new_tl_hash) {13041304+ retcode = ERR_NOMEM;13051305+ goto fail;13061306+ }13071307+ }13081308+13091309+ ns = new_conf->max_buffers/8;13101310+ if (new_conf->two_primaries && (mdev->ee_hash_s != ns)) {13111311+ new_ee_hash = kzalloc(ns*sizeof(void *), GFP_KERNEL);13121312+ if (!new_ee_hash) {13131313+ retcode = ERR_NOMEM;13141314+ goto fail;13151315+ }13161316+ }13171317+13181318+ ((char *)new_conf->shared_secret)[SHARED_SECRET_MAX-1] = 0;13191319+13201320+ if (integrity_w_tfm) {13211321+ i = crypto_hash_digestsize(integrity_w_tfm);13221322+ int_dig_out = kmalloc(i, GFP_KERNEL);13231323+ if (!int_dig_out) {13241324+ retcode = ERR_NOMEM;13251325+ goto fail;13261326+ }13271327+ int_dig_in = kmalloc(i, GFP_KERNEL);13281328+ if (!int_dig_in) {13291329+ retcode = ERR_NOMEM;13301330+ goto fail;13311331+ }13321332+ int_dig_vv = kmalloc(i, GFP_KERNEL);13331333+ if (!int_dig_vv) {13341334+ retcode = ERR_NOMEM;13351335+ goto fail;13361336+ }13371337+ }13381338+13391339+ if (!mdev->bitmap) {13401340+ if(drbd_bm_init(mdev)) {13411341+ retcode = ERR_NOMEM;13421342+ goto fail;13431343+ }13441344+ }13451345+13461346+ spin_lock_irq(&mdev->req_lock);13471347+ if (mdev->net_conf != NULL) {13481348+ retcode = ERR_NET_CONFIGURED;13491349+ spin_unlock_irq(&mdev->req_lock);13501350+ goto fail;13511351+ }13521352+ mdev->net_conf = new_conf;13531353+13541354+ mdev->send_cnt 
= 0;13551355+ mdev->recv_cnt = 0;13561356+13571357+ if (new_tl_hash) {13581358+ kfree(mdev->tl_hash);13591359+ mdev->tl_hash_s = mdev->net_conf->max_epoch_size/8;13601360+ mdev->tl_hash = new_tl_hash;13611361+ }13621362+13631363+ if (new_ee_hash) {13641364+ kfree(mdev->ee_hash);13651365+ mdev->ee_hash_s = mdev->net_conf->max_buffers/8;13661366+ mdev->ee_hash = new_ee_hash;13671367+ }13681368+13691369+ crypto_free_hash(mdev->cram_hmac_tfm);13701370+ mdev->cram_hmac_tfm = tfm;13711371+13721372+ crypto_free_hash(mdev->integrity_w_tfm);13731373+ mdev->integrity_w_tfm = integrity_w_tfm;13741374+13751375+ crypto_free_hash(mdev->integrity_r_tfm);13761376+ mdev->integrity_r_tfm = integrity_r_tfm;13771377+13781378+ kfree(mdev->int_dig_out);13791379+ kfree(mdev->int_dig_in);13801380+ kfree(mdev->int_dig_vv);13811381+ mdev->int_dig_out=int_dig_out;13821382+ mdev->int_dig_in=int_dig_in;13831383+ mdev->int_dig_vv=int_dig_vv;13841384+ spin_unlock_irq(&mdev->req_lock);13851385+13861386+ retcode = _drbd_request_state(mdev, NS(conn, C_UNCONNECTED), CS_VERBOSE);13871387+13881388+ kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);13891389+ reply->ret_code = retcode;13901390+ drbd_reconfig_done(mdev);13911391+ return 0;13921392+13931393+fail:13941394+ kfree(int_dig_out);13951395+ kfree(int_dig_in);13961396+ kfree(int_dig_vv);13971397+ crypto_free_hash(tfm);13981398+ crypto_free_hash(integrity_w_tfm);13991399+ crypto_free_hash(integrity_r_tfm);14001400+ kfree(new_tl_hash);14011401+ kfree(new_ee_hash);14021402+ kfree(new_conf);14031403+14041404+ reply->ret_code = retcode;14051405+ drbd_reconfig_done(mdev);14061406+ return 0;14071407+}14081408+14091409+static int drbd_nl_disconnect(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,14101410+ struct drbd_nl_cfg_reply *reply)14111411+{14121412+ int retcode;14131413+14141414+ retcode = _drbd_request_state(mdev, NS(conn, C_DISCONNECTING), CS_ORDERED);14151415+14161416+ if (retcode == SS_NOTHING_TO_DO)14171417+ goto done;14181418+ 
else if (retcode == SS_ALREADY_STANDALONE)14191419+ goto done;14201420+ else if (retcode == SS_PRIMARY_NOP) {14211421+ /* Our state checking code wants to see the peer outdated. */14221422+ retcode = drbd_request_state(mdev, NS2(conn, C_DISCONNECTING,14231423+ pdsk, D_OUTDATED));14241424+ } else if (retcode == SS_CW_FAILED_BY_PEER) {14251425+ /* The peer probably wants to see us outdated. */14261426+ retcode = _drbd_request_state(mdev, NS2(conn, C_DISCONNECTING,14271427+ disk, D_OUTDATED),14281428+ CS_ORDERED);14291429+ if (retcode == SS_IS_DISKLESS || retcode == SS_LOWER_THAN_OUTDATED) {14301430+ drbd_force_state(mdev, NS(conn, C_DISCONNECTING));14311431+ retcode = SS_SUCCESS;14321432+ }14331433+ }14341434+14351435+ if (retcode < SS_SUCCESS)14361436+ goto fail;14371437+14381438+ if (wait_event_interruptible(mdev->state_wait,14391439+ mdev->state.conn != C_DISCONNECTING)) {14401440+ /* Do not test for mdev->state.conn == C_STANDALONE, since14411441+ someone else might connect us in the meantime!
*/14421442+ retcode = ERR_INTR;14431443+ goto fail;14441444+ }14451445+14461446+ done:14471447+ retcode = NO_ERROR;14481448+ fail:14491449+ drbd_md_sync(mdev);14501450+ reply->ret_code = retcode;14511451+ return 0;14521452+}14531453+14541454+void resync_after_online_grow(struct drbd_conf *mdev)14551455+{14561456+ int iass; /* I am sync source */14571457+14581458+ dev_info(DEV, "Resync of new storage after online grow\n");14591459+ if (mdev->state.role != mdev->state.peer)14601460+ iass = (mdev->state.role == R_PRIMARY);14611461+ else14621462+ iass = test_bit(DISCARD_CONCURRENT, &mdev->flags);14631463+14641464+ if (iass)14651465+ drbd_start_resync(mdev, C_SYNC_SOURCE);14661466+ else14671467+ _drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE + CS_SERIALIZE);14681468+}14691469+14701470+static int drbd_nl_resize(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,14711471+ struct drbd_nl_cfg_reply *reply)14721472+{14731473+ struct resize rs;14741474+ int retcode = NO_ERROR;14751475+ int ldsc = 0; /* local disk size changed */14761476+ enum determine_dev_size dd;14771477+14781478+ memset(&rs, 0, sizeof(struct resize));14791479+ if (!resize_from_tags(mdev, nlp->tag_list, &rs)) {14801480+ retcode = ERR_MANDATORY_TAG;14811481+ goto fail;14821482+ }14831483+14841484+ if (mdev->state.conn > C_CONNECTED) {14851485+ retcode = ERR_RESIZE_RESYNC;14861486+ goto fail;14871487+ }14881488+14891489+ if (mdev->state.role == R_SECONDARY &&14901490+ mdev->state.peer == R_SECONDARY) {14911491+ retcode = ERR_NO_PRIMARY;14921492+ goto fail;14931493+ }14941494+14951495+ if (!get_ldev(mdev)) {14961496+ retcode = ERR_NO_DISK;14971497+ goto fail;14981498+ }14991499+15001500+ if (mdev->ldev->known_size != drbd_get_capacity(mdev->ldev->backing_bdev)) {15011501+ mdev->ldev->known_size = drbd_get_capacity(mdev->ldev->backing_bdev);15021502+ ldsc = 1;15031503+ }15041504+15051505+ mdev->ldev->dc.disk_size = (sector_t)rs.resize_size;15061506+ dd = drbd_determin_dev_size(mdev);15071507+ 
	drbd_md_sync(mdev);
	put_ldev(mdev);
	if (dd == dev_size_error) {
		retcode = ERR_NOMEM_BITMAP;
		goto fail;
	}

	if (mdev->state.conn == C_CONNECTED && (dd != unchanged || ldsc)) {
		if (dd == grew)
			set_bit(RESIZE_PENDING, &mdev->flags);

		drbd_send_uuids(mdev);
		drbd_send_sizes(mdev, 1);
	}

 fail:
	reply->ret_code = retcode;
	return 0;
}

static int drbd_nl_syncer_conf(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			       struct drbd_nl_cfg_reply *reply)
{
	int retcode = NO_ERROR;
	int err;
	int ovr; /* online verify running */
	int rsr; /* re-sync running */
	struct crypto_hash *verify_tfm = NULL;
	struct crypto_hash *csums_tfm = NULL;
	struct syncer_conf sc;
	cpumask_var_t new_cpu_mask;

	if (!zalloc_cpumask_var(&new_cpu_mask, GFP_KERNEL)) {
		retcode = ERR_NOMEM;
		goto fail;
	}

	if (nlp->flags & DRBD_NL_SET_DEFAULTS) {
		memset(&sc, 0, sizeof(struct syncer_conf));
		sc.rate       = DRBD_RATE_DEF;
		sc.after      = DRBD_AFTER_DEF;
		sc.al_extents = DRBD_AL_EXTENTS_DEF;
	} else
		memcpy(&sc, &mdev->sync_conf, sizeof(struct syncer_conf));

	if (!syncer_conf_from_tags(mdev, nlp->tag_list, &sc)) {
		retcode = ERR_MANDATORY_TAG;
		goto fail;
	}

	/* re-sync running */
	rsr = (	mdev->state.conn == C_SYNC_SOURCE ||
		mdev->state.conn == C_SYNC_TARGET ||
		mdev->state.conn == C_PAUSED_SYNC_S ||
		mdev->state.conn == C_PAUSED_SYNC_T );

	if (rsr && strcmp(sc.csums_alg, mdev->sync_conf.csums_alg)) {
		retcode = ERR_CSUMS_RESYNC_RUNNING;
		goto fail;
	}

	if (!rsr && sc.csums_alg[0]) {
		csums_tfm = crypto_alloc_hash(sc.csums_alg, 0, CRYPTO_ALG_ASYNC);
		if (IS_ERR(csums_tfm)) {
			csums_tfm = NULL;
			retcode = ERR_CSUMS_ALG;
			goto fail;
		}

		if (!drbd_crypto_is_hash(crypto_hash_tfm(csums_tfm))) {
			retcode = ERR_CSUMS_ALG_ND;
			goto fail;
		}
	}

	/* online verify running */
	ovr = (mdev->state.conn == C_VERIFY_S || mdev->state.conn == C_VERIFY_T);

	if (ovr) {
		if (strcmp(sc.verify_alg, mdev->sync_conf.verify_alg)) {
			retcode = ERR_VERIFY_RUNNING;
			goto fail;
		}
	}

	if (!ovr && sc.verify_alg[0]) {
		verify_tfm = crypto_alloc_hash(sc.verify_alg, 0, CRYPTO_ALG_ASYNC);
		if (IS_ERR(verify_tfm)) {
			verify_tfm = NULL;
			retcode = ERR_VERIFY_ALG;
			goto fail;
		}

		if (!drbd_crypto_is_hash(crypto_hash_tfm(verify_tfm))) {
			retcode = ERR_VERIFY_ALG_ND;
			goto fail;
		}
	}

	/* silently ignore cpu mask on UP kernel */
	if (nr_cpu_ids > 1 && sc.cpu_mask[0] != 0) {
		err = __bitmap_parse(sc.cpu_mask, 32, 0,
				cpumask_bits(new_cpu_mask), nr_cpu_ids);
		if (err) {
			dev_warn(DEV, "__bitmap_parse() failed with %d\n", err);
			retcode = ERR_CPU_MASK_PARSE;
			goto fail;
		}
	}

	ERR_IF (sc.rate < 1) sc.rate = 1;
	ERR_IF (sc.al_extents < 7) sc.al_extents = 127; /* arbitrary minimum */
#define AL_MAX ((MD_AL_MAX_SIZE-1) * AL_EXTENTS_PT)
	if (sc.al_extents > AL_MAX) {
		dev_err(DEV, "sc.al_extents > %d\n", AL_MAX);
		sc.al_extents = AL_MAX;
	}
#undef AL_MAX

	/* most sanity checks done, try to assign the new sync-after
	 * dependency.  need to hold the global lock in there,
	 * to avoid a race in the dependency loop check. */
	retcode = drbd_alter_sa(mdev, sc.after);
	if (retcode != NO_ERROR)
		goto fail;

	/* ok, assign the rest of it as well.
	 * lock against receive_SyncParam() */
	spin_lock(&mdev->peer_seq_lock);
	mdev->sync_conf = sc;

	if (!rsr) {
		crypto_free_hash(mdev->csums_tfm);
		mdev->csums_tfm = csums_tfm;
		csums_tfm = NULL;
	}

	if (!ovr) {
		crypto_free_hash(mdev->verify_tfm);
		mdev->verify_tfm = verify_tfm;
		verify_tfm = NULL;
	}
	spin_unlock(&mdev->peer_seq_lock);

	if (get_ldev(mdev)) {
		wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
		drbd_al_shrink(mdev);
		err = drbd_check_al_size(mdev);
		lc_unlock(mdev->act_log);
		wake_up(&mdev->al_wait);

		put_ldev(mdev);
		drbd_md_sync(mdev);

		if (err) {
			retcode = ERR_NOMEM;
			goto fail;
		}
	}

	if (mdev->state.conn >= C_CONNECTED)
		drbd_send_sync_param(mdev, &sc);

	if (!cpumask_equal(mdev->cpu_mask, new_cpu_mask)) {
		cpumask_copy(mdev->cpu_mask, new_cpu_mask);
		drbd_calc_cpu_mask(mdev);
		mdev->receiver.reset_cpu_mask = 1;
		mdev->asender.reset_cpu_mask = 1;
		mdev->worker.reset_cpu_mask = 1;
	}

	kobject_uevent(&disk_to_dev(mdev->vdisk)->kobj, KOBJ_CHANGE);
fail:
	free_cpumask_var(new_cpu_mask);
	crypto_free_hash(csums_tfm);
	crypto_free_hash(verify_tfm);
	reply->ret_code = retcode;
	return 0;
}

static int drbd_nl_invalidate(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			      struct drbd_nl_cfg_reply *reply)
{
	int retcode;

	retcode = _drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T), CS_ORDERED);

	if (retcode < SS_SUCCESS && retcode != SS_NEED_CONNECTION)
		retcode = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T));

	while (retcode == SS_NEED_CONNECTION) {
		spin_lock_irq(&mdev->req_lock);
		if (mdev->state.conn < C_CONNECTED)
			retcode = _drbd_set_state(_NS(mdev, disk, D_INCONSISTENT), CS_VERBOSE, NULL);
		spin_unlock_irq(&mdev->req_lock);

		if (retcode != SS_NEED_CONNECTION)
			break;

		retcode = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_T));
	}

	reply->ret_code = retcode;
	return 0;
}

static int drbd_nl_invalidate_peer(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
				   struct drbd_nl_cfg_reply *reply)
{

	reply->ret_code = drbd_request_state(mdev, NS(conn, C_STARTING_SYNC_S));

	return 0;
}

static int drbd_nl_pause_sync(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			      struct drbd_nl_cfg_reply *reply)
{
	int retcode = NO_ERROR;

	if (drbd_request_state(mdev, NS(user_isp, 1)) == SS_NOTHING_TO_DO)
		retcode = ERR_PAUSE_IS_SET;

	reply->ret_code = retcode;
	return 0;
}

static int drbd_nl_resume_sync(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			       struct drbd_nl_cfg_reply *reply)
{
	int retcode = NO_ERROR;

	if (drbd_request_state(mdev, NS(user_isp, 0)) == SS_NOTHING_TO_DO)
		retcode = ERR_PAUSE_IS_CLEAR;

	reply->ret_code = retcode;
	return 0;
}

static int drbd_nl_suspend_io(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			      struct drbd_nl_cfg_reply *reply)
{
	reply->ret_code = drbd_request_state(mdev, NS(susp, 1));

	return 0;
}

static int drbd_nl_resume_io(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			     struct drbd_nl_cfg_reply *reply)
{
	reply->ret_code = drbd_request_state(mdev, NS(susp, 0));
	return 0;
}

static int drbd_nl_outdate(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			   struct drbd_nl_cfg_reply *reply)
{
	reply->ret_code = drbd_request_state(mdev, NS(disk, D_OUTDATED));
	return 0;
}

static int drbd_nl_get_config(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			      struct drbd_nl_cfg_reply *reply)
{
	unsigned short *tl;

	tl = reply->tag_list;

	if (get_ldev(mdev)) {
		tl = disk_conf_to_tags(mdev, &mdev->ldev->dc, tl);
		put_ldev(mdev);
	}

	if (get_net_conf(mdev)) {
		tl = net_conf_to_tags(mdev, mdev->net_conf, tl);
		put_net_conf(mdev);
	}
	tl = syncer_conf_to_tags(mdev, &mdev->sync_conf, tl);

	put_unaligned(TT_END, tl++); /* Close the tag list */

	return (int)((char *)tl - (char *)reply->tag_list);
}

static int drbd_nl_get_state(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			     struct drbd_nl_cfg_reply *reply)
{
	unsigned short *tl = reply->tag_list;
	union drbd_state s = mdev->state;
	unsigned long rs_left;
	unsigned int res;

	tl = get_state_to_tags(mdev, (struct get_state *)&s, tl);

	/* no local ref, no bitmap, no syncer progress. */
	if (s.conn >= C_SYNC_SOURCE && s.conn <= C_PAUSED_SYNC_T) {
		if (get_ldev(mdev)) {
			drbd_get_syncer_progress(mdev, &rs_left, &res);
			tl = tl_add_int(tl, T_sync_progress, &res);
			put_ldev(mdev);
		}
	}
	put_unaligned(TT_END, tl++); /* Close the tag list */

	return (int)((char *)tl - (char *)reply->tag_list);
}

static int drbd_nl_get_uuids(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			     struct drbd_nl_cfg_reply *reply)
{
	unsigned short *tl;

	tl = reply->tag_list;

	if (get_ldev(mdev)) {
		tl = tl_add_blob(tl, T_uuids, mdev->ldev->md.uuid, UI_SIZE*sizeof(u64));
		tl = tl_add_int(tl, T_uuids_flags, &mdev->ldev->md.flags);
		put_ldev(mdev);
	}
	put_unaligned(TT_END, tl++); /* Close the tag list */

	return (int)((char *)tl - (char *)reply->tag_list);
}

/**
 * drbd_nl_get_timeout_flag() - Used by drbdsetup to find out which timeout value to use
 * @mdev:	DRBD device.
 * @nlp:	Netlink/connector packet from drbdsetup
 * @reply:	Reply packet for drbdsetup
 */
static int drbd_nl_get_timeout_flag(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
				    struct drbd_nl_cfg_reply *reply)
{
	unsigned short *tl;
	char rv;

	tl = reply->tag_list;

	rv = mdev->state.pdsk == D_OUTDATED ? UT_PEER_OUTDATED :
	     test_bit(USE_DEGR_WFC_T, &mdev->flags) ? UT_DEGRADED : UT_DEFAULT;

	tl = tl_add_blob(tl, T_use_degraded, &rv, sizeof(rv));
	put_unaligned(TT_END, tl++); /* Close the tag list */

	return (int)((char *)tl - (char *)reply->tag_list);
}

static int drbd_nl_start_ov(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			    struct drbd_nl_cfg_reply *reply)
{
	/* default to resume from last known position, if possible */
	struct start_ov args =
		{ .start_sector = mdev->ov_start_sector };

	if (!start_ov_from_tags(mdev, nlp->tag_list, &args)) {
		reply->ret_code = ERR_MANDATORY_TAG;
		return 0;
	}
	/* w_make_ov_request expects position to be aligned */
	mdev->ov_start_sector = args.start_sector & ~BM_SECT_PER_BIT;
	reply->ret_code = drbd_request_state(mdev, NS(conn, C_VERIFY_S));
	return 0;
}


static int drbd_nl_new_c_uuid(struct drbd_conf *mdev, struct drbd_nl_cfg_req *nlp,
			      struct drbd_nl_cfg_reply *reply)
{
	int retcode = NO_ERROR;
	int skip_initial_sync = 0;
	int err;

	struct new_c_uuid args;

	memset(&args, 0, sizeof(struct new_c_uuid));
	if (!new_c_uuid_from_tags(mdev, nlp->tag_list, &args)) {
		reply->ret_code = ERR_MANDATORY_TAG;
		return 0;
	}

	mutex_lock(&mdev->state_mutex); /* Protects us against serialized state changes. */

	if (!get_ldev(mdev)) {
		retcode = ERR_NO_DISK;
		goto out;
	}

	/* this is "skip initial sync", assume to be clean */
	if (mdev->state.conn == C_CONNECTED && mdev->agreed_pro_version >= 90 &&
	    mdev->ldev->md.uuid[UI_CURRENT] == UUID_JUST_CREATED && args.clear_bm) {
		dev_info(DEV, "Preparing to skip initial sync\n");
		skip_initial_sync = 1;
	} else if (mdev->state.conn != C_STANDALONE) {
		retcode = ERR_CONNECTED;
		goto out_dec;
	}

	drbd_uuid_set(mdev, UI_BITMAP, 0); /* Rotate UI_BITMAP to History 1, etc... */
	drbd_uuid_new_current(mdev); /* New current, previous to UI_BITMAP */

	if (args.clear_bm) {
		err = drbd_bitmap_io(mdev, &drbd_bmio_clear_n_write, "clear_n_write from new_c_uuid");
		if (err) {
			dev_err(DEV, "Writing bitmap failed with %d\n", err);
			retcode = ERR_IO_MD_DISK;
		}
		if (skip_initial_sync) {
			drbd_send_uuids_skip_initial_sync(mdev);
			_drbd_uuid_set(mdev, UI_BITMAP, 0);
			spin_lock_irq(&mdev->req_lock);
			_drbd_set_state(_NS2(mdev, disk, D_UP_TO_DATE, pdsk, D_UP_TO_DATE),
					CS_VERBOSE, NULL);
			spin_unlock_irq(&mdev->req_lock);
		}
	}

	drbd_md_sync(mdev);
out_dec:
	put_ldev(mdev);
out:
	mutex_unlock(&mdev->state_mutex);

	reply->ret_code = retcode;
	return 0;
}

static struct drbd_conf *ensure_mdev(struct drbd_nl_cfg_req *nlp)
{
	struct drbd_conf *mdev;

	if (nlp->drbd_minor >= minor_count)
		return NULL;

	mdev = minor_to_mdev(nlp->drbd_minor);

	if (!mdev && (nlp->flags & DRBD_NL_CREATE_DEVICE)) {
		struct gendisk *disk = NULL;
		mdev = drbd_new_device(nlp->drbd_minor);

		spin_lock_irq(&drbd_pp_lock);
		if (minor_table[nlp->drbd_minor] == NULL) {
			minor_table[nlp->drbd_minor] = mdev;
			disk = mdev->vdisk;
			mdev = NULL;
		} /* else: we lost the race */
		spin_unlock_irq(&drbd_pp_lock);

		if (disk) /* we won the race above */
			/* in case we ever add a drbd_delete_device(),
			 * don't forget the del_gendisk! */
			add_disk(disk);
		else /* we lost the race above */
			drbd_free_mdev(mdev);

		mdev = minor_to_mdev(nlp->drbd_minor);
	}

	return mdev;
}

struct cn_handler_struct {
	int (*function)(struct drbd_conf *,
			struct drbd_nl_cfg_req *,
			struct drbd_nl_cfg_reply *);
	int reply_body_size;
};

static struct cn_handler_struct cnd_table[] = {
	[ P_primary ]		= { &drbd_nl_primary,		0 },
	[ P_secondary ]		= { &drbd_nl_secondary,		0 },
	[ P_disk_conf ]		= { &drbd_nl_disk_conf,		0 },
	[ P_detach ]		= { &drbd_nl_detach,		0 },
	[ P_net_conf ]		= { &drbd_nl_net_conf,		0 },
	[ P_disconnect ]	= { &drbd_nl_disconnect,	0 },
	[ P_resize ]		= { &drbd_nl_resize,		0 },
	[ P_syncer_conf ]	= { &drbd_nl_syncer_conf,	0 },
	[ P_invalidate ]	= { &drbd_nl_invalidate,	0 },
	[ P_invalidate_peer ]	= { &drbd_nl_invalidate_peer,	0 },
	[ P_pause_sync ]	= { &drbd_nl_pause_sync,	0 },
	[ P_resume_sync ]	= { &drbd_nl_resume_sync,	0 },
	[ P_suspend_io ]	= { &drbd_nl_suspend_io,	0 },
	[ P_resume_io ]		= { &drbd_nl_resume_io,		0 },
	[ P_outdate ]		= { &drbd_nl_outdate,		0 },
	[ P_get_config ]	= { &drbd_nl_get_config,
				    sizeof(struct syncer_conf_tag_len_struct) +
				    sizeof(struct disk_conf_tag_len_struct) +
				    sizeof(struct net_conf_tag_len_struct) },
	[ P_get_state ]		= { &drbd_nl_get_state,
				    sizeof(struct get_state_tag_len_struct) +
				    sizeof(struct sync_progress_tag_len_struct) },
	[ P_get_uuids ]		= { &drbd_nl_get_uuids,
				    sizeof(struct get_uuids_tag_len_struct) },
	[ P_get_timeout_flag ]	= { &drbd_nl_get_timeout_flag,
				    sizeof(struct get_timeout_flag_tag_len_struct) },
	[ P_start_ov ]		= { &drbd_nl_start_ov,		0 },
	[ P_new_c_uuid ]	= { &drbd_nl_new_c_uuid,	0 },
};

static void drbd_connector_callback(struct cn_msg *req)
{
	struct drbd_nl_cfg_req *nlp = (struct drbd_nl_cfg_req *)req->data;
	struct cn_handler_struct *cm;
	struct cn_msg *cn_reply;
	struct drbd_nl_cfg_reply *reply;
	struct drbd_conf *mdev;
	int retcode, rr;
	int reply_size = sizeof(struct cn_msg)
		+ sizeof(struct drbd_nl_cfg_reply)
		+ sizeof(short int);

	if (!try_module_get(THIS_MODULE)) {
		printk(KERN_ERR "drbd: try_module_get() failed!\n");
		return;
	}

	mdev = ensure_mdev(nlp);
	if (!mdev) {
		retcode = ERR_MINOR_INVALID;
		goto fail;
	}

	trace_drbd_netlink(req, 1);

	if (nlp->packet_type >= P_nl_after_last_packet) {
		retcode = ERR_PACKET_NR;
		goto fail;
	}

	cm = cnd_table + nlp->packet_type;

	/* This may happen if packet number is 0: */
	if (cm->function == NULL) {
		retcode = ERR_PACKET_NR;
		goto fail;
	}

	reply_size += cm->reply_body_size;

	/* allocation not in the IO path, cqueue thread context */
	cn_reply = kmalloc(reply_size, GFP_KERNEL);
	if (!cn_reply) {
		retcode = ERR_NOMEM;
		goto fail;
	}
	reply = (struct drbd_nl_cfg_reply *) cn_reply->data;

	reply->packet_type =
		cm->reply_body_size ? nlp->packet_type : P_nl_after_last_packet;
	reply->minor = nlp->drbd_minor;
	reply->ret_code = NO_ERROR; /* Might be modified by cm->function. */
	/* reply->tag_list; might be modified by cm->function. */

	rr = cm->function(mdev, nlp, reply);

	cn_reply->id = req->id;
	cn_reply->seq = req->seq;
	cn_reply->ack = req->ack + 1;
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) + rr;
	cn_reply->flags = 0;

	trace_drbd_netlink(cn_reply, 0);
	rr = cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_KERNEL);
	if (rr && rr != -ESRCH)
		printk(KERN_INFO "drbd: cn_netlink_send()=%d\n", rr);

	kfree(cn_reply);
	module_put(THIS_MODULE);
	return;
 fail:
	drbd_nl_send_reply(req, retcode);
	module_put(THIS_MODULE);
}

static atomic_t drbd_nl_seq = ATOMIC_INIT(2); /* two. */

static unsigned short *
__tl_add_blob(unsigned short *tl, enum drbd_tags tag, const void *data,
	unsigned short len, int nul_terminated)
{
	unsigned short l = tag_descriptions[tag_number(tag)].max_len;
	len = (len < l) ? len : l;
	put_unaligned(tag, tl++);
	put_unaligned(len, tl++);
	memcpy(tl, data, len);
	tl = (unsigned short *)((char *)tl + len);
	if (nul_terminated)
		*((char *)tl - 1) = 0;
	return tl;
}

static unsigned short *
tl_add_blob(unsigned short *tl, enum drbd_tags tag, const void *data, int len)
{
	return __tl_add_blob(tl, tag, data, len, 0);
}

static unsigned short *
tl_add_str(unsigned short *tl, enum drbd_tags tag, const char *str)
{
	return __tl_add_blob(tl, tag, str, strlen(str)+1, 0);
}

static unsigned short *
tl_add_int(unsigned short *tl, enum drbd_tags tag, const void *val)
{
	put_unaligned(tag, tl++);
	switch(tag_type(tag)) {
	case TT_INTEGER:
		put_unaligned(sizeof(int), tl++);
		put_unaligned(*(int *)val, (int *)tl);
		tl = (unsigned short *)((char *)tl+sizeof(int));
		break;
	case TT_INT64:
		put_unaligned(sizeof(u64), tl++);
		put_unaligned(*(u64 *)val, (u64 *)tl);
		tl = (unsigned short *)((char *)tl+sizeof(u64));
		break;
	default:
		/* someone did something stupid. */
		;
	}
	return tl;
}

void drbd_bcast_state(struct drbd_conf *mdev, union drbd_state state)
{
	char buffer[sizeof(struct cn_msg)+
		    sizeof(struct drbd_nl_cfg_reply)+
		    sizeof(struct get_state_tag_len_struct)+
		    sizeof(short int)];
	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
	struct drbd_nl_cfg_reply *reply =
		(struct drbd_nl_cfg_reply *)cn_reply->data;
	unsigned short *tl = reply->tag_list;

	/* dev_warn(DEV, "drbd_bcast_state() got called\n"); */

	tl = get_state_to_tags(mdev, (struct get_state *)&state, tl);

	put_unaligned(TT_END, tl++); /* Close the tag list */

	cn_reply->id.idx = CN_IDX_DRBD;
	cn_reply->id.val = CN_VAL_DRBD;

	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
	cn_reply->ack = 0; /* not used here. */
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
		(int)((char *)tl - (char *)reply->tag_list);
	cn_reply->flags = 0;

	reply->packet_type = P_get_state;
	reply->minor = mdev_to_minor(mdev);
	reply->ret_code = NO_ERROR;

	trace_drbd_netlink(cn_reply, 0);
	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_NOIO);
}

void drbd_bcast_ev_helper(struct drbd_conf *mdev, char *helper_name)
{
	char buffer[sizeof(struct cn_msg)+
		    sizeof(struct drbd_nl_cfg_reply)+
		    sizeof(struct call_helper_tag_len_struct)+
		    sizeof(short int)];
	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
	struct drbd_nl_cfg_reply *reply =
		(struct drbd_nl_cfg_reply *)cn_reply->data;
	unsigned short *tl = reply->tag_list;

	/* dev_warn(DEV, "drbd_bcast_ev_helper() got called\n"); */

	tl = tl_add_str(tl, T_helper, helper_name);
	put_unaligned(TT_END, tl++); /* Close the tag list */

	cn_reply->id.idx = CN_IDX_DRBD;
	cn_reply->id.val = CN_VAL_DRBD;

	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
	cn_reply->ack = 0; /* not used here. */
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
		(int)((char *)tl - (char *)reply->tag_list);
	cn_reply->flags = 0;

	reply->packet_type = P_call_helper;
	reply->minor = mdev_to_minor(mdev);
	reply->ret_code = NO_ERROR;

	trace_drbd_netlink(cn_reply, 0);
	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_NOIO);
}

void drbd_bcast_ee(struct drbd_conf *mdev,
		const char *reason, const int dgs,
		const char *seen_hash, const char *calc_hash,
		const struct drbd_epoch_entry *e)
{
	struct cn_msg *cn_reply;
	struct drbd_nl_cfg_reply *reply;
	struct bio_vec *bvec;
	unsigned short *tl;
	int i;

	if (!e)
		return;
	if (!reason || !reason[0])
		return;

	/* apparently we have to memcpy twice, first to prepare the data for the
	 * struct cn_msg, then within cn_netlink_send from the cn_msg to the
	 * netlink skb. */
	/* receiver thread context, which is not in the writeout path (of this node),
	 * but may be in the writeout path of the _other_ node.
	 * GFP_NOIO to avoid potential "distributed deadlock". */
	cn_reply = kmalloc(
		sizeof(struct cn_msg)+
		sizeof(struct drbd_nl_cfg_reply)+
		sizeof(struct dump_ee_tag_len_struct)+
		sizeof(short int),
		GFP_NOIO);

	if (!cn_reply) {
		dev_err(DEV, "could not kmalloc buffer for drbd_bcast_ee, sector %llu, size %u\n",
				(unsigned long long)e->sector, e->size);
		return;
	}

	reply = (struct drbd_nl_cfg_reply *)cn_reply->data;
	tl = reply->tag_list;

	tl = tl_add_str(tl, T_dump_ee_reason, reason);
	tl = tl_add_blob(tl, T_seen_digest, seen_hash, dgs);
	tl = tl_add_blob(tl, T_calc_digest, calc_hash, dgs);
	tl = tl_add_int(tl, T_ee_sector, &e->sector);
	tl = tl_add_int(tl, T_ee_block_id, &e->block_id);

	put_unaligned(T_ee_data, tl++);
	put_unaligned(e->size, tl++);

	__bio_for_each_segment(bvec, e->private_bio, i, 0) {
		void *d = kmap(bvec->bv_page);
		memcpy(tl, d + bvec->bv_offset, bvec->bv_len);
		kunmap(bvec->bv_page);
		tl = (unsigned short *)((char *)tl + bvec->bv_len);
	}
	put_unaligned(TT_END, tl++); /* Close the tag list */

	cn_reply->id.idx = CN_IDX_DRBD;
	cn_reply->id.val = CN_VAL_DRBD;

	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
	cn_reply->ack = 0; /* not used here. */
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
		(int)((char *)tl - (char *)reply->tag_list);
	cn_reply->flags = 0;

	reply->packet_type = P_dump_ee;
	reply->minor = mdev_to_minor(mdev);
	reply->ret_code = NO_ERROR;

	trace_drbd_netlink(cn_reply, 0);
	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_NOIO);
	kfree(cn_reply);
}

void drbd_bcast_sync_progress(struct drbd_conf *mdev)
{
	char buffer[sizeof(struct cn_msg)+
		    sizeof(struct drbd_nl_cfg_reply)+
		    sizeof(struct sync_progress_tag_len_struct)+
		    sizeof(short int)];
	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
	struct drbd_nl_cfg_reply *reply =
		(struct drbd_nl_cfg_reply *)cn_reply->data;
	unsigned short *tl = reply->tag_list;
	unsigned long rs_left;
	unsigned int res;

	/* no local ref, no bitmap, no syncer progress, no broadcast. */
	if (!get_ldev(mdev))
		return;
	drbd_get_syncer_progress(mdev, &rs_left, &res);
	put_ldev(mdev);

	tl = tl_add_int(tl, T_sync_progress, &res);
	put_unaligned(TT_END, tl++); /* Close the tag list */

	cn_reply->id.idx = CN_IDX_DRBD;
	cn_reply->id.val = CN_VAL_DRBD;

	cn_reply->seq = atomic_add_return(1, &drbd_nl_seq);
	cn_reply->ack = 0; /* not used here. */
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply) +
		(int)((char *)tl - (char *)reply->tag_list);
	cn_reply->flags = 0;

	reply->packet_type = P_sync_progress;
	reply->minor = mdev_to_minor(mdev);
	reply->ret_code = NO_ERROR;

	trace_drbd_netlink(cn_reply, 0);
	cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_NOIO);
}

int __init drbd_nl_init(void)
{
	static struct cb_id cn_id_drbd;
	int err, try = 10;

	cn_id_drbd.val = CN_VAL_DRBD;
	do {
		cn_id_drbd.idx = cn_idx;
		err = cn_add_callback(&cn_id_drbd, "cn_drbd", &drbd_connector_callback);
		if (!err)
			break;
		cn_idx = (cn_idx + CN_IDX_STEP);
	} while (try--);

	if (err) {
		printk(KERN_ERR "drbd: cn_drbd failed to register\n");
		return err;
	}

	return 0;
}

void drbd_nl_cleanup(void)
{
	static struct cb_id cn_id_drbd;

	cn_id_drbd.idx = cn_idx;
	cn_id_drbd.val = CN_VAL_DRBD;

	cn_del_callback(&cn_id_drbd);
}

void drbd_nl_send_reply(struct cn_msg *req, int ret_code)
{
	char buffer[sizeof(struct cn_msg)+sizeof(struct drbd_nl_cfg_reply)];
	struct cn_msg *cn_reply = (struct cn_msg *) buffer;
	struct drbd_nl_cfg_reply *reply =
		(struct drbd_nl_cfg_reply *)cn_reply->data;
	int rr;

	cn_reply->id = req->id;

	cn_reply->seq = req->seq;
	cn_reply->ack = req->ack + 1;
	cn_reply->len = sizeof(struct drbd_nl_cfg_reply);
	cn_reply->flags = 0;

	reply->minor = ((struct drbd_nl_cfg_req *)req->data)->drbd_minor;
	reply->ret_code = ret_code;

	trace_drbd_netlink(cn_reply, 0);
	rr = cn_netlink_send(cn_reply, CN_IDX_DRBD, GFP_NOIO);
	if (rr && rr != -ESRCH)
		printk(KERN_INFO "drbd: cn_netlink_send()=%d\n", rr);
}
+266	drivers/block/drbd/drbd_proc.c
···11+/*22+ drbd_proc.c33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.88+ Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.99+1010+ drbd is free software; you can redistribute it and/or modify1111+ it under the terms of the GNU General Public License as published by1212+ the Free Software Foundation; either version 2, or (at your option)1313+ any later version.1414+1515+ drbd is distributed in the hope that it will be useful,1616+ but WITHOUT ANY WARRANTY; without even the implied warranty of1717+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the1818+ GNU General Public License for more details.1919+2020+ You should have received a copy of the GNU General Public License2121+ along with drbd; see the file COPYING. If not, write to2222+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2323+2424+ */2525+2626+#include <linux/autoconf.h>2727+#include <linux/module.h>2828+2929+#include <asm/uaccess.h>3030+#include <linux/fs.h>3131+#include <linux/file.h>3232+#include <linux/slab.h>3333+#include <linux/proc_fs.h>3434+#include <linux/seq_file.h>3535+#include <linux/drbd.h>3636+#include "drbd_int.h"3737+3838+static int drbd_proc_open(struct inode *inode, struct file *file);3939+4040+4141+struct proc_dir_entry *drbd_proc;4242+struct file_operations drbd_proc_fops = {4343+ .owner = THIS_MODULE,4444+ .open = drbd_proc_open,4545+ .read = seq_read,4646+ .llseek = seq_lseek,4747+ .release = single_release,4848+};4949+5050+5151+/*lge5252+ * progress bars shamelessly adapted from driver/md/md.c5353+ * output looks like5454+ * [=====>..............] 
33.5% (23456/123456)5555+ * finish: 2:20:20 speed: 6,345 (6,456) K/sec5656+ */5757+static void drbd_syncer_progress(struct drbd_conf *mdev, struct seq_file *seq)5858+{5959+ unsigned long db, dt, dbdt, rt, rs_left;6060+ unsigned int res;6161+ int i, x, y;6262+6363+ drbd_get_syncer_progress(mdev, &rs_left, &res);6464+6565+ x = res/50;6666+ y = 20-x;6767+ seq_printf(seq, "\t[");6868+ for (i = 1; i < x; i++)6969+ seq_printf(seq, "=");7070+ seq_printf(seq, ">");7171+ for (i = 0; i < y; i++)7272+ seq_printf(seq, ".");7373+ seq_printf(seq, "] ");7474+7575+ seq_printf(seq, "sync'ed:%3u.%u%% ", res / 10, res % 10);7676+ /* if more than 1 GB display in MB */7777+ if (mdev->rs_total > 0x100000L)7878+ seq_printf(seq, "(%lu/%lu)M\n\t",7979+ (unsigned long) Bit2KB(rs_left >> 10),8080+ (unsigned long) Bit2KB(mdev->rs_total >> 10));8181+ else8282+ seq_printf(seq, "(%lu/%lu)K\n\t",8383+ (unsigned long) Bit2KB(rs_left),8484+ (unsigned long) Bit2KB(mdev->rs_total));8585+8686+ /* see drivers/md/md.c8787+ * We do not want to overflow, so the order of operands and8888+ * the * 100 / 100 trick are important. We do a +1 to be8989+ * safe against division by zero. We only estimate anyway.9090+ *9191+ * dt: time from mark until now9292+ * db: blocks written from mark until now9393+ * rt: remaining time9494+ */9595+ dt = (jiffies - mdev->rs_mark_time) / HZ;9696+9797+ if (dt > 20) {9898+ /* if we made no update to rs_mark_time for too long,9999+ * we are stalled. show that. 
	 */
		seq_printf(seq, "stalled\n");
		return;
	}

	if (!dt)
		dt++;
	db = mdev->rs_mark_left - rs_left;
	rt = (dt * (rs_left / (db/100+1)))/100; /* seconds */

	seq_printf(seq, "finish: %lu:%02lu:%02lu",
		rt / 3600, (rt % 3600) / 60, rt % 60);

	/* current speed average over (SYNC_MARKS * SYNC_MARK_STEP) jiffies */
	dbdt = Bit2KB(db/dt);
	if (dbdt > 1000)
		seq_printf(seq, " speed: %ld,%03ld",
			dbdt/1000, dbdt % 1000);
	else
		seq_printf(seq, " speed: %ld", dbdt);

	/* mean speed since syncer started
	 * we do account for PausedSync periods */
	dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;
	if (dt <= 0)
		dt = 1;
	db = mdev->rs_total - rs_left;
	dbdt = Bit2KB(db/dt);
	if (dbdt > 1000)
		seq_printf(seq, " (%ld,%03ld)",
			dbdt/1000, dbdt % 1000);
	else
		seq_printf(seq, " (%ld)", dbdt);

	seq_printf(seq, " K/sec\n");
}

static void resync_dump_detail(struct seq_file *seq, struct lc_element *e)
{
	struct bm_extent *bme = lc_entry(e, struct bm_extent, lce);

	seq_printf(seq, "%5d %s %s\n", bme->rs_left,
		   bme->flags & BME_NO_WRITES ? "NO_WRITES" : "---------",
		   bme->flags & BME_LOCKED ? "LOCKED" : "------"
		   );
}

static int drbd_seq_show(struct seq_file *seq, void *v)
{
	int i, hole = 0;
	const char *sn;
	struct drbd_conf *mdev;

	static char write_ordering_chars[] = {
		[WO_none] = 'n',
		[WO_drain_io] = 'd',
		[WO_bdev_flush] = 'f',
		[WO_bio_barrier] = 'b',
	};

	seq_printf(seq, "version: " REL_VERSION " (api:%d/proto:%d-%d)\n%s\n",
		   API_VERSION, PRO_VERSION_MIN, PRO_VERSION_MAX, drbd_buildtag());

	/*
	  cs .. connection state
	  ro .. node role (local/remote)
	  ds .. disk state (local/remote)
	     protocol
	     various flags
	  ns .. network send
	  nr .. network receive
	  dw .. disk write
	  dr .. disk read
	  al .. activity log write count
	  bm .. bitmap update write count
	  pe .. pending (waiting for ack or data reply)
	  ua .. unack'd (still need to send ack or data reply)
	  ap .. application requests accepted, but not yet completed
	  ep .. number of epochs currently "on the fly", P_BARRIER_ACK pending
	  wo .. write ordering mode currently in use
	 oos .. known out-of-sync kB
	*/

	for (i = 0; i < minor_count; i++) {
		mdev = minor_to_mdev(i);
		if (!mdev) {
			hole = 1;
			continue;
		}
		if (hole) {
			hole = 0;
			seq_printf(seq, "\n");
		}

		sn = drbd_conn_str(mdev->state.conn);

		if (mdev->state.conn == C_STANDALONE &&
		    mdev->state.disk == D_DISKLESS &&
		    mdev->state.role == R_SECONDARY) {
			seq_printf(seq, "%2d: cs:Unconfigured\n", i);
		} else {
			seq_printf(seq,
			   "%2d: cs:%s ro:%s/%s ds:%s/%s %c %c%c%c%c%c\n"
			   "    ns:%u nr:%u dw:%u dr:%u al:%u bm:%u "
			   "lo:%d pe:%d ua:%d ap:%d ep:%d wo:%c",
			   i, sn,
			   drbd_role_str(mdev->state.role),
			   drbd_role_str(mdev->state.peer),
			   drbd_disk_str(mdev->state.disk),
			   drbd_disk_str(mdev->state.pdsk),
			   (mdev->net_conf == NULL ? ' ' :
			    (mdev->net_conf->wire_protocol - DRBD_PROT_A+'A')),
			   mdev->state.susp ? 's' : 'r',
			   mdev->state.aftr_isp ? 'a' : '-',
			   mdev->state.peer_isp ? 'p' : '-',
			   mdev->state.user_isp ? 'u' : '-',
			   mdev->congestion_reason ?: '-',
			   mdev->send_cnt/2,
			   mdev->recv_cnt/2,
			   mdev->writ_cnt/2,
			   mdev->read_cnt/2,
			   mdev->al_writ_cnt,
			   mdev->bm_writ_cnt,
			   atomic_read(&mdev->local_cnt),
			   atomic_read(&mdev->ap_pending_cnt) +
			   atomic_read(&mdev->rs_pending_cnt),
			   atomic_read(&mdev->unacked_cnt),
			   atomic_read(&mdev->ap_bio_cnt),
			   mdev->epochs,
			   write_ordering_chars[mdev->write_ordering]
			);
			seq_printf(seq, " oos:%lu\n",
				   Bit2KB(drbd_bm_total_weight(mdev)));
		}
		if (mdev->state.conn == C_SYNC_SOURCE ||
		    mdev->state.conn == C_SYNC_TARGET)
			drbd_syncer_progress(mdev, seq);

		if (mdev->state.conn == C_VERIFY_S || mdev->state.conn == C_VERIFY_T)
			seq_printf(seq, "\t%3d%%      %lu/%lu\n",
				   (int)((mdev->rs_total-mdev->ov_left) /
					 (mdev->rs_total/100+1)),
				   mdev->rs_total - mdev->ov_left,
				   mdev->rs_total);

		if (proc_details >= 1 && get_ldev_if_state(mdev, D_FAILED)) {
			lc_seq_printf_stats(seq, mdev->resync);
			lc_seq_printf_stats(seq, mdev->act_log);
			put_ldev(mdev);
		}

		if (proc_details >= 2) {
			if (mdev->resync) {
				lc_seq_dump_details(seq, mdev->resync, "rs_left",
						    resync_dump_detail);
			}
		}
	}

	return 0;
}

static int drbd_proc_open(struct inode *inode, struct file *file)
{
	return single_open(file, drbd_seq_show, PDE(inode)->data);
}

/* PROC FS stuff end */
+4456
drivers/block/drbd/drbd_receiver.c
/*
   drbd_receiver.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
 */


#include <linux/autoconf.h>
#include <linux/module.h>

#include <asm/uaccess.h>
#include <net/sock.h>

#include <linux/version.h>
#include <linux/drbd.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/in.h>
#include <linux/mm.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/slab.h>
#include <linux/smp_lock.h>
#include <linux/pkt_sched.h>
#define __KERNEL_SYSCALLS__
#include <linux/unistd.h>
#include <linux/vmalloc.h>
#include <linux/random.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/scatterlist.h>
#include "drbd_int.h"
#include "drbd_tracing.h"
#include "drbd_req.h"

#include "drbd_vli.h"

struct flush_work {
	struct drbd_work w;
	struct drbd_epoch *epoch;
};

enum finish_epoch {
	FE_STILL_LIVE,
	FE_DESTROYED,
	FE_RECYCLED,
};

static int drbd_do_handshake(struct drbd_conf *mdev);
static int drbd_do_auth(struct drbd_conf *mdev);

static enum finish_epoch drbd_may_finish_epoch(struct drbd_conf *, struct drbd_epoch *, enum epoch_event);
static int e_end_block(struct drbd_conf *, struct drbd_work *, int);

static struct drbd_epoch *previous_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch)
{
	struct drbd_epoch *prev;
	spin_lock(&mdev->epoch_lock);
	prev = list_entry(epoch->list.prev, struct drbd_epoch, list);
	if (prev == epoch || prev == mdev->current_epoch)
		prev = NULL;
	spin_unlock(&mdev->epoch_lock);
	return prev;
}

#define GFP_TRY	(__GFP_HIGHMEM | __GFP_NOWARN)

static struct page *drbd_pp_first_page_or_try_alloc(struct drbd_conf *mdev)
{
	struct page *page = NULL;

	/* Yes, testing drbd_pp_vacant outside the lock is racy.
	 * So what. It saves a spin_lock. */
	if (drbd_pp_vacant > 0) {
		spin_lock(&drbd_pp_lock);
		page = drbd_pp_pool;
		if (page) {
			drbd_pp_pool = (struct page *)page_private(page);
			set_page_private(page, 0); /* just to be polite */
			drbd_pp_vacant--;
		}
		spin_unlock(&drbd_pp_lock);
	}
	/* GFP_TRY, because we must not cause arbitrary write-out: in a DRBD
	 * "criss-cross" setup, that might cause write-out on some other DRBD,
	 * which in turn might block on the other node at this very place. */
	if (!page)
		page = alloc_page(GFP_TRY);
	if (page)
		atomic_inc(&mdev->pp_in_use);
	return page;
}

/* kick lower level device, if we have more than (arbitrary number)
 * reference counts on it, which typically are locally submitted io
 * requests. don't use unacked_cnt, so we speed up proto A and B, too. */
static void maybe_kick_lo(struct drbd_conf *mdev)
{
	if (atomic_read(&mdev->local_cnt) >= mdev->net_conf->unplug_watermark)
		drbd_kick_lo(mdev);
}

static void reclaim_net_ee(struct drbd_conf *mdev, struct list_head *to_be_freed)
{
	struct drbd_epoch_entry *e;
	struct list_head *le, *tle;

	/* The EEs are always appended to the end of the list. Since
	   they are sent in order over the wire, they have to finish
	   in order. As soon as we see the first not finished we can
	   stop to examine the list... */

	list_for_each_safe(le, tle, &mdev->net_ee) {
		e = list_entry(le, struct drbd_epoch_entry, w.list);
		if (drbd_bio_has_active_page(e->private_bio))
			break;
		list_move(le, to_be_freed);
	}
}

static void drbd_kick_lo_and_reclaim_net(struct drbd_conf *mdev)
{
	LIST_HEAD(reclaimed);
	struct drbd_epoch_entry *e, *t;

	maybe_kick_lo(mdev);
	spin_lock_irq(&mdev->req_lock);
	reclaim_net_ee(mdev, &reclaimed);
	spin_unlock_irq(&mdev->req_lock);

	list_for_each_entry_safe(e, t, &reclaimed, w.list)
		drbd_free_ee(mdev, e);
}

/**
 * drbd_pp_alloc() - Returns a page, fails only if a signal comes in
 * @mdev:	DRBD device.
 * @retry:	whether or not to retry allocation forever (or until signalled)
 *
 * Tries to allocate a page, first from our own page pool, then from the
 * kernel, unless this allocation would exceed the max_buffers setting.
 * If @retry is non-zero, retry until DRBD frees a page somewhere else.
 */
static struct page *drbd_pp_alloc(struct drbd_conf *mdev, int retry)
{
	struct page *page = NULL;
	DEFINE_WAIT(wait);

	if (atomic_read(&mdev->pp_in_use) < mdev->net_conf->max_buffers) {
		page = drbd_pp_first_page_or_try_alloc(mdev);
		if (page)
			return page;
	}

	for (;;) {
		prepare_to_wait(&drbd_pp_wait, &wait, TASK_INTERRUPTIBLE);

		drbd_kick_lo_and_reclaim_net(mdev);

		if (atomic_read(&mdev->pp_in_use) < mdev->net_conf->max_buffers) {
			page = drbd_pp_first_page_or_try_alloc(mdev);
			if (page)
				break;
		}

		if (!retry)
			break;

		if (signal_pending(current)) {
			dev_warn(DEV, "drbd_pp_alloc interrupted!\n");
			break;
		}

		schedule();
	}
	finish_wait(&drbd_pp_wait, &wait);

	return page;
}

/* Must not be used from irq, as that may deadlock: see drbd_pp_alloc.
 * Is also used from inside an other spin_lock_irq(&mdev->req_lock) */
static void drbd_pp_free(struct drbd_conf *mdev, struct page *page)
{
	int free_it;

	spin_lock(&drbd_pp_lock);
	if (drbd_pp_vacant > (DRBD_MAX_SEGMENT_SIZE/PAGE_SIZE)*minor_count) {
		free_it = 1;
	} else {
		set_page_private(page, (unsigned long)drbd_pp_pool);
		drbd_pp_pool = page;
		drbd_pp_vacant++;
		free_it = 0;
	}
	spin_unlock(&drbd_pp_lock);

	atomic_dec(&mdev->pp_in_use);

	if (free_it)
		__free_page(page);

	wake_up(&drbd_pp_wait);
}

static void drbd_pp_free_bio_pages(struct drbd_conf *mdev, struct bio *bio)
{
	struct page *p_to_be_freed = NULL;
	struct page *page;
	struct bio_vec *bvec;
	int i;

	spin_lock(&drbd_pp_lock);
	__bio_for_each_segment(bvec, bio, i, 0) {
		if (drbd_pp_vacant > (DRBD_MAX_SEGMENT_SIZE/PAGE_SIZE)*minor_count) {
			set_page_private(bvec->bv_page, (unsigned long)p_to_be_freed);
			p_to_be_freed = bvec->bv_page;
		} else {
			set_page_private(bvec->bv_page, (unsigned long)drbd_pp_pool);
			drbd_pp_pool = bvec->bv_page;
			drbd_pp_vacant++;
		}
	}
	spin_unlock(&drbd_pp_lock);
	atomic_sub(bio->bi_vcnt, &mdev->pp_in_use);

	while (p_to_be_freed) {
		page = p_to_be_freed;
		p_to_be_freed = (struct page *)page_private(page);
		set_page_private(page, 0); /* just to be polite */
		put_page(page);
	}

	wake_up(&drbd_pp_wait);
}

/*
You need to hold the req_lock:
 _drbd_wait_ee_list_empty()

You must not have the req_lock:
 drbd_free_ee()
 drbd_alloc_ee()
 drbd_init_ee()
 drbd_release_ee()
 drbd_ee_fix_bhs()
 drbd_process_done_ee()
 drbd_clear_done_ee()
 drbd_wait_ee_list_empty()
*/

struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
				       u64 id,
				       sector_t sector,
				       unsigned int data_size,
				       gfp_t gfp_mask) __must_hold(local)
{
	struct request_queue *q;
	struct drbd_epoch_entry *e;
	struct page *page;
	struct bio *bio;
	unsigned int ds;

	if (FAULT_ACTIVE(mdev, DRBD_FAULT_AL_EE))
		return NULL;

	e = mempool_alloc(drbd_ee_mempool, gfp_mask & ~__GFP_HIGHMEM);
	if (!e) {
		if (!(gfp_mask & __GFP_NOWARN))
			dev_err(DEV, "alloc_ee: Allocation of an EE failed\n");
		return NULL;
	}

	bio = bio_alloc(gfp_mask & ~__GFP_HIGHMEM, div_ceil(data_size, PAGE_SIZE));
	if (!bio) {
		if (!(gfp_mask & __GFP_NOWARN))
			dev_err(DEV, "alloc_ee: Allocation of a bio failed\n");
		goto fail1;
	}

	bio->bi_bdev = mdev->ldev->backing_bdev;
	bio->bi_sector = sector;

	ds = data_size;
	while (ds) {
		page = drbd_pp_alloc(mdev, (gfp_mask & __GFP_WAIT));
		if (!page) {
			if (!(gfp_mask & __GFP_NOWARN))
				dev_err(DEV, "alloc_ee: Allocation of a page failed\n");
			goto fail2;
		}
		if (!bio_add_page(bio, page, min_t(int, ds, PAGE_SIZE), 0)) {
			drbd_pp_free(mdev, page);
			dev_err(DEV, "alloc_ee: bio_add_page(s=%llu,"
				"data_size=%u,ds=%u) failed\n",
				(unsigned long long)sector, data_size, ds);

			q = bdev_get_queue(bio->bi_bdev);
			if (q->merge_bvec_fn) {
				struct bvec_merge_data bvm = {
					.bi_bdev = bio->bi_bdev,
					.bi_sector = bio->bi_sector,
					.bi_size = bio->bi_size,
					.bi_rw = bio->bi_rw,
				};
				int l = q->merge_bvec_fn(q, &bvm,
						&bio->bi_io_vec[bio->bi_vcnt]);
				dev_err(DEV, "merge_bvec_fn() = %d\n", l);
			}

			/* dump more of the bio. */
			dev_err(DEV, "bio->bi_max_vecs = %d\n", bio->bi_max_vecs);
			dev_err(DEV, "bio->bi_vcnt = %d\n", bio->bi_vcnt);
			dev_err(DEV, "bio->bi_size = %d\n", bio->bi_size);
			dev_err(DEV, "bio->bi_phys_segments = %d\n", bio->bi_phys_segments);

			goto fail2;
			break;
		}
		ds -= min_t(int, ds, PAGE_SIZE);
	}

	D_ASSERT(data_size == bio->bi_size);

	bio->bi_private = e;
	e->mdev = mdev;
	e->sector = sector;
	e->size = bio->bi_size;

	e->private_bio = bio;
	e->block_id = id;
	INIT_HLIST_NODE(&e->colision);
	e->epoch = NULL;
	e->flags = 0;

	trace_drbd_ee(mdev, e, "allocated");

	return e;

 fail2:
	drbd_pp_free_bio_pages(mdev, bio);
	bio_put(bio);
 fail1:
	mempool_free(e, drbd_ee_mempool);

	return NULL;
}

void drbd_free_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e)
{
	struct bio *bio = e->private_bio;
	trace_drbd_ee(mdev, e, "freed");
	drbd_pp_free_bio_pages(mdev, bio);
	bio_put(bio);
	D_ASSERT(hlist_unhashed(&e->colision));
	mempool_free(e, drbd_ee_mempool);
}

int drbd_release_ee(struct drbd_conf *mdev, struct list_head *list)
{
	LIST_HEAD(work_list);
	struct drbd_epoch_entry *e, *t;
	int count = 0;

	spin_lock_irq(&mdev->req_lock);
	list_splice_init(list, &work_list);
	spin_unlock_irq(&mdev->req_lock);

	list_for_each_entry_safe(e, t, &work_list, w.list) {
		drbd_free_ee(mdev, e);
		count++;
	}
	return count;
}


/*
 * This function is called from _asender only_
 * but see also comments in _req_mod(,barrier_acked)
 * and receive_Barrier.
 *
 * Move entries from net_ee to done_ee, if ready.
 * Grab done_ee, call all callbacks, free the entries.
 * The callbacks typically send out ACKs.
 */
static int drbd_process_done_ee(struct drbd_conf *mdev)
{
	LIST_HEAD(work_list);
	LIST_HEAD(reclaimed);
	struct drbd_epoch_entry *e, *t;
	int ok = (mdev->state.conn >= C_WF_REPORT_PARAMS);

	spin_lock_irq(&mdev->req_lock);
	reclaim_net_ee(mdev, &reclaimed);
	list_splice_init(&mdev->done_ee, &work_list);
	spin_unlock_irq(&mdev->req_lock);

	list_for_each_entry_safe(e, t, &reclaimed, w.list)
		drbd_free_ee(mdev, e);

	/* possible callbacks here:
	 * e_end_block, and e_end_resync_block, e_send_discard_ack.
	 * all ignore the last argument.
	 */
	list_for_each_entry_safe(e, t, &work_list, w.list) {
		trace_drbd_ee(mdev, e, "process_done_ee");
		/* list_del not necessary, next/prev members not touched */
		ok = e->w.cb(mdev, &e->w, !ok) && ok;
		drbd_free_ee(mdev, e);
	}
	wake_up(&mdev->ee_wait);

	return ok;
}

void _drbd_wait_ee_list_empty(struct drbd_conf *mdev, struct list_head *head)
{
	DEFINE_WAIT(wait);

	/* avoids spin_lock/unlock
	 * and calling prepare_to_wait in the fast path */
	while (!list_empty(head)) {
		prepare_to_wait(&mdev->ee_wait, &wait, TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&mdev->req_lock);
		drbd_kick_lo(mdev);
		schedule();
		finish_wait(&mdev->ee_wait, &wait);
		spin_lock_irq(&mdev->req_lock);
	}
}

void drbd_wait_ee_list_empty(struct drbd_conf *mdev, struct list_head *head)
{
	spin_lock_irq(&mdev->req_lock);
	_drbd_wait_ee_list_empty(mdev, head);
	spin_unlock_irq(&mdev->req_lock);
}

/* see also kernel_accept; which is only present since 2.6.18.
 * also we want to log which part of it failed, exactly */
static int drbd_accept(struct drbd_conf *mdev, const char **what,
		struct socket *sock, struct socket **newsock)
{
	struct sock *sk = sock->sk;
	int err = 0;

	*what = "listen";
	err = sock->ops->listen(sock, 5);
	if (err < 0)
		goto out;

	*what = "sock_create_lite";
	err = sock_create_lite(sk->sk_family, sk->sk_type, sk->sk_protocol,
			       newsock);
	if (err < 0)
		goto out;

	*what = "accept";
	err = sock->ops->accept(sock, *newsock, 0);
	if (err < 0) {
		sock_release(*newsock);
		*newsock = NULL;
		goto out;
	}
	(*newsock)->ops = sock->ops;

out:
	return err;
}

static int drbd_recv_short(struct drbd_conf *mdev, struct socket *sock,
		    void *buf, size_t size, int flags)
{
	mm_segment_t oldfs;
	struct kvec iov = {
		.iov_base = buf,
		.iov_len = size,
	};
	struct msghdr msg = {
		.msg_iovlen = 1,
		.msg_iov = (struct iovec *)&iov,
		.msg_flags = (flags ? flags : MSG_WAITALL | MSG_NOSIGNAL)
	};
	int rv;

	oldfs = get_fs();
	set_fs(KERNEL_DS);
	rv = sock_recvmsg(sock, &msg, size, msg.msg_flags);
	set_fs(oldfs);

	return rv;
}

static int drbd_recv(struct drbd_conf *mdev, void *buf, size_t size)
{
	mm_segment_t oldfs;
	struct kvec iov = {
		.iov_base = buf,
		.iov_len = size,
	};
	struct msghdr msg = {
		.msg_iovlen = 1,
		.msg_iov = (struct iovec *)&iov,
		.msg_flags = MSG_WAITALL | MSG_NOSIGNAL
	};
	int rv;

	oldfs = get_fs();
	set_fs(KERNEL_DS);

	for (;;) {
		rv = sock_recvmsg(mdev->data.socket, &msg, size, msg.msg_flags);
		if (rv == size)
			break;

		/* Note:
		 * ECONNRESET	other side closed the connection
		 * ERESTARTSYS	(on  sock) we got a signal
		 */

		if (rv < 0) {
			if (rv == -ECONNRESET)
				dev_info(DEV, "sock was reset by peer\n");
			else if (rv != -ERESTARTSYS)
				dev_err(DEV, "sock_recvmsg returned %d\n", rv);
			break;
		} else if (rv == 0) {
			dev_info(DEV, "sock was shut down by peer\n");
			break;
		} else {
			/* signal came in, or peer/link went down,
			 * after we read a partial message
			 */
			/* D_ASSERT(signal_pending(current)); */
			break;
		}
	};

	set_fs(oldfs);

	if (rv != size)
		drbd_force_state(mdev, NS(conn, C_BROKEN_PIPE));

	return rv;
}

static struct socket *drbd_try_connect(struct drbd_conf *mdev)
{
	const char *what;
	struct socket *sock;
	struct sockaddr_in6 src_in6;
	int err;
	int disconnect_on_error = 1;

	if (!get_net_conf(mdev))
		return NULL;

	what = "sock_create_kern";
	err = sock_create_kern(((struct sockaddr *)mdev->net_conf->my_addr)->sa_family,
		SOCK_STREAM, IPPROTO_TCP, &sock);
	if (err < 0) {
		sock = NULL;
		goto out;
	}

	sock->sk->sk_rcvtimeo =
	sock->sk->sk_sndtimeo = mdev->net_conf->try_connect_int*HZ;

	/* explicitly bind to the configured IP as source IP
	 * for the outgoing connections.
	 * This is needed for multihomed hosts and to be
	 * able to use lo: interfaces for drbd.
	 * Make sure to use 0 as port number, so linux selects
	 * a free one dynamically.
	 */
	memcpy(&src_in6, mdev->net_conf->my_addr,
	       min_t(int, mdev->net_conf->my_addr_len, sizeof(src_in6)));
	if (((struct sockaddr *)mdev->net_conf->my_addr)->sa_family == AF_INET6)
		src_in6.sin6_port = 0;
	else
		((struct sockaddr_in *)&src_in6)->sin_port = 0; /* AF_INET & AF_SCI */

	what = "bind before connect";
	err = sock->ops->bind(sock,
			      (struct sockaddr *) &src_in6,
			      mdev->net_conf->my_addr_len);
	if (err < 0)
		goto out;

	/* connect may fail, peer not yet available.
	 * stay C_WF_CONNECTION, don't go Disconnecting! */
	disconnect_on_error = 0;
	what = "connect";
	err = sock->ops->connect(sock,
				 (struct sockaddr *)mdev->net_conf->peer_addr,
				 mdev->net_conf->peer_addr_len, 0);

out:
	if (err < 0) {
		if (sock) {
			sock_release(sock);
			sock = NULL;
		}
		switch (-err) {
			/* timeout, busy, signal pending */
		case ETIMEDOUT: case EAGAIN: case EINPROGRESS:
		case EINTR: case ERESTARTSYS:
			/* peer not (yet) available, network problem */
		case ECONNREFUSED: case ENETUNREACH:
		case EHOSTDOWN:    case EHOSTUNREACH:
			disconnect_on_error = 0;
			break;
		default:
			dev_err(DEV, "%s failed, err = %d\n", what, err);
		}
		if (disconnect_on_error)
			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
	}
	put_net_conf(mdev);
	return sock;
}

static struct socket *drbd_wait_for_connect(struct drbd_conf *mdev)
{
	int timeo, err;
	struct socket *s_estab = NULL, *s_listen;
	const char *what;

	if (!get_net_conf(mdev))
		return NULL;

	what = "sock_create_kern";
	err = sock_create_kern(((struct sockaddr *)mdev->net_conf->my_addr)->sa_family,
		SOCK_STREAM, IPPROTO_TCP, &s_listen);
	if (err) {
		s_listen = NULL;
		goto out;
	}

	timeo = mdev->net_conf->try_connect_int * HZ;
	timeo += (random32() & 1) ? timeo / 7 : -timeo / 7; /* 28.5% random jitter */

	s_listen->sk->sk_reuse    = 1; /* SO_REUSEADDR */
	s_listen->sk->sk_rcvtimeo = timeo;
	s_listen->sk->sk_sndtimeo = timeo;

	what = "bind before listen";
	err = s_listen->ops->bind(s_listen,
			      (struct sockaddr *) mdev->net_conf->my_addr,
			      mdev->net_conf->my_addr_len);
	if (err < 0)
		goto out;

	err = drbd_accept(mdev, &what, s_listen, &s_estab);

out:
	if (s_listen)
		sock_release(s_listen);
	if (err < 0) {
		if (err != -EAGAIN && err != -EINTR && err != -ERESTARTSYS) {
			dev_err(DEV, "%s failed, err = %d\n", what, err);
			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
		}
	}
	put_net_conf(mdev);

	return s_estab;
}

static int drbd_send_fp(struct drbd_conf *mdev,
	struct socket *sock, enum drbd_packets cmd)
{
	struct p_header *h = (struct p_header *) &mdev->data.sbuf.header;

	return _drbd_send_cmd(mdev, sock, cmd, h, sizeof(*h), 0);
}

static enum drbd_packets drbd_recv_fp(struct drbd_conf *mdev, struct socket *sock)
{
	struct p_header *h = (struct p_header *) &mdev->data.sbuf.header;
	int rr;

	rr = drbd_recv_short(mdev, sock, h, sizeof(*h), 0);

	if (rr == sizeof(*h) && h->magic == BE_DRBD_MAGIC)
		return be16_to_cpu(h->command);

	return 0xffff;
}

/**
 * drbd_socket_okay() - Free the socket if its connection is not okay
 * @mdev:	DRBD device.
 * @sock:	pointer to the pointer to the socket.
 */
static int drbd_socket_okay(struct drbd_conf *mdev, struct socket **sock)
{
	int rr;
	char tb[4];

	if (!*sock)
		return FALSE;

	rr = drbd_recv_short(mdev, *sock, tb, 4, MSG_DONTWAIT | MSG_PEEK);

	if (rr > 0 || rr == -EAGAIN) {
		return TRUE;
	} else {
		sock_release(*sock);
		*sock = NULL;
		return FALSE;
	}
}

/*
 * return values:
 *   1 yes, we have a valid connection
 *   0 oops, did not work out, please try again
 *  -1 peer talks different language,
 *     no point in trying again, please go standalone.
 *  -2 We do not have a network config...
 */
static int drbd_connect(struct drbd_conf *mdev)
{
	struct socket *s, *sock, *msock;
	int try, h, ok;

	D_ASSERT(!mdev->data.socket);

	if (test_and_clear_bit(CREATE_BARRIER, &mdev->flags))
		dev_err(DEV, "CREATE_BARRIER flag was set in drbd_connect - now cleared!\n");

	if (drbd_request_state(mdev, NS(conn, C_WF_CONNECTION)) < SS_SUCCESS)
		return -2;

	clear_bit(DISCARD_CONCURRENT, &mdev->flags);

	sock  = NULL;
	msock = NULL;

	do {
		for (try = 0;;) {
			/* 3 tries, this should take less than a second! */
			s = drbd_try_connect(mdev);
			if (s || ++try >= 3)
				break;
			/* give the other side time to call bind() & listen() */
			__set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(HZ / 10);
		}

		if (s) {
			if (!sock) {
				drbd_send_fp(mdev, s, P_HAND_SHAKE_S);
				sock = s;
				s = NULL;
			} else if (!msock) {
				drbd_send_fp(mdev, s, P_HAND_SHAKE_M);
				msock = s;
				s = NULL;
			} else {
				dev_err(DEV, "Logic error in drbd_connect()\n");
				goto out_release_sockets;
			}
		}

		if (sock && msock) {
			__set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(HZ / 10);
			ok = drbd_socket_okay(mdev, &sock);
			ok = drbd_socket_okay(mdev, &msock) && ok;
			if (ok)
				break;
		}

retry:
		s = drbd_wait_for_connect(mdev);
		if (s) {
			try = drbd_recv_fp(mdev, s);
			drbd_socket_okay(mdev, &sock);
			drbd_socket_okay(mdev, &msock);
			switch (try) {
			case P_HAND_SHAKE_S:
				if (sock) {
					dev_warn(DEV, "initial packet S crossed\n");
					sock_release(sock);
				}
				sock = s;
				break;
			case P_HAND_SHAKE_M:
				if (msock) {
					dev_warn(DEV, "initial packet M crossed\n");
					sock_release(msock);
				}
				msock = s;
				set_bit(DISCARD_CONCURRENT, &mdev->flags);
				break;
			default:
				dev_warn(DEV, "Error receiving initial packet\n");
				sock_release(s);
				if (random32() & 1)
					goto retry;
			}
		}

		if (mdev->state.conn <= C_DISCONNECTING)
			goto out_release_sockets;
		if (signal_pending(current)) {
			flush_signals(current);
			smp_rmb();
			if (get_t_state(&mdev->receiver) == Exiting)
				goto out_release_sockets;
		}

		if (sock && msock) {
			ok = drbd_socket_okay(mdev, &sock);
			ok = drbd_socket_okay(mdev, &msock) && ok;
			if (ok)
				break;
		}
	} while (1);

	msock->sk->sk_reuse = 1; /* SO_REUSEADDR */
	sock->sk->sk_reuse = 1; /* SO_REUSEADDR */

	sock->sk->sk_allocation = GFP_NOIO;
	msock->sk->sk_allocation = GFP_NOIO;

	sock->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
	msock->sk->sk_priority = TC_PRIO_INTERACTIVE;

	if (mdev->net_conf->sndbuf_size) {
		sock->sk->sk_sndbuf = mdev->net_conf->sndbuf_size;
		sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
	}

	if (mdev->net_conf->rcvbuf_size) {
		sock->sk->sk_rcvbuf = mdev->net_conf->rcvbuf_size;
		sock->sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
	}

	/* NOT YET ...
	 * sock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
	 * sock->sk->sk_rcvtimeo = MAX_SCHEDULE_TIMEOUT;
	 * first set it to the P_HAND_SHAKE timeout,
	 * which we set to 4x the configured ping_timeout. */
	sock->sk->sk_sndtimeo =
	sock->sk->sk_rcvtimeo = mdev->net_conf->ping_timeo*4*HZ/10;

	msock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
	msock->sk->sk_rcvtimeo = mdev->net_conf->ping_int*HZ;

	/* we don't want delays.
	 * we use TCP_CORK where apropriate, though */
	drbd_tcp_nodelay(sock);
	drbd_tcp_nodelay(msock);

	mdev->data.socket = sock;
	mdev->meta.socket = msock;
	mdev->last_received = jiffies;

	D_ASSERT(mdev->asender.task == NULL);

	h = drbd_do_handshake(mdev);
	if (h <= 0)
		return h;

	if (mdev->cram_hmac_tfm) {
		/* drbd_request_state(mdev, NS(conn, WFAuth)); */
		if (!drbd_do_auth(mdev)) {
			dev_err(DEV, "Authentication of peer failed\n");
			return -1;
		}
	}

	if (drbd_request_state(mdev, NS(conn, C_WF_REPORT_PARAMS)) < SS_SUCCESS)
		return 0;

	sock->sk->sk_sndtimeo = mdev->net_conf->timeout*HZ/10;
	sock->sk->sk_rcvtimeo = MAX_SCHEDULE_TIMEOUT;

	atomic_set(&mdev->packet_seq, 0);
	mdev->peer_seq = 0;

	drbd_thread_start(&mdev->asender);

	drbd_send_protocol(mdev);
	drbd_send_sync_param(mdev, &mdev->sync_conf);
	drbd_send_sizes(mdev, 0);
	drbd_send_uuids(mdev);
	drbd_send_state(mdev);
	clear_bit(USE_DEGR_WFC_T, &mdev->flags);
	clear_bit(RESIZE_PENDING, &mdev->flags);

	return 1;

out_release_sockets:
	if (sock)
		sock_release(sock);
	if (msock)
		sock_release(msock);
	return -1;
}

static int drbd_recv_header(struct drbd_conf *mdev, struct p_header *h)
{
	int r;

	r = drbd_recv(mdev, h, sizeof(*h));

	if (unlikely(r != sizeof(*h))) {
		dev_err(DEV, "short read expecting header on sock: r=%d\n", r);
		return FALSE;
	};
	h->command = be16_to_cpu(h->command);
	h->length  = be16_to_cpu(h->length);
	if (unlikely(h->magic != BE_DRBD_MAGIC)) {
		dev_err(DEV, "magic?? on data m: 0x%lx c: %d l: %d\n",
		    (long)be32_to_cpu(h->magic),
		    h->command, h->length);
		return FALSE;
	}
	mdev->last_received = jiffies;

	return TRUE;
}

static enum finish_epoch drbd_flush_after_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch)
{
	int rv;

	if (mdev->write_ordering >= WO_bdev_flush && get_ldev(mdev)) {
		rv = blkdev_issue_flush(mdev->ldev->backing_bdev, NULL);
		if (rv) {
			dev_err(DEV, "local disk flush failed with status %d\n", rv);
			/* would rather check on EOPNOTSUPP, but that is not reliable.
			 * don't try again for ANY return value != 0
			 * if (rv == -EOPNOTSUPP) */
			drbd_bump_write_ordering(mdev, WO_drain_io);
		}
		put_ldev(mdev);
	}

	return drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE);
}

static int w_flush(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
{
	struct flush_work *fw = (struct flush_work *)w;
	struct drbd_epoch *epoch = fw->epoch;

	kfree(w);

	if (!test_and_set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags))
		drbd_flush_after_epoch(mdev, epoch);

	drbd_may_finish_epoch(mdev, epoch, EV_PUT |
			      (mdev->state.conn < C_CONNECTED ? EV_CLEANUP : 0));

	return 1;
}

/**
 * drbd_may_finish_epoch() - Applies an epoch_event to the epoch's state, eventually finishes it.
 * @mdev:	DRBD device.
 * @epoch:	Epoch object.
 * @ev:		Epoch event.
 */
static enum finish_epoch drbd_may_finish_epoch(struct drbd_conf *mdev,
					       struct drbd_epoch *epoch,
					       enum epoch_event ev)
{
	int finish, epoch_size;
	struct drbd_epoch *next_epoch;
	int schedule_flush = 0;
	enum finish_epoch rv = FE_STILL_LIVE;

	spin_lock(&mdev->epoch_lock);
	do {
		next_epoch = NULL;
		finish = 0;

		epoch_size = atomic_read(&epoch->epoch_size);

		switch (ev & ~EV_CLEANUP) {
		case EV_PUT:
			atomic_dec(&epoch->active);
			break;
		case EV_GOT_BARRIER_NR:
			set_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags);

			/* Special case: If we just switched from WO_bio_barrier to
			   WO_bdev_flush we should not finish the current epoch */
			if (test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags) && epoch_size == 1 &&
			    mdev->write_ordering != WO_bio_barrier &&
			    epoch == mdev->current_epoch)
				clear_bit(DE_CONTAINS_A_BARRIER, &epoch->flags);
			break;
		case EV_BARRIER_DONE:
			set_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags);
			break;
		case EV_BECAME_LAST:
			/* nothing to do*/
			break;
		}

		trace_drbd_epoch(mdev, epoch, ev);

		if (epoch_size != 0 &&
		    atomic_read(&epoch->active) == 0 &&
		    test_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags) &&
		    epoch->list.prev == &mdev->current_epoch->list &&
		    !test_bit(DE_IS_FINISHING, &epoch->flags)) {
			/* Nearly all conditions are met to finish that epoch... */
			if (test_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags) ||
			    mdev->write_ordering == WO_none ||
			    (epoch_size == 1 && test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags)) ||
			    ev & EV_CLEANUP) {
				finish = 1;
				set_bit(DE_IS_FINISHING, &epoch->flags);
			} else if (!test_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags) &&
				 mdev->write_ordering == WO_bio_barrier) {
				atomic_inc(&epoch->active);
				schedule_flush = 1;
			}
		}
		if (finish) {
			if (!(ev & EV_CLEANUP)) {
				spin_unlock(&mdev->epoch_lock);
				drbd_send_b_ack(mdev, epoch->barrier_nr, epoch_size);
				spin_lock(&mdev->epoch_lock);
			}
			dec_unacked(mdev);

			if (mdev->current_epoch != epoch) {
				next_epoch = list_entry(epoch->list.next, struct drbd_epoch, list);
				list_del(&epoch->list);
				ev = EV_BECAME_LAST | (ev & EV_CLEANUP);
				mdev->epochs--;
				trace_drbd_epoch(mdev, epoch, EV_TRACE_FREE);
				kfree(epoch);

				if (rv == FE_STILL_LIVE)
					rv = FE_DESTROYED;
			} else {
				epoch->flags = 0;
				atomic_set(&epoch->epoch_size, 0);
				/* atomic_set(&epoch->active, 0); is alrady zero */
				if (rv == FE_STILL_LIVE)
					rv = FE_RECYCLED;
			}
		}

		if (!next_epoch)
			break;

		epoch = next_epoch;
	} while (1);

	spin_unlock(&mdev->epoch_lock);

	if (schedule_flush) {
		struct flush_work *fw;
		fw = kmalloc(sizeof(*fw), GFP_ATOMIC);
		if (fw) {
			trace_drbd_epoch(mdev, epoch, EV_TRACE_FLUSH);
			fw->w.cb = w_flush;
			fw->epoch = epoch;
			drbd_queue_work(&mdev->data.work, &fw->w);
		} else {
			dev_warn(DEV, "Could not kmalloc a flush_work obj\n");
			set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags);
/* That is not a recursion, only one level */10921092+ drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE);10931093+ drbd_may_finish_epoch(mdev, epoch, EV_PUT);10941094+ }10951095+ }10961096+10971097+ return rv;10981098+}10991099+11001100+/**11011101+ * drbd_bump_write_ordering() - Fall back to an other write ordering method11021102+ * @mdev: DRBD device.11031103+ * @wo: Write ordering method to try.11041104+ */11051105+void drbd_bump_write_ordering(struct drbd_conf *mdev, enum write_ordering_e wo) __must_hold(local)11061106+{11071107+ enum write_ordering_e pwo;11081108+ static char *write_ordering_str[] = {11091109+ [WO_none] = "none",11101110+ [WO_drain_io] = "drain",11111111+ [WO_bdev_flush] = "flush",11121112+ [WO_bio_barrier] = "barrier",11131113+ };11141114+11151115+ pwo = mdev->write_ordering;11161116+ wo = min(pwo, wo);11171117+ if (wo == WO_bio_barrier && mdev->ldev->dc.no_disk_barrier)11181118+ wo = WO_bdev_flush;11191119+ if (wo == WO_bdev_flush && mdev->ldev->dc.no_disk_flush)11201120+ wo = WO_drain_io;11211121+ if (wo == WO_drain_io && mdev->ldev->dc.no_disk_drain)11221122+ wo = WO_none;11231123+ mdev->write_ordering = wo;11241124+ if (pwo != mdev->write_ordering || wo == WO_bio_barrier)11251125+ dev_info(DEV, "Method to ensure write ordering: %s\n", write_ordering_str[mdev->write_ordering]);11261126+}11271127+11281128+/**11291129+ * w_e_reissue() - Worker callback; Resubmit a bio, without BIO_RW_BARRIER set11301130+ * @mdev: DRBD device.11311131+ * @w: work object.11321132+ * @cancel: The connection will be closed anyways (unused in this callback)11331133+ */11341134+int w_e_reissue(struct drbd_conf *mdev, struct drbd_work *w, int cancel) __releases(local)11351135+{11361136+ struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;11371137+ struct bio *bio = e->private_bio;11381138+11391139+ /* We leave DE_CONTAINS_A_BARRIER and EE_IS_BARRIER in place,11401140+ (and DE_BARRIER_IN_NEXT_EPOCH_ISSUED in the previous Epoch)11411141+ so that we can finish 
that epoch in drbd_may_finish_epoch().11421142+ That is necessary if we already have a long chain of Epochs, before11431143+ we realize that BIO_RW_BARRIER is actually not supported */11441144+11451145+ /* As long as the -ENOTSUPP on the barrier is reported immediately11461146+ that will never trigger. If it is reported late, we will just11471147+ print that warning and continue correctly for all future requests11481148+ with WO_bdev_flush */11491149+ if (previous_epoch(mdev, e->epoch))11501150+ dev_warn(DEV, "Write ordering was not enforced (one time event)\n");11511151+11521152+ /* prepare bio for re-submit,11531153+ * re-init volatile members */11541154+ /* we still have a local reference,11551155+ * get_ldev was done in receive_Data. */11561156+ bio->bi_bdev = mdev->ldev->backing_bdev;11571157+ bio->bi_sector = e->sector;11581158+ bio->bi_size = e->size;11591159+ bio->bi_idx = 0;11601160+11611161+ bio->bi_flags &= ~(BIO_POOL_MASK - 1);11621162+ bio->bi_flags |= 1 << BIO_UPTODATE;11631163+11641164+ /* don't know whether this is necessary: */11651165+ bio->bi_phys_segments = 0;11661166+ bio->bi_next = NULL;11671167+11681168+ /* these should be unchanged: */11691169+ /* bio->bi_end_io = drbd_endio_write_sec; */11701170+ /* bio->bi_vcnt = whatever; */11711171+11721172+ e->w.cb = e_end_block;11731173+11741174+ /* This is no longer a barrier request. 
*/11751175+ bio->bi_rw &= ~(1UL << BIO_RW_BARRIER);11761176+11771177+ drbd_generic_make_request(mdev, DRBD_FAULT_DT_WR, bio);11781178+11791179+ return 1;11801180+}11811181+11821182+static int receive_Barrier(struct drbd_conf *mdev, struct p_header *h)11831183+{11841184+ int rv, issue_flush;11851185+ struct p_barrier *p = (struct p_barrier *)h;11861186+ struct drbd_epoch *epoch;11871187+11881188+ ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;11891189+11901190+ rv = drbd_recv(mdev, h->payload, h->length);11911191+ ERR_IF(rv != h->length) return FALSE;11921192+11931193+ inc_unacked(mdev);11941194+11951195+ if (mdev->net_conf->wire_protocol != DRBD_PROT_C)11961196+ drbd_kick_lo(mdev);11971197+11981198+ mdev->current_epoch->barrier_nr = p->barrier;11991199+ rv = drbd_may_finish_epoch(mdev, mdev->current_epoch, EV_GOT_BARRIER_NR);12001200+12011201+ /* P_BARRIER_ACK may imply that the corresponding extent is dropped from12021202+ * the activity log, which means it would not be resynced in case the12031203+ * R_PRIMARY crashes now.12041204+ * Therefore we must send the barrier_ack after the barrier request was12051205+ * completed. */12061206+ switch (mdev->write_ordering) {12071207+ case WO_bio_barrier:12081208+ case WO_none:12091209+ if (rv == FE_RECYCLED)12101210+ return TRUE;12111211+ break;12121212+12131213+ case WO_bdev_flush:12141214+ case WO_drain_io:12151215+ D_ASSERT(rv == FE_STILL_LIVE);12161216+ set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &mdev->current_epoch->flags);12171217+ drbd_wait_ee_list_empty(mdev, &mdev->active_ee);12181218+ rv = drbd_flush_after_epoch(mdev, mdev->current_epoch);12191219+ if (rv == FE_RECYCLED)12201220+ return TRUE;12211221+12221222+ /* The asender will send all the ACKs and barrier ACKs out, since12231223+ all EEs moved from the active_ee to the done_ee. 
We need to12241224+ provide a new epoch object for the EEs that come in soon */12251225+ break;12261226+ }12271227+12281228+ /* receiver context, in the writeout path of the other node.12291229+ * avoid potential distributed deadlock */12301230+ epoch = kmalloc(sizeof(struct drbd_epoch), GFP_NOIO);12311231+ if (!epoch) {12321232+ dev_warn(DEV, "Allocation of an epoch failed, slowing down\n");12331233+ issue_flush = !test_and_set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags);12341234+ drbd_wait_ee_list_empty(mdev, &mdev->active_ee);12351235+ if (issue_flush) {12361236+ rv = drbd_flush_after_epoch(mdev, mdev->current_epoch);12371237+ if (rv == FE_RECYCLED)12381238+ return TRUE;12391239+ }12401240+12411241+ drbd_wait_ee_list_empty(mdev, &mdev->done_ee);12421242+12431243+ return TRUE;12441244+ }12451245+12461246+ epoch->flags = 0;12471247+ atomic_set(&epoch->epoch_size, 0);12481248+ atomic_set(&epoch->active, 0);12491249+12501250+ spin_lock(&mdev->epoch_lock);12511251+ if (atomic_read(&mdev->current_epoch->epoch_size)) {12521252+ list_add(&epoch->list, &mdev->current_epoch->list);12531253+ mdev->current_epoch = epoch;12541254+ mdev->epochs++;12551255+ trace_drbd_epoch(mdev, epoch, EV_TRACE_ALLOC);12561256+ } else {12571257+ /* The current_epoch got recycled while we allocated this one... 
*/12581258+ kfree(epoch);12591259+ }12601260+ spin_unlock(&mdev->epoch_lock);12611261+12621262+ return TRUE;12631263+}12641264+12651265+/* used from receive_RSDataReply (recv_resync_read)12661266+ * and from receive_Data */12671267+static struct drbd_epoch_entry *12681268+read_in_block(struct drbd_conf *mdev, u64 id, sector_t sector, int data_size) __must_hold(local)12691269+{12701270+ struct drbd_epoch_entry *e;12711271+ struct bio_vec *bvec;12721272+ struct page *page;12731273+ struct bio *bio;12741274+ int dgs, ds, i, rr;12751275+ void *dig_in = mdev->int_dig_in;12761276+ void *dig_vv = mdev->int_dig_vv;12771277+12781278+ dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_r_tfm) ?12791279+ crypto_hash_digestsize(mdev->integrity_r_tfm) : 0;12801280+12811281+ if (dgs) {12821282+ rr = drbd_recv(mdev, dig_in, dgs);12831283+ if (rr != dgs) {12841284+ dev_warn(DEV, "short read receiving data digest: read %d expected %d\n",12851285+ rr, dgs);12861286+ return NULL;12871287+ }12881288+ }12891289+12901290+ data_size -= dgs;12911291+12921292+ ERR_IF(data_size & 0x1ff) return NULL;12931293+ ERR_IF(data_size > DRBD_MAX_SEGMENT_SIZE) return NULL;12941294+12951295+ /* GFP_NOIO, because we must not cause arbitrary write-out: in a DRBD12961296+ * "criss-cross" setup, that might cause write-out on some other DRBD,12971297+ * which in turn might block on the other node at this very place. 
*/12981298+ e = drbd_alloc_ee(mdev, id, sector, data_size, GFP_NOIO);12991299+ if (!e)13001300+ return NULL;13011301+ bio = e->private_bio;13021302+ ds = data_size;13031303+ bio_for_each_segment(bvec, bio, i) {13041304+ page = bvec->bv_page;13051305+ rr = drbd_recv(mdev, kmap(page), min_t(int, ds, PAGE_SIZE));13061306+ kunmap(page);13071307+ if (rr != min_t(int, ds, PAGE_SIZE)) {13081308+ drbd_free_ee(mdev, e);13091309+ dev_warn(DEV, "short read receiving data: read %d expected %d\n",13101310+ rr, min_t(int, ds, PAGE_SIZE));13111311+ return NULL;13121312+ }13131313+ ds -= rr;13141314+ }13151315+13161316+ if (dgs) {13171317+ drbd_csum(mdev, mdev->integrity_r_tfm, bio, dig_vv);13181318+ if (memcmp(dig_in, dig_vv, dgs)) {13191319+ dev_err(DEV, "Digest integrity check FAILED.\n");13201320+ drbd_bcast_ee(mdev, "digest failed",13211321+ dgs, dig_in, dig_vv, e);13221322+ drbd_free_ee(mdev, e);13231323+ return NULL;13241324+ }13251325+ }13261326+ mdev->recv_cnt += data_size>>9;13271327+ return e;13281328+}13291329+13301330+/* drbd_drain_block() just takes a data block13311331+ * out of the socket input buffer, and discards it.13321332+ */13331333+static int drbd_drain_block(struct drbd_conf *mdev, int data_size)13341334+{13351335+ struct page *page;13361336+ int rr, rv = 1;13371337+ void *data;13381338+13391339+ page = drbd_pp_alloc(mdev, 1);13401340+13411341+ data = kmap(page);13421342+ while (data_size) {13431343+ rr = drbd_recv(mdev, data, min_t(int, data_size, PAGE_SIZE));13441344+ if (rr != min_t(int, data_size, PAGE_SIZE)) {13451345+ rv = 0;13461346+ dev_warn(DEV, "short read receiving data: read %d expected %d\n",13471347+ rr, min_t(int, data_size, PAGE_SIZE));13481348+ break;13491349+ }13501350+ data_size -= rr;13511351+ }13521352+ kunmap(page);13531353+ drbd_pp_free(mdev, page);13541354+ return rv;13551355+}13561356+13571357+static int recv_dless_read(struct drbd_conf *mdev, struct drbd_request *req,13581358+ sector_t sector, int data_size)13591359+{13601360+ 
struct bio_vec *bvec;13611361+ struct bio *bio;13621362+ int dgs, rr, i, expect;13631363+ void *dig_in = mdev->int_dig_in;13641364+ void *dig_vv = mdev->int_dig_vv;13651365+13661366+ dgs = (mdev->agreed_pro_version >= 87 && mdev->integrity_r_tfm) ?13671367+ crypto_hash_digestsize(mdev->integrity_r_tfm) : 0;13681368+13691369+ if (dgs) {13701370+ rr = drbd_recv(mdev, dig_in, dgs);13711371+ if (rr != dgs) {13721372+ dev_warn(DEV, "short read receiving data reply digest: read %d expected %d\n",13731373+ rr, dgs);13741374+ return 0;13751375+ }13761376+ }13771377+13781378+ data_size -= dgs;13791379+13801380+ /* optimistically update recv_cnt. if receiving fails below,13811381+ * we disconnect anyways, and counters will be reset. */13821382+ mdev->recv_cnt += data_size>>9;13831383+13841384+ bio = req->master_bio;13851385+ D_ASSERT(sector == bio->bi_sector);13861386+13871387+ bio_for_each_segment(bvec, bio, i) {13881388+ expect = min_t(int, data_size, bvec->bv_len);13891389+ rr = drbd_recv(mdev,13901390+ kmap(bvec->bv_page)+bvec->bv_offset,13911391+ expect);13921392+ kunmap(bvec->bv_page);13931393+ if (rr != expect) {13941394+ dev_warn(DEV, "short read receiving data reply: "13951395+ "read %d expected %d\n",13961396+ rr, expect);13971397+ return 0;13981398+ }13991399+ data_size -= rr;14001400+ }14011401+14021402+ if (dgs) {14031403+ drbd_csum(mdev, mdev->integrity_r_tfm, bio, dig_vv);14041404+ if (memcmp(dig_in, dig_vv, dgs)) {14051405+ dev_err(DEV, "Digest integrity check FAILED. 
Broken NICs?\n");14061406+ return 0;14071407+ }14081408+ }14091409+14101410+ D_ASSERT(data_size == 0);14111411+ return 1;14121412+}14131413+14141414+/* e_end_resync_block() is called via14151415+ * drbd_process_done_ee() by asender only */14161416+static int e_end_resync_block(struct drbd_conf *mdev, struct drbd_work *w, int unused)14171417+{14181418+ struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;14191419+ sector_t sector = e->sector;14201420+ int ok;14211421+14221422+ D_ASSERT(hlist_unhashed(&e->colision));14231423+14241424+ if (likely(drbd_bio_uptodate(e->private_bio))) {14251425+ drbd_set_in_sync(mdev, sector, e->size);14261426+ ok = drbd_send_ack(mdev, P_RS_WRITE_ACK, e);14271427+ } else {14281428+ /* Record failure to sync */14291429+ drbd_rs_failed_io(mdev, sector, e->size);14301430+14311431+ ok = drbd_send_ack(mdev, P_NEG_ACK, e);14321432+ }14331433+ dec_unacked(mdev);14341434+14351435+ return ok;14361436+}14371437+14381438+static int recv_resync_read(struct drbd_conf *mdev, sector_t sector, int data_size) __releases(local)14391439+{14401440+ struct drbd_epoch_entry *e;14411441+14421442+ e = read_in_block(mdev, ID_SYNCER, sector, data_size);14431443+ if (!e) {14441444+ put_ldev(mdev);14451445+ return FALSE;14461446+ }14471447+14481448+ dec_rs_pending(mdev);14491449+14501450+ e->private_bio->bi_end_io = drbd_endio_write_sec;14511451+ e->private_bio->bi_rw = WRITE;14521452+ e->w.cb = e_end_resync_block;14531453+14541454+ inc_unacked(mdev);14551455+ /* corresponding dec_unacked() in e_end_resync_block()14561456+ * respective _drbd_clear_done_ee */14571457+14581458+ spin_lock_irq(&mdev->req_lock);14591459+ list_add(&e->w.list, &mdev->sync_ee);14601460+ spin_unlock_irq(&mdev->req_lock);14611461+14621462+ trace_drbd_ee(mdev, e, "submitting for (rs)write");14631463+ trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);14641464+ drbd_generic_make_request(mdev, DRBD_FAULT_RS_WR, e->private_bio);14651465+ /* accounting done in endio */14661466+14671467+ 
maybe_kick_lo(mdev);14681468+ return TRUE;14691469+}14701470+14711471+static int receive_DataReply(struct drbd_conf *mdev, struct p_header *h)14721472+{14731473+ struct drbd_request *req;14741474+ sector_t sector;14751475+ unsigned int header_size, data_size;14761476+ int ok;14771477+ struct p_data *p = (struct p_data *)h;14781478+14791479+ header_size = sizeof(*p) - sizeof(*h);14801480+ data_size = h->length - header_size;14811481+14821482+ ERR_IF(data_size == 0) return FALSE;14831483+14841484+ if (drbd_recv(mdev, h->payload, header_size) != header_size)14851485+ return FALSE;14861486+14871487+ sector = be64_to_cpu(p->sector);14881488+14891489+ spin_lock_irq(&mdev->req_lock);14901490+ req = _ar_id_to_req(mdev, p->block_id, sector);14911491+ spin_unlock_irq(&mdev->req_lock);14921492+ if (unlikely(!req)) {14931493+ dev_err(DEV, "Got a corrupt block_id/sector pair(1).\n");14941494+ return FALSE;14951495+ }14961496+14971497+ /* hlist_del(&req->colision) is done in _req_may_be_done, to avoid14981498+ * special casing it there for the various failure cases.14991499+ * still no race with drbd_fail_pending_reads */15001500+ ok = recv_dless_read(mdev, req, sector, data_size);15011501+15021502+ if (ok)15031503+ req_mod(req, data_received);15041504+ /* else: nothing. 
handled from drbd_disconnect...15051505+ * I don't think we may complete this just yet15061506+ * in case we are "on-disconnect: freeze" */15071507+15081508+ return ok;15091509+}15101510+15111511+static int receive_RSDataReply(struct drbd_conf *mdev, struct p_header *h)15121512+{15131513+ sector_t sector;15141514+ unsigned int header_size, data_size;15151515+ int ok;15161516+ struct p_data *p = (struct p_data *)h;15171517+15181518+ header_size = sizeof(*p) - sizeof(*h);15191519+ data_size = h->length - header_size;15201520+15211521+ ERR_IF(data_size == 0) return FALSE;15221522+15231523+ if (drbd_recv(mdev, h->payload, header_size) != header_size)15241524+ return FALSE;15251525+15261526+ sector = be64_to_cpu(p->sector);15271527+ D_ASSERT(p->block_id == ID_SYNCER);15281528+15291529+ if (get_ldev(mdev)) {15301530+ /* data is submitted to disk within recv_resync_read.15311531+ * corresponding put_ldev done below on error,15321532+ * or in drbd_endio_write_sec. */15331533+ ok = recv_resync_read(mdev, sector, data_size);15341534+ } else {15351535+ if (__ratelimit(&drbd_ratelimit_state))15361536+ dev_err(DEV, "Can not write resync data to local disk.\n");15371537+15381538+ ok = drbd_drain_block(mdev, data_size);15391539+15401540+ drbd_send_ack_dp(mdev, P_NEG_ACK, p);15411541+ }15421542+15431543+ return ok;15441544+}15451545+15461546+/* e_end_block() is called via drbd_process_done_ee().15471547+ * this means this function only runs in the asender thread15481548+ */15491549+static int e_end_block(struct drbd_conf *mdev, struct drbd_work *w, int cancel)15501550+{15511551+ struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;15521552+ sector_t sector = e->sector;15531553+ struct drbd_epoch *epoch;15541554+ int ok = 1, pcmd;15551555+15561556+ if (e->flags & EE_IS_BARRIER) {15571557+ epoch = previous_epoch(mdev, e->epoch);15581558+ if (epoch)15591559+ drbd_may_finish_epoch(mdev, epoch, EV_BARRIER_DONE + (cancel ? 
EV_CLEANUP : 0));15601560+ }15611561+15621562+ if (mdev->net_conf->wire_protocol == DRBD_PROT_C) {15631563+ if (likely(drbd_bio_uptodate(e->private_bio))) {15641564+ pcmd = (mdev->state.conn >= C_SYNC_SOURCE &&15651565+ mdev->state.conn <= C_PAUSED_SYNC_T &&15661566+ e->flags & EE_MAY_SET_IN_SYNC) ?15671567+ P_RS_WRITE_ACK : P_WRITE_ACK;15681568+ ok &= drbd_send_ack(mdev, pcmd, e);15691569+ if (pcmd == P_RS_WRITE_ACK)15701570+ drbd_set_in_sync(mdev, sector, e->size);15711571+ } else {15721572+ ok = drbd_send_ack(mdev, P_NEG_ACK, e);15731573+ /* we expect it to be marked out of sync anyways...15741574+ * maybe assert this? */15751575+ }15761576+ dec_unacked(mdev);15771577+ }15781578+ /* we delete from the conflict detection hash _after_ we sent out the15791579+ * P_WRITE_ACK / P_NEG_ACK, to get the sequence number right. */15801580+ if (mdev->net_conf->two_primaries) {15811581+ spin_lock_irq(&mdev->req_lock);15821582+ D_ASSERT(!hlist_unhashed(&e->colision));15831583+ hlist_del_init(&e->colision);15841584+ spin_unlock_irq(&mdev->req_lock);15851585+ } else {15861586+ D_ASSERT(hlist_unhashed(&e->colision));15871587+ }15881588+15891589+ drbd_may_finish_epoch(mdev, e->epoch, EV_PUT + (cancel ? 
EV_CLEANUP : 0));15901590+15911591+ return ok;15921592+}15931593+15941594+static int e_send_discard_ack(struct drbd_conf *mdev, struct drbd_work *w, int unused)15951595+{15961596+ struct drbd_epoch_entry *e = (struct drbd_epoch_entry *)w;15971597+ int ok = 1;15981598+15991599+ D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);16001600+ ok = drbd_send_ack(mdev, P_DISCARD_ACK, e);16011601+16021602+ spin_lock_irq(&mdev->req_lock);16031603+ D_ASSERT(!hlist_unhashed(&e->colision));16041604+ hlist_del_init(&e->colision);16051605+ spin_unlock_irq(&mdev->req_lock);16061606+16071607+ dec_unacked(mdev);16081608+16091609+ return ok;16101610+}16111611+16121612+/* Called from receive_Data.16131613+ * Synchronize packets on sock with packets on msock.16141614+ *16151615+ * This is here so even when a P_DATA packet traveling via sock overtook an Ack16161616+ * packet traveling on msock, they are still processed in the order they have16171617+ * been sent.16181618+ *16191619+ * Note: we don't care for Ack packets overtaking P_DATA packets.16201620+ *16211621+ * In case packet_seq is larger than mdev->peer_seq number, there are16221622+ * outstanding packets on the msock. We wait for them to arrive.16231623+ * In case we are the logically next packet, we update mdev->peer_seq16241624+ * ourselves. Correctly handles 32bit wrap around.16251625+ *16261626+ * Assume we have a 10 GBit connection, that is about 1<<30 byte per second,16271627+ * about 1<<21 sectors per second. So "worst" case, we have 1<<3 == 8 seconds16281628+ * for the 24bit wrap (historical atomic_t guarantee on some archs), and we have16291629+ * 1<<9 == 512 seconds aka ages for the 32bit wrap around...16301630+ *16311631+ * returns 0 if we may process the packet,16321632+ * -ERESTARTSYS if we were interrupted (by disconnect signal). 
*/16331633+static int drbd_wait_peer_seq(struct drbd_conf *mdev, const u32 packet_seq)16341634+{16351635+ DEFINE_WAIT(wait);16361636+ unsigned int p_seq;16371637+ long timeout;16381638+ int ret = 0;16391639+ spin_lock(&mdev->peer_seq_lock);16401640+ for (;;) {16411641+ prepare_to_wait(&mdev->seq_wait, &wait, TASK_INTERRUPTIBLE);16421642+ if (seq_le(packet_seq, mdev->peer_seq+1))16431643+ break;16441644+ if (signal_pending(current)) {16451645+ ret = -ERESTARTSYS;16461646+ break;16471647+ }16481648+ p_seq = mdev->peer_seq;16491649+ spin_unlock(&mdev->peer_seq_lock);16501650+ timeout = schedule_timeout(30*HZ);16511651+ spin_lock(&mdev->peer_seq_lock);16521652+ if (timeout == 0 && p_seq == mdev->peer_seq) {16531653+ ret = -ETIMEDOUT;16541654+ dev_err(DEV, "ASSERT FAILED waited 30 seconds for sequence update, forcing reconnect\n");16551655+ break;16561656+ }16571657+ }16581658+ finish_wait(&mdev->seq_wait, &wait);16591659+ if (mdev->peer_seq+1 == packet_seq)16601660+ mdev->peer_seq++;16611661+ spin_unlock(&mdev->peer_seq_lock);16621662+ return ret;16631663+}16641664+16651665+/* mirrored write */16661666+static int receive_Data(struct drbd_conf *mdev, struct p_header *h)16671667+{16681668+ sector_t sector;16691669+ struct drbd_epoch_entry *e;16701670+ struct p_data *p = (struct p_data *)h;16711671+ int header_size, data_size;16721672+ int rw = WRITE;16731673+ u32 dp_flags;16741674+16751675+ header_size = sizeof(*p) - sizeof(*h);16761676+ data_size = h->length - header_size;16771677+16781678+ ERR_IF(data_size == 0) return FALSE;16791679+16801680+ if (drbd_recv(mdev, h->payload, header_size) != header_size)16811681+ return FALSE;16821682+16831683+ if (!get_ldev(mdev)) {16841684+ if (__ratelimit(&drbd_ratelimit_state))16851685+ dev_err(DEV, "Can not write mirrored data block "16861686+ "to local disk.\n");16871687+ spin_lock(&mdev->peer_seq_lock);16881688+ if (mdev->peer_seq+1 == be32_to_cpu(p->seq_num))16891689+ mdev->peer_seq++;16901690+ 
spin_unlock(&mdev->peer_seq_lock);16911691+16921692+ drbd_send_ack_dp(mdev, P_NEG_ACK, p);16931693+ atomic_inc(&mdev->current_epoch->epoch_size);16941694+ return drbd_drain_block(mdev, data_size);16951695+ }16961696+16971697+ /* get_ldev(mdev) successful.16981698+ * Corresponding put_ldev done either below (on various errors),16991699+ * or in drbd_endio_write_sec, if we successfully submit the data at17001700+ * the end of this function. */17011701+17021702+ sector = be64_to_cpu(p->sector);17031703+ e = read_in_block(mdev, p->block_id, sector, data_size);17041704+ if (!e) {17051705+ put_ldev(mdev);17061706+ return FALSE;17071707+ }17081708+17091709+ e->private_bio->bi_end_io = drbd_endio_write_sec;17101710+ e->w.cb = e_end_block;17111711+17121712+ spin_lock(&mdev->epoch_lock);17131713+ e->epoch = mdev->current_epoch;17141714+ atomic_inc(&e->epoch->epoch_size);17151715+ atomic_inc(&e->epoch->active);17161716+17171717+ if (mdev->write_ordering == WO_bio_barrier && atomic_read(&e->epoch->epoch_size) == 1) {17181718+ struct drbd_epoch *epoch;17191719+ /* Issue a barrier if we start a new epoch, and the previous epoch17201720+ was not a epoch containing a single request which already was17211721+ a Barrier. 
*/17221722+ epoch = list_entry(e->epoch->list.prev, struct drbd_epoch, list);17231723+ if (epoch == e->epoch) {17241724+ set_bit(DE_CONTAINS_A_BARRIER, &e->epoch->flags);17251725+ trace_drbd_epoch(mdev, e->epoch, EV_TRACE_ADD_BARRIER);17261726+ rw |= (1<<BIO_RW_BARRIER);17271727+ e->flags |= EE_IS_BARRIER;17281728+ } else {17291729+ if (atomic_read(&epoch->epoch_size) > 1 ||17301730+ !test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags)) {17311731+ set_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags);17321732+ trace_drbd_epoch(mdev, epoch, EV_TRACE_SETTING_BI);17331733+ set_bit(DE_CONTAINS_A_BARRIER, &e->epoch->flags);17341734+ trace_drbd_epoch(mdev, e->epoch, EV_TRACE_ADD_BARRIER);17351735+ rw |= (1<<BIO_RW_BARRIER);17361736+ e->flags |= EE_IS_BARRIER;17371737+ }17381738+ }17391739+ }17401740+ spin_unlock(&mdev->epoch_lock);17411741+17421742+ dp_flags = be32_to_cpu(p->dp_flags);17431743+ if (dp_flags & DP_HARDBARRIER) {17441744+ dev_err(DEV, "ASSERT FAILED would have submitted barrier request\n");17451745+ /* rw |= (1<<BIO_RW_BARRIER); */17461746+ }17471747+ if (dp_flags & DP_RW_SYNC)17481748+ rw |= (1<<BIO_RW_SYNCIO) | (1<<BIO_RW_UNPLUG);17491749+ if (dp_flags & DP_MAY_SET_IN_SYNC)17501750+ e->flags |= EE_MAY_SET_IN_SYNC;17511751+17521752+ /* I'm the receiver, I do hold a net_cnt reference. */17531753+ if (!mdev->net_conf->two_primaries) {17541754+ spin_lock_irq(&mdev->req_lock);17551755+ } else {17561756+ /* don't get the req_lock yet,17571757+ * we may sleep in drbd_wait_peer_seq */17581758+ const int size = e->size;17591759+ const int discard = test_bit(DISCARD_CONCURRENT, &mdev->flags);17601760+ DEFINE_WAIT(wait);17611761+ struct drbd_request *i;17621762+ struct hlist_node *n;17631763+ struct hlist_head *slot;17641764+ int first;17651765+17661766+ D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);17671767+ BUG_ON(mdev->ee_hash == NULL);17681768+ BUG_ON(mdev->tl_hash == NULL);17691769+17701770+ /* conflict detection and handling:17711771+ * 1. 
wait on the sequence number,17721772+ * in case this data packet overtook ACK packets.17731773+ * 2. check our hash tables for conflicting requests.17741774+ * we only need to walk the tl_hash, since an ee can not17751775+ * have a conflict with an other ee: on the submitting17761776+ * node, the corresponding req had already been conflicting,17771777+ * and a conflicting req is never sent.17781778+ *17791779+ * Note: for two_primaries, we are protocol C,17801780+ * so there cannot be any request that is DONE17811781+ * but still on the transfer log.17821782+ *17831783+ * unconditionally add to the ee_hash.17841784+ *17851785+ * if no conflicting request is found:17861786+ * submit.17871787+ *17881788+ * if any conflicting request is found17891789+ * that has not yet been acked,17901790+ * AND I have the "discard concurrent writes" flag:17911791+ * queue (via done_ee) the P_DISCARD_ACK; OUT.17921792+ *17931793+ * if any conflicting request is found:17941794+ * block the receiver, waiting on misc_wait17951795+ * until no more conflicting requests are there,17961796+ * or we get interrupted (disconnect).17971797+ *17981798+ * we do not just write after local io completion of those17991799+ * requests, but only after req is done completely, i.e.18001800+ * we wait for the P_DISCARD_ACK to arrive!18011801+ *18021802+ * then proceed normally, i.e. 
submit.18031803+ */18041804+ if (drbd_wait_peer_seq(mdev, be32_to_cpu(p->seq_num)))18051805+ goto out_interrupted;18061806+18071807+ spin_lock_irq(&mdev->req_lock);18081808+18091809+ hlist_add_head(&e->colision, ee_hash_slot(mdev, sector));18101810+18111811+#define OVERLAPS overlaps(i->sector, i->size, sector, size)18121812+ slot = tl_hash_slot(mdev, sector);18131813+ first = 1;18141814+ for (;;) {18151815+ int have_unacked = 0;18161816+ int have_conflict = 0;18171817+ prepare_to_wait(&mdev->misc_wait, &wait,18181818+ TASK_INTERRUPTIBLE);18191819+ hlist_for_each_entry(i, n, slot, colision) {18201820+ if (OVERLAPS) {18211821+ /* only ALERT on first iteration,18221822+ * we may be woken up early... */18231823+ if (first)18241824+ dev_alert(DEV, "%s[%u] Concurrent local write detected!"18251825+ " new: %llus +%u; pending: %llus +%u\n",18261826+ current->comm, current->pid,18271827+ (unsigned long long)sector, size,18281828+ (unsigned long long)i->sector, i->size);18291829+ if (i->rq_state & RQ_NET_PENDING)18301830+ ++have_unacked;18311831+ ++have_conflict;18321832+ }18331833+ }18341834+#undef OVERLAPS18351835+ if (!have_conflict)18361836+ break;18371837+18381838+ /* Discard Ack only for the _first_ iteration */18391839+ if (first && discard && have_unacked) {18401840+ dev_alert(DEV, "Concurrent write! 
+				  [DISCARD BY FLAG] sec=%llus\n",
+				  (unsigned long long)sector);
+				inc_unacked(mdev);
+				e->w.cb = e_send_discard_ack;
+				list_add_tail(&e->w.list, &mdev->done_ee);
+
+				spin_unlock_irq(&mdev->req_lock);
+
+				/* we could probably send that P_DISCARD_ACK ourselves,
+				 * but I don't like the receiver using the msock */
+
+				put_ldev(mdev);
+				wake_asender(mdev);
+				finish_wait(&mdev->misc_wait, &wait);
+				return TRUE;
+			}
+
+			if (signal_pending(current)) {
+				hlist_del_init(&e->colision);
+
+				spin_unlock_irq(&mdev->req_lock);
+
+				finish_wait(&mdev->misc_wait, &wait);
+				goto out_interrupted;
+			}
+
+			spin_unlock_irq(&mdev->req_lock);
+			if (first) {
+				first = 0;
+				dev_alert(DEV, "Concurrent write! [W AFTERWARDS] "
+				     "sec=%llus\n", (unsigned long long)sector);
+			} else if (discard) {
+				/* we had none on the first iteration.
+				 * there must be none now. */
+				D_ASSERT(have_unacked == 0);
+			}
+			schedule();
+			spin_lock_irq(&mdev->req_lock);
+		}
+		finish_wait(&mdev->misc_wait, &wait);
+	}
+
+	list_add(&e->w.list, &mdev->active_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	switch (mdev->net_conf->wire_protocol) {
+	case DRBD_PROT_C:
+		inc_unacked(mdev);
+		/* corresponding dec_unacked() in e_end_block()
+		 * respective _drbd_clear_done_ee */
+		break;
+	case DRBD_PROT_B:
+		/* I really don't like it that the receiver thread
+		 * sends on the msock, but anyways */
+		drbd_send_ack(mdev, P_RECV_ACK, e);
+		break;
+	case DRBD_PROT_A:
+		/* nothing to do */
+		break;
+	}
+
+	if (mdev->state.pdsk == D_DISKLESS) {
+		/* In case we have the only disk of the cluster, */
+		drbd_set_out_of_sync(mdev, e->sector, e->size);
+		e->flags |= EE_CALL_AL_COMPLETE_IO;
+		drbd_al_begin_io(mdev, e->sector);
+	}
+
+	e->private_bio->bi_rw = rw;
+	trace_drbd_ee(mdev, e, "submitting for (data)write");
+	trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);
+	drbd_generic_make_request(mdev, DRBD_FAULT_DT_WR, e->private_bio);
+	/* accounting done in endio */
+
+	maybe_kick_lo(mdev);
+	return TRUE;
+
+out_interrupted:
+	/* yes, the epoch_size now is imbalanced.
+	 * but we drop the connection anyways, so we don't have a chance to
+	 * receive a barrier... atomic_inc(&mdev->epoch_size); */
+	put_ldev(mdev);
+	drbd_free_ee(mdev, e);
+	return FALSE;
+}
+
+static int receive_DataRequest(struct drbd_conf *mdev, struct p_header *h)
+{
+	sector_t sector;
+	const sector_t capacity = drbd_get_capacity(mdev->this_bdev);
+	struct drbd_epoch_entry *e;
+	struct digest_info *di = NULL;
+	int size, digest_size;
+	unsigned int fault_type;
+	struct p_block_req *p =	(struct p_block_req *)h;
+	const int brps = sizeof(*p)-sizeof(*h);
+
+	if (drbd_recv(mdev, h->payload, brps) != brps)
+		return FALSE;
+
+	sector = be64_to_cpu(p->sector);
+	size   = be32_to_cpu(p->blksize);
+
+	if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_SEGMENT_SIZE) {
+		dev_err(DEV, "%s:%d: sector: %llus, size: %u\n", __FILE__, __LINE__,
+				(unsigned long long)sector, size);
+		return FALSE;
+	}
+	if (sector + (size>>9) > capacity) {
+		dev_err(DEV, "%s:%d: sector: %llus, size: %u\n", __FILE__, __LINE__,
+				(unsigned long long)sector, size);
+		return FALSE;
+	}
+
+	if (!get_ldev_if_state(mdev, D_UP_TO_DATE)) {
+		if (__ratelimit(&drbd_ratelimit_state))
+			dev_err(DEV, "Can not satisfy peer's read request, "
+			    "no local data.\n");
+		drbd_send_ack_rp(mdev, h->command == P_DATA_REQUEST ? P_NEG_DREPLY :
+				 P_NEG_RS_DREPLY , p);
+		return TRUE;
+	}
+
+	/* GFP_NOIO, because we must not cause arbitrary write-out: in a DRBD
+	 * "criss-cross" setup, that might cause write-out on some other DRBD,
+	 * which in turn might block on the other node at this very place. */
+	e = drbd_alloc_ee(mdev, p->block_id, sector, size, GFP_NOIO);
+	if (!e) {
+		put_ldev(mdev);
+		return FALSE;
+	}
+
+	e->private_bio->bi_rw = READ;
+	e->private_bio->bi_end_io = drbd_endio_read_sec;
+
+	switch (h->command) {
+	case P_DATA_REQUEST:
+		e->w.cb = w_e_end_data_req;
+		fault_type = DRBD_FAULT_DT_RD;
+		break;
+	case P_RS_DATA_REQUEST:
+		e->w.cb = w_e_end_rsdata_req;
+		fault_type = DRBD_FAULT_RS_RD;
+		/* Eventually this should become asynchronously. Currently it
+		 * blocks the whole receiver just to delay the reading of a
+		 * resync data block.
+		 * the drbd_work_queue mechanism is made for this...
+		 */
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted,
+			 * probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			goto out_free_e;
+		}
+		break;
+
+	case P_OV_REPLY:
+	case P_CSUM_RS_REQUEST:
+		fault_type = DRBD_FAULT_RS_RD;
+		digest_size = h->length - brps ;
+		di = kmalloc(sizeof(*di) + digest_size, GFP_NOIO);
+		if (!di)
+			goto out_free_e;
+
+		di->digest_size = digest_size;
+		di->digest = (((char *)di)+sizeof(struct digest_info));
+
+		if (drbd_recv(mdev, di->digest, digest_size) != digest_size)
+			goto out_free_e;
+
+		e->block_id = (u64)(unsigned long)di;
+		if (h->command == P_CSUM_RS_REQUEST) {
+			D_ASSERT(mdev->agreed_pro_version >= 89);
+			e->w.cb = w_e_end_csum_rs_req;
+		} else if (h->command == P_OV_REPLY) {
+			e->w.cb = w_e_end_ov_reply;
+			dec_rs_pending(mdev);
+			break;
+		}
+
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted, probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			goto out_free_e;
+		}
+		break;
+
+	case P_OV_REQUEST:
+		if (mdev->state.conn >= C_CONNECTED &&
+		    mdev->state.conn != C_VERIFY_T)
+			dev_warn(DEV, "ASSERT FAILED: got P_OV_REQUEST while being %s\n",
+				drbd_conn_str(mdev->state.conn));
+		if (mdev->ov_start_sector == ~(sector_t)0 &&
+		    mdev->agreed_pro_version >= 90) {
+			mdev->ov_start_sector = sector;
+			mdev->ov_position = sector;
+			mdev->ov_left = mdev->rs_total - BM_SECT_TO_BIT(sector);
+			dev_info(DEV, "Online Verify start sector: %llu\n",
+					(unsigned long long)sector);
+		}
+		e->w.cb = w_e_end_ov_req;
+		fault_type = DRBD_FAULT_RS_RD;
+		/* Eventually this should become asynchronous. Currently it
+		 * blocks the whole receiver just to delay the reading of a
+		 * resync data block.
+		 * the drbd_work_queue mechanism is made for this...
+		 */
+		if (!drbd_rs_begin_io(mdev, sector)) {
+			/* we have been interrupted,
+			 * probably connection lost! */
+			D_ASSERT(signal_pending(current));
+			goto out_free_e;
+		}
+		break;
+
+
+	default:
+		dev_err(DEV, "unexpected command (%s) in receive_DataRequest\n",
+		    cmdname(h->command));
+		fault_type = DRBD_FAULT_MAX;
+	}
+
+	spin_lock_irq(&mdev->req_lock);
+	list_add(&e->w.list, &mdev->read_ee);
+	spin_unlock_irq(&mdev->req_lock);
+
+	inc_unacked(mdev);
+
+	trace_drbd_ee(mdev, e, "submitting for read");
+	trace_drbd_bio(mdev, "Sec", e->private_bio, 0, NULL);
+	drbd_generic_make_request(mdev, fault_type, e->private_bio);
+	maybe_kick_lo(mdev);
+
+	return TRUE;
+
+out_free_e:
+	kfree(di);
+	put_ldev(mdev);
+	drbd_free_ee(mdev, e);
+	return FALSE;
+}
+
+static int drbd_asb_recover_0p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, rv = -100;
+	unsigned long ch_self, ch_peer;
+
+	self = mdev->ldev->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	ch_peer = mdev->p_uuid[UI_SIZE];
+	ch_self = mdev->comm_bm_set;
+
+	switch (mdev->net_conf->after_sb_0p) {
+	case ASB_CONSENSUS:
+	case ASB_DISCARD_SECONDARY:
+	case ASB_CALL_HELPER:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_DISCARD_YOUNGER_PRI:
+		if (self == 0 && peer == 1) {
+			rv = -1;
+			break;
+		}
+		if (self == 1 && peer == 0) {
+			rv =  1;
+			break;
+		}
+		/* Else fall through to one of the other strategies... */
+	case ASB_DISCARD_OLDER_PRI:
+		if (self == 0 && peer == 1) {
+			rv = 1;
+			break;
+		}
+		if (self == 1 && peer == 0) {
+			rv = -1;
+			break;
+		}
+		/* Else fall through to one of the other strategies... */
+		dev_warn(DEV, "Discard younger/older primary did not found a decision\n"
+		     "Using discard-least-changes instead\n");
+	case ASB_DISCARD_ZERO_CHG:
+		if (ch_peer == 0 && ch_self == 0) {
+			rv = test_bit(DISCARD_CONCURRENT, &mdev->flags)
+				? -1 : 1;
+			break;
+		} else {
+			if (ch_peer == 0) { rv =  1; break; }
+			if (ch_self == 0) { rv = -1; break; }
+		}
+		if (mdev->net_conf->after_sb_0p == ASB_DISCARD_ZERO_CHG)
+			break;
+	case ASB_DISCARD_LEAST_CHG:
+		if	(ch_self < ch_peer)
+			rv = -1;
+		else if (ch_self > ch_peer)
+			rv =  1;
+		else /* ( ch_self == ch_peer ) */
+		     /* Well, then use something else. */
+			rv = test_bit(DISCARD_CONCURRENT, &mdev->flags)
+				? -1 : 1;
+		break;
+	case ASB_DISCARD_LOCAL:
+		rv = -1;
+		break;
+	case ASB_DISCARD_REMOTE:
+		rv =  1;
+	}
+
+	return rv;
+}
+
+static int drbd_asb_recover_1p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, hg, rv = -100;
+
+	self = mdev->ldev->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	switch (mdev->net_conf->after_sb_1p) {
+	case ASB_DISCARD_YOUNGER_PRI:
+	case ASB_DISCARD_OLDER_PRI:
+	case ASB_DISCARD_LEAST_CHG:
+	case ASB_DISCARD_LOCAL:
+	case ASB_DISCARD_REMOTE:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_CONSENSUS:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1 && mdev->state.role == R_SECONDARY)
+			rv = hg;
+		if (hg == 1  && mdev->state.role == R_PRIMARY)
+			rv = hg;
+		break;
+	case ASB_VIOLENTLY:
+		rv = drbd_asb_recover_0p(mdev);
+		break;
+	case ASB_DISCARD_SECONDARY:
+		return mdev->state.role == R_PRIMARY ? 1 : -1;
+	case ASB_CALL_HELPER:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1 && mdev->state.role == R_PRIMARY) {
+			self = drbd_set_role(mdev, R_SECONDARY, 0);
+			/* drbd_change_state() does not sleep while in SS_IN_TRANSIENT_STATE,
+			 * we might be here in C_WF_REPORT_PARAMS which is transient.
+			 * we do not need to wait for the after state change work either. */
+			self = drbd_change_state(mdev, CS_VERBOSE, NS(role, R_SECONDARY));
+			if (self != SS_SUCCESS) {
+				drbd_khelper(mdev, "pri-lost-after-sb");
+			} else {
+				dev_warn(DEV, "Successfully gave up primary role.\n");
+				rv = hg;
+			}
+		} else
+			rv = hg;
+	}
+
+	return rv;
+}
+
+static int drbd_asb_recover_2p(struct drbd_conf *mdev) __must_hold(local)
+{
+	int self, peer, hg, rv = -100;
+
+	self = mdev->ldev->md.uuid[UI_BITMAP] & 1;
+	peer = mdev->p_uuid[UI_BITMAP] & 1;
+
+	switch (mdev->net_conf->after_sb_2p) {
+	case ASB_DISCARD_YOUNGER_PRI:
+	case ASB_DISCARD_OLDER_PRI:
+	case ASB_DISCARD_LEAST_CHG:
+	case ASB_DISCARD_LOCAL:
+	case ASB_DISCARD_REMOTE:
+	case ASB_CONSENSUS:
+	case ASB_DISCARD_SECONDARY:
+		dev_err(DEV, "Configuration error.\n");
+		break;
+	case ASB_VIOLENTLY:
+		rv = drbd_asb_recover_0p(mdev);
+		break;
+	case ASB_DISCONNECT:
+		break;
+	case ASB_CALL_HELPER:
+		hg = drbd_asb_recover_0p(mdev);
+		if (hg == -1) {
+			/* drbd_change_state() does not sleep while in SS_IN_TRANSIENT_STATE,
+			 * we might be here in C_WF_REPORT_PARAMS which is transient.
+			 * we do not need to wait for the after state change work either. */
+			self = drbd_change_state(mdev, CS_VERBOSE, NS(role, R_SECONDARY));
+			if (self != SS_SUCCESS) {
+				drbd_khelper(mdev, "pri-lost-after-sb");
+			} else {
+				dev_warn(DEV, "Successfully gave up primary role.\n");
+				rv = hg;
+			}
+		} else
+			rv = hg;
+	}
+
+	return rv;
+}
+
+static void drbd_uuid_dump(struct drbd_conf *mdev, char *text, u64 *uuid,
+			   u64 bits, u64 flags)
+{
+	if (!uuid) {
+		dev_info(DEV, "%s uuid info vanished while I was looking!\n", text);
+		return;
+	}
+	dev_info(DEV, "%s %016llX:%016llX:%016llX:%016llX bits:%llu flags:%llX\n",
+	     text,
+	     (unsigned long long)uuid[UI_CURRENT],
+	     (unsigned long long)uuid[UI_BITMAP],
+	     (unsigned long long)uuid[UI_HISTORY_START],
+	     (unsigned long long)uuid[UI_HISTORY_END],
+	     (unsigned long long)bits,
+	     (unsigned long long)flags);
+}
+
+/*
+  100	after split brain try auto recover
+    2	C_SYNC_SOURCE set BitMap
+    1	C_SYNC_SOURCE use BitMap
+    0	no Sync
+   -1	C_SYNC_TARGET use BitMap
+   -2	C_SYNC_TARGET set BitMap
+ -100	after split brain, disconnect
+-1000	unrelated data
+ */
+static int drbd_uuid_compare(struct drbd_conf *mdev, int *rule_nr) __must_hold(local)
+{
+	u64 self, peer;
+	int i, j;
+
+	self = mdev->ldev->md.uuid[UI_CURRENT] & ~((u64)1);
+	peer = mdev->p_uuid[UI_CURRENT] & ~((u64)1);
+
+	*rule_nr = 10;
+	if (self == UUID_JUST_CREATED && peer == UUID_JUST_CREATED)
+		return 0;
+
+	*rule_nr = 20;
+	if ((self == UUID_JUST_CREATED || self == (u64)0) &&
+	     peer != UUID_JUST_CREATED)
+		return -2;
+
+	*rule_nr = 30;
+	if (self != UUID_JUST_CREATED &&
+	    (peer == UUID_JUST_CREATED || peer == (u64)0))
+		return 2;
+
+	if (self == peer) {
+		int rct, dc; /* roles at crash time */
+
+		if (mdev->p_uuid[UI_BITMAP] == (u64)0 && mdev->ldev->md.uuid[UI_BITMAP] != (u64)0) {
+
+			if (mdev->agreed_pro_version < 91)
+				return -1001;
+
+			if ((mdev->ldev->md.uuid[UI_BITMAP] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START] & ~((u64)1)) &&
+			    (mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START + 1] & ~((u64)1))) {
+				dev_info(DEV, "was SyncSource, missed the resync finished event, corrected myself:\n");
+				drbd_uuid_set_bm(mdev, 0UL);
+
+				drbd_uuid_dump(mdev, "self", mdev->ldev->md.uuid,
+					       mdev->state.disk >= D_NEGOTIATING ? drbd_bm_total_weight(mdev) : 0, 0);
+				*rule_nr = 34;
+			} else {
+				dev_info(DEV, "was SyncSource (peer failed to write sync_uuid)\n");
+				*rule_nr = 36;
+			}
+
+			return 1;
+		}
+
+		if (mdev->ldev->md.uuid[UI_BITMAP] == (u64)0 && mdev->p_uuid[UI_BITMAP] != (u64)0) {
+
+			if (mdev->agreed_pro_version < 91)
+				return -1001;
+
+			if ((mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1)) == (mdev->p_uuid[UI_BITMAP] & ~((u64)1)) &&
+			    (mdev->ldev->md.uuid[UI_HISTORY_START + 1] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START] & ~((u64)1))) {
+				dev_info(DEV, "was SyncTarget, peer missed the resync finished event, corrected peer:\n");
+
+				mdev->p_uuid[UI_HISTORY_START + 1] = mdev->p_uuid[UI_HISTORY_START];
+				mdev->p_uuid[UI_HISTORY_START] = mdev->p_uuid[UI_BITMAP];
+				mdev->p_uuid[UI_BITMAP] = 0UL;
+
+				drbd_uuid_dump(mdev, "peer", mdev->p_uuid, mdev->p_uuid[UI_SIZE], mdev->p_uuid[UI_FLAGS]);
+				*rule_nr = 35;
+			} else {
+				dev_info(DEV, "was SyncTarget (failed to write sync_uuid)\n");
+				*rule_nr = 37;
+			}
+
+			return -1;
+		}
+
+		/* Common power [off|failure] */
+		rct = (test_bit(CRASHED_PRIMARY, &mdev->flags) ? 1 : 0) +
+			(mdev->p_uuid[UI_FLAGS] & 2);
+		/* lowest bit is set when we were primary,
+		 * next bit (weight 2) is set when peer was primary */
+		*rule_nr = 40;
+
+		switch (rct) {
+		case 0: /* !self_pri && !peer_pri */ return 0;
+		case 1: /*  self_pri && !peer_pri */ return 1;
+		case 2: /* !self_pri &&  peer_pri */ return -1;
+		case 3: /*  self_pri &&  peer_pri */
+			dc = test_bit(DISCARD_CONCURRENT, &mdev->flags);
+			return dc ? -1 : 1;
+		}
+	}
+
+	*rule_nr = 50;
+	peer = mdev->p_uuid[UI_BITMAP] & ~((u64)1);
+	if (self == peer)
+		return -1;
+
+	*rule_nr = 51;
+	peer = mdev->p_uuid[UI_HISTORY_START] & ~((u64)1);
+	if (self == peer) {
+		self = mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1);
+		peer = mdev->p_uuid[UI_HISTORY_START + 1] & ~((u64)1);
+		if (self == peer) {
+			/* The last P_SYNC_UUID did not get though. Undo the last start of
+			   resync as sync source modifications of the peer's UUIDs. */
+
+			if (mdev->agreed_pro_version < 91)
+				return -1001;
+
+			mdev->p_uuid[UI_BITMAP] = mdev->p_uuid[UI_HISTORY_START];
+			mdev->p_uuid[UI_HISTORY_START] = mdev->p_uuid[UI_HISTORY_START + 1];
+			return -1;
+		}
+	}
+
+	*rule_nr = 60;
+	self = mdev->ldev->md.uuid[UI_CURRENT] & ~((u64)1);
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		peer = mdev->p_uuid[i] & ~((u64)1);
+		if (self == peer)
+			return -2;
+	}
+
+	*rule_nr = 70;
+	self = mdev->ldev->md.uuid[UI_BITMAP] & ~((u64)1);
+	peer = mdev->p_uuid[UI_CURRENT] & ~((u64)1);
+	if (self == peer)
+		return 1;
+
+	*rule_nr = 71;
+	self = mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1);
+	if (self == peer) {
+		self = mdev->ldev->md.uuid[UI_HISTORY_START + 1] & ~((u64)1);
+		peer = mdev->p_uuid[UI_HISTORY_START] & ~((u64)1);
+		if (self == peer) {
+			/* The last P_SYNC_UUID did not get though. Undo the last start of
+			   resync as sync source modifications of our UUIDs. */
+
+			if (mdev->agreed_pro_version < 91)
+				return -1001;
+
+			_drbd_uuid_set(mdev, UI_BITMAP, mdev->ldev->md.uuid[UI_HISTORY_START]);
+			_drbd_uuid_set(mdev, UI_HISTORY_START, mdev->ldev->md.uuid[UI_HISTORY_START + 1]);
+
+			dev_info(DEV, "Undid last start of resync:\n");
+
+			drbd_uuid_dump(mdev, "self", mdev->ldev->md.uuid,
+				       mdev->state.disk >= D_NEGOTIATING ? drbd_bm_total_weight(mdev) : 0, 0);
+
+			return 1;
+		}
+	}
+
+
+	*rule_nr = 80;
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		self = mdev->ldev->md.uuid[i] & ~((u64)1);
+		if (self == peer)
+			return 2;
+	}
+
+	*rule_nr = 90;
+	self = mdev->ldev->md.uuid[UI_BITMAP] & ~((u64)1);
+	peer = mdev->p_uuid[UI_BITMAP] & ~((u64)1);
+	if (self == peer && self != ((u64)0))
+		return 100;
+
+	*rule_nr = 100;
+	for (i = UI_HISTORY_START; i <= UI_HISTORY_END; i++) {
+		self = mdev->ldev->md.uuid[i] & ~((u64)1);
+		for (j = UI_HISTORY_START; j <= UI_HISTORY_END; j++) {
+			peer = mdev->p_uuid[j] & ~((u64)1);
+			if (self == peer)
+				return -100;
+		}
+	}
+
+	return -1000;
+}
+
+/* drbd_sync_handshake() returns the new conn state on success, or
+   CONN_MASK (-1) on failure.
+ */
+static enum drbd_conns drbd_sync_handshake(struct drbd_conf *mdev, enum drbd_role peer_role,
+					   enum drbd_disk_state peer_disk) __must_hold(local)
+{
+	int hg, rule_nr;
+	enum drbd_conns rv = C_MASK;
+	enum drbd_disk_state mydisk;
+
+	mydisk = mdev->state.disk;
+	if (mydisk == D_NEGOTIATING)
+		mydisk = mdev->new_state_tmp.disk;
+
+	dev_info(DEV, "drbd_sync_handshake:\n");
+	drbd_uuid_dump(mdev, "self", mdev->ldev->md.uuid, mdev->comm_bm_set, 0);
+	drbd_uuid_dump(mdev, "peer", mdev->p_uuid,
+		       mdev->p_uuid[UI_SIZE], mdev->p_uuid[UI_FLAGS]);
+
+	hg = drbd_uuid_compare(mdev, &rule_nr);
+
+	dev_info(DEV, "uuid_compare()=%d by rule %d\n", hg, rule_nr);
+
+	if (hg == -1000) {
+		dev_alert(DEV, "Unrelated data, aborting!\n");
+		return C_MASK;
+	}
+	if (hg == -1001) {
+		dev_alert(DEV, "To resolve this both sides have to support at least protocol\n");
+		return C_MASK;
+	}
+
+	if ((mydisk == D_INCONSISTENT && peer_disk > D_INCONSISTENT) ||
+	    (peer_disk == D_INCONSISTENT && mydisk > D_INCONSISTENT)) {
+		int f = (hg == -100) || abs(hg) == 2;
+		hg = mydisk > D_INCONSISTENT ? 1 : -1;
+		if (f)
+			hg = hg*2;
+		dev_info(DEV, "Becoming sync %s due to disk states.\n",
+		     hg > 0 ? "source" : "target");
+	}
+
+	if (hg == 100 || (hg == -100 && mdev->net_conf->always_asbp)) {
+		int pcount = (mdev->state.role == R_PRIMARY)
+			   + (peer_role == R_PRIMARY);
+		int forced = (hg == -100);
+
+		switch (pcount) {
+		case 0:
+			hg = drbd_asb_recover_0p(mdev);
+			break;
+		case 1:
+			hg = drbd_asb_recover_1p(mdev);
+			break;
+		case 2:
+			hg = drbd_asb_recover_2p(mdev);
+			break;
+		}
+		if (abs(hg) < 100) {
+			dev_warn(DEV, "Split-Brain detected, %d primaries, "
+			     "automatically solved. Sync from %s node\n",
+			     pcount, (hg < 0) ? "peer" : "this");
+			if (forced) {
+				dev_warn(DEV, "Doing a full sync, since"
+				     " UUIDs where ambiguous.\n");
+				hg = hg*2;
+			}
+		}
+	}
+
+	if (hg == -100) {
+		if (mdev->net_conf->want_lose && !(mdev->p_uuid[UI_FLAGS]&1))
+			hg = -1;
+		if (!mdev->net_conf->want_lose && (mdev->p_uuid[UI_FLAGS]&1))
+			hg = 1;
+
+		if (abs(hg) < 100)
+			dev_warn(DEV, "Split-Brain detected, manually solved. "
+			     "Sync from %s node\n",
+			     (hg < 0) ? "peer" : "this");
+	}
+
+	if (hg == -100) {
+		dev_alert(DEV, "Split-Brain detected, dropping connection!\n");
+		drbd_khelper(mdev, "split-brain");
+		return C_MASK;
+	}
+
+	if (hg > 0 && mydisk <= D_INCONSISTENT) {
+		dev_err(DEV, "I shall become SyncSource, but I am inconsistent!\n");
+		return C_MASK;
+	}
+
+	if (hg < 0 && /* by intention we do not use mydisk here. */
+	    mdev->state.role == R_PRIMARY && mdev->state.disk >= D_CONSISTENT) {
+		switch (mdev->net_conf->rr_conflict) {
+		case ASB_CALL_HELPER:
+			drbd_khelper(mdev, "pri-lost");
+			/* fall through */
+		case ASB_DISCONNECT:
+			dev_err(DEV, "I shall become SyncTarget, but I am primary!\n");
+			return C_MASK;
+		case ASB_VIOLENTLY:
+			dev_warn(DEV, "Becoming SyncTarget, violating the stable-data"
+			     "assumption\n");
+		}
+	}
+
+	if (abs(hg) >= 2) {
+		dev_info(DEV, "Writing the whole bitmap, full sync required after drbd_sync_handshake.\n");
+		if (drbd_bitmap_io(mdev, &drbd_bmio_set_n_write, "set_n_write from sync_handshake"))
+			return C_MASK;
+	}
+
+	if (hg > 0) { /* become sync source. */
+		rv = C_WF_BITMAP_S;
+	} else if (hg < 0) { /* become sync target */
+		rv = C_WF_BITMAP_T;
+	} else {
+		rv = C_CONNECTED;
+		if (drbd_bm_total_weight(mdev)) {
+			dev_info(DEV, "No resync, but %lu bits in bitmap!\n",
+			     drbd_bm_total_weight(mdev));
+		}
+	}
+
+	return rv;
+}
+
+/* returns 1 if invalid */
+static int cmp_after_sb(enum drbd_after_sb_p peer, enum drbd_after_sb_p self)
+{
+	/* ASB_DISCARD_REMOTE - ASB_DISCARD_LOCAL is valid */
+	if ((peer == ASB_DISCARD_REMOTE && self == ASB_DISCARD_LOCAL) ||
+	    (self == ASB_DISCARD_REMOTE && peer == ASB_DISCARD_LOCAL))
+		return 0;
+
+	/* any other things with ASB_DISCARD_REMOTE or ASB_DISCARD_LOCAL are invalid */
+	if (peer == ASB_DISCARD_REMOTE || peer == ASB_DISCARD_LOCAL ||
+	    self == ASB_DISCARD_REMOTE || self == ASB_DISCARD_LOCAL)
+		return 1;
+
+	/* everything else is valid if they are equal on both sides. */
+	if (peer == self)
+		return 0;
+
+	/* everything es is invalid. */
+	return 1;
+}
+
+static int receive_protocol(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_protocol *p = (struct p_protocol *)h;
+	int header_size, data_size;
+	int p_proto, p_after_sb_0p, p_after_sb_1p, p_after_sb_2p;
+	int p_want_lose, p_two_primaries;
+	char p_integrity_alg[SHARED_SECRET_MAX] = "";
+
+	header_size = sizeof(*p) - sizeof(*h);
+	data_size = h->length - header_size;
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	p_proto		= be32_to_cpu(p->protocol);
+	p_after_sb_0p	= be32_to_cpu(p->after_sb_0p);
+	p_after_sb_1p	= be32_to_cpu(p->after_sb_1p);
+	p_after_sb_2p	= be32_to_cpu(p->after_sb_2p);
+	p_want_lose	= be32_to_cpu(p->want_lose);
+	p_two_primaries = be32_to_cpu(p->two_primaries);
+
+	if (p_proto != mdev->net_conf->wire_protocol) {
+		dev_err(DEV, "incompatible communication protocols\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_0p, mdev->net_conf->after_sb_0p)) {
+		dev_err(DEV, "incompatible after-sb-0pri settings\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_1p, mdev->net_conf->after_sb_1p)) {
+		dev_err(DEV, "incompatible after-sb-1pri settings\n");
+		goto disconnect;
+	}
+
+	if (cmp_after_sb(p_after_sb_2p, mdev->net_conf->after_sb_2p)) {
+		dev_err(DEV, "incompatible after-sb-2pri settings\n");
+		goto disconnect;
+	}
+
+	if (p_want_lose && mdev->net_conf->want_lose) {
+		dev_err(DEV, "both sides have the 'want_lose' flag set\n");
+		goto disconnect;
+	}
+
+	if (p_two_primaries != mdev->net_conf->two_primaries) {
+		dev_err(DEV, "incompatible setting of the two-primaries options\n");
+		goto disconnect;
+	}
+
+	if (mdev->agreed_pro_version >= 87) {
+		unsigned char *my_alg = mdev->net_conf->integrity_alg;
+
+		if (drbd_recv(mdev, p_integrity_alg, data_size) != data_size)
+			return FALSE;
+
+		p_integrity_alg[SHARED_SECRET_MAX-1] = 0;
+		if (strcmp(p_integrity_alg, my_alg)) {
+			dev_err(DEV, "incompatible setting of the data-integrity-alg\n");
+			goto disconnect;
+		}
+		dev_info(DEV, "data-integrity-alg: %s\n",
+		     my_alg[0] ? my_alg : (unsigned char *)"<not-used>");
+	}
+
+	return TRUE;
+
+disconnect:
+	drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	return FALSE;
+}
+
+/* helper function
+ * input: alg name, feature name
+ * return: NULL (alg name was "")
+ *         ERR_PTR(error) if something goes wrong
+ *         or the crypto hash ptr, if it worked out ok. */
+struct crypto_hash *drbd_crypto_alloc_digest_safe(const struct drbd_conf *mdev,
+		const char *alg, const char *name)
+{
+	struct crypto_hash *tfm;
+
+	if (!alg[0])
+		return NULL;
+
+	tfm = crypto_alloc_hash(alg, 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm)) {
+		dev_err(DEV, "Can not allocate \"%s\" as %s (reason: %ld)\n",
+			alg, name, PTR_ERR(tfm));
+		return tfm;
+	}
+	if (!drbd_crypto_is_hash(crypto_hash_tfm(tfm))) {
+		crypto_free_hash(tfm);
+		dev_err(DEV, "\"%s\" is not a digest (%s)\n", alg, name);
+		return ERR_PTR(-EINVAL);
+	}
+	return tfm;
+}
+
+static int receive_SyncParam(struct drbd_conf *mdev, struct p_header *h)
+{
+	int ok = TRUE;
+	struct p_rs_param_89 *p = (struct p_rs_param_89 *)h;
+	unsigned int header_size, data_size, exp_max_sz;
+	struct crypto_hash *verify_tfm = NULL;
+	struct crypto_hash *csums_tfm = NULL;
+	const int apv = mdev->agreed_pro_version;
+
+	exp_max_sz  = apv <= 87 ? sizeof(struct p_rs_param)
+		    : apv == 88 ? sizeof(struct p_rs_param)
+					+ SHARED_SECRET_MAX
+		    : /* 89 */    sizeof(struct p_rs_param_89);
+
+	if (h->length > exp_max_sz) {
+		dev_err(DEV, "SyncParam packet too long: received %u, expected <= %u bytes\n",
+		    h->length, exp_max_sz);
+		return FALSE;
+	}
+
+	if (apv <= 88) {
+		header_size = sizeof(struct p_rs_param) - sizeof(*h);
+		data_size   = h->length  - header_size;
+	} else /* apv >= 89 */ {
+		header_size = sizeof(struct p_rs_param_89) - sizeof(*h);
+		data_size   = h->length  - header_size;
+		D_ASSERT(data_size == 0);
+	}
+
+	/* initialize verify_alg and csums_alg */
+	memset(p->verify_alg, 0, 2 * SHARED_SECRET_MAX);
+
+	if (drbd_recv(mdev, h->payload, header_size) != header_size)
+		return FALSE;
+
+	mdev->sync_conf.rate	  = be32_to_cpu(p->rate);
+
+	if (apv >= 88) {
+		if (apv == 88) {
+			if (data_size > SHARED_SECRET_MAX) {
+				dev_err(DEV, "verify-alg too long, "
+				    "peer wants %u, accepting only %u byte\n",
+						data_size, SHARED_SECRET_MAX);
+				return FALSE;
+			}
+
+			if (drbd_recv(mdev, p->verify_alg, data_size) != data_size)
+				return FALSE;
+
+			/* we expect NUL terminated string */
+			/* but just in case someone tries to be evil */
+			D_ASSERT(p->verify_alg[data_size-1] == 0);
+			p->verify_alg[data_size-1] = 0;
+
+		} else /* apv >= 89 */ {
+			/* we still expect NUL terminated strings */
+			/* but just in case someone tries to be evil */
+			D_ASSERT(p->verify_alg[SHARED_SECRET_MAX-1] == 0);
+			D_ASSERT(p->csums_alg[SHARED_SECRET_MAX-1] == 0);
+			p->verify_alg[SHARED_SECRET_MAX-1] = 0;
+			p->csums_alg[SHARED_SECRET_MAX-1] = 0;
+		}
+
+		if (strcmp(mdev->sync_conf.verify_alg, p->verify_alg)) {
+			if (mdev->state.conn == C_WF_REPORT_PARAMS) {
+				dev_err(DEV, "Different verify-alg settings. me=\"%s\" peer=\"%s\"\n",
+				    mdev->sync_conf.verify_alg, p->verify_alg);
+				goto disconnect;
+			}
+			verify_tfm = drbd_crypto_alloc_digest_safe(mdev,
+					p->verify_alg, "verify-alg");
+			if (IS_ERR(verify_tfm)) {
+				verify_tfm = NULL;
+				goto disconnect;
+			}
+		}
+
+		if (apv >= 89 && strcmp(mdev->sync_conf.csums_alg, p->csums_alg)) {
+			if (mdev->state.conn == C_WF_REPORT_PARAMS) {
+				dev_err(DEV, "Different csums-alg settings. me=\"%s\" peer=\"%s\"\n",
+				    mdev->sync_conf.csums_alg, p->csums_alg);
+				goto disconnect;
+			}
+			csums_tfm = drbd_crypto_alloc_digest_safe(mdev,
+					p->csums_alg, "csums-alg");
+			if (IS_ERR(csums_tfm)) {
+				csums_tfm = NULL;
+				goto disconnect;
+			}
+		}
+
+
+		spin_lock(&mdev->peer_seq_lock);
+		/* lock against drbd_nl_syncer_conf() */
+		if (verify_tfm) {
+			strcpy(mdev->sync_conf.verify_alg, p->verify_alg);
+			mdev->sync_conf.verify_alg_len = strlen(p->verify_alg) + 1;
+			crypto_free_hash(mdev->verify_tfm);
+			mdev->verify_tfm = verify_tfm;
+			dev_info(DEV, "using verify-alg: \"%s\"\n", p->verify_alg);
+		}
+		if (csums_tfm) {
+			strcpy(mdev->sync_conf.csums_alg, p->csums_alg);
+			mdev->sync_conf.csums_alg_len = strlen(p->csums_alg) + 1;
+			crypto_free_hash(mdev->csums_tfm);
+			mdev->csums_tfm = csums_tfm;
+			dev_info(DEV, "using csums-alg: \"%s\"\n", p->csums_alg);
+		}
+		spin_unlock(&mdev->peer_seq_lock);
+	}
+
+	return ok;
+disconnect:
+	/* just for completeness: actually not needed,
+	 * as this is not reached if csums_tfm was ok. */
+	crypto_free_hash(csums_tfm);
+	/* but free the verify_tfm again, if csums_tfm did not work out */
+	crypto_free_hash(verify_tfm);
+	drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+	return FALSE;
+}
+
+static void drbd_setup_order_type(struct drbd_conf *mdev, int peer)
+{
+	/* sorry, we currently have no working implementation
+	 * of distributed TCQ */
+}
+
+/* warn if the arguments differ by more than 12.5% */
+static void warn_if_differ_considerably(struct drbd_conf *mdev,
+	const char *s, sector_t a, sector_t b)
+{
+	sector_t d;
+	if (a == 0 || b == 0)
+		return;
+	d = (a > b) ? (a - b) : (b - a);
+	if (d > (a>>3) || d > (b>>3))
+		dev_warn(DEV, "Considerable difference in %s: %llus vs. %llus\n", s,
+		     (unsigned long long)a, (unsigned long long)b);
+}
+
+static int receive_sizes(struct drbd_conf *mdev, struct p_header *h)
+{
+	struct p_sizes *p = (struct p_sizes *)h;
+	enum determine_dev_size dd = unchanged;
+	unsigned int max_seg_s;
+	sector_t p_size, p_usize, my_usize;
+	int ldsc = 0; /* local disk size changed */
+	enum drbd_conns nconn;
+
+	ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;
+	if (drbd_recv(mdev, h->payload, h->length) != h->length)
+		return FALSE;
+
+	p_size = be64_to_cpu(p->d_size);
+	p_usize = be64_to_cpu(p->u_size);
+
+	if (p_size == 0 && mdev->state.disk == D_DISKLESS) {
+		dev_err(DEV, "some backing storage is needed\n");
+		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+		return FALSE;
+	}
+
+	/* just store the peer's disk size for now.
+	 * we still need to figure out whether we accept that. */
+	mdev->p_size = p_size;
+
+#define min_not_zero(l, r) (l == 0) ? r : ((r == 0) ? l : min(l, r))
+	if (get_ldev(mdev)) {
+		warn_if_differ_considerably(mdev, "lower level device sizes",
+			   p_size, drbd_get_max_capacity(mdev->ldev));
+		warn_if_differ_considerably(mdev, "user requested size",
+					    p_usize, mdev->ldev->dc.disk_size);
+
+		/* if this is the first connect, or an otherwise expected
+		 * param exchange, choose the minimum */
+		if (mdev->state.conn == C_WF_REPORT_PARAMS)
+			p_usize = min_not_zero((sector_t)mdev->ldev->dc.disk_size,
+					     p_usize);
+
+		my_usize = mdev->ldev->dc.disk_size;
+
+		if (mdev->ldev->dc.disk_size != p_usize) {
+			mdev->ldev->dc.disk_size = p_usize;
+			dev_info(DEV, "Peer sets u_size to %lu sectors\n",
+			     (unsigned long)mdev->ldev->dc.disk_size);
+		}
+
+		/* Never shrink a device with usable data during connect.
+		   But allow online shrinking if we are connected. */
+		if (drbd_new_dev_size(mdev, mdev->ldev) <
+		   drbd_get_capacity(mdev->this_bdev) &&
+		   mdev->state.disk >= D_OUTDATED &&
+		   mdev->state.conn < C_CONNECTED) {
+			dev_err(DEV, "The peer's disk size is too small!\n");
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			mdev->ldev->dc.disk_size = my_usize;
+			put_ldev(mdev);
+			return FALSE;
+		}
+		put_ldev(mdev);
+	}
+#undef min_not_zero
+
+	if (get_ldev(mdev)) {
+		dd = drbd_determin_dev_size(mdev);
+		put_ldev(mdev);
+		if (dd == dev_size_error)
+			return FALSE;
+		drbd_md_sync(mdev);
+	} else {
+		/* I am diskless, need to accept the peer's size. */
+		drbd_set_my_capacity(mdev, p_size);
+	}
+
+	if (mdev->p_uuid && mdev->state.conn <= C_CONNECTED && get_ldev(mdev)) {
+		nconn = drbd_sync_handshake(mdev,
+				mdev->state.peer, mdev->state.pdsk);
+		put_ldev(mdev);
+
+		if (nconn == C_MASK) {
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			return FALSE;
+		}
+
+		if (drbd_request_state(mdev, NS(conn, nconn)) < SS_SUCCESS) {
+			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
+			return FALSE;
+		}
+	}
+
+	if (get_ldev(mdev)) {
+		if (mdev->ldev->known_size != drbd_get_capacity(mdev->ldev->backing_bdev)) {
+			mdev->ldev->known_size = drbd_get_capacity(mdev->ldev->backing_bdev);
+			ldsc = 1;
+		}
+
+		max_seg_s = be32_to_cpu(p->max_segment_size);
+		if (max_seg_s != queue_max_segment_size(mdev->rq_queue))
+			drbd_setup_queue_param(mdev, max_seg_s);
+
+		drbd_setup_order_type(mdev, be32_to_cpu(p->queue_order_type));
+		put_ldev(mdev);
+	}
+
+	if (mdev->state.conn > C_WF_REPORT_PARAMS) {
+		if (be64_to_cpu(p->c_size) !=
+		    drbd_get_capacity(mdev->this_bdev) || ldsc) {
+			/* we have different sizes, probably peer
+			 * needs to know my new size...
*/29482948+ drbd_send_sizes(mdev, 0);29492949+ }29502950+ if (test_and_clear_bit(RESIZE_PENDING, &mdev->flags) ||29512951+ (dd == grew && mdev->state.conn == C_CONNECTED)) {29522952+ if (mdev->state.pdsk >= D_INCONSISTENT &&29532953+ mdev->state.disk >= D_INCONSISTENT)29542954+ resync_after_online_grow(mdev);29552955+ else29562956+ set_bit(RESYNC_AFTER_NEG, &mdev->flags);29572957+ }29582958+ }29592959+29602960+ return TRUE;29612961+}29622962+29632963+static int receive_uuids(struct drbd_conf *mdev, struct p_header *h)29642964+{29652965+ struct p_uuids *p = (struct p_uuids *)h;29662966+ u64 *p_uuid;29672967+ int i;29682968+29692969+ ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;29702970+ if (drbd_recv(mdev, h->payload, h->length) != h->length)29712971+ return FALSE;29722972+29732973+ p_uuid = kmalloc(sizeof(u64)*UI_EXTENDED_SIZE, GFP_NOIO);29742974+29752975+ for (i = UI_CURRENT; i < UI_EXTENDED_SIZE; i++)29762976+ p_uuid[i] = be64_to_cpu(p->uuid[i]);29772977+29782978+ kfree(mdev->p_uuid);29792979+ mdev->p_uuid = p_uuid;29802980+29812981+ if (mdev->state.conn < C_CONNECTED &&29822982+ mdev->state.disk < D_INCONSISTENT &&29832983+ mdev->state.role == R_PRIMARY &&29842984+ (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] & ~((u64)1))) {29852985+ dev_err(DEV, "Can only connect to data with current UUID=%016llX\n",29862986+ (unsigned long long)mdev->ed_uuid);29872987+ drbd_force_state(mdev, NS(conn, C_DISCONNECTING));29882988+ return FALSE;29892989+ }29902990+29912991+ if (get_ldev(mdev)) {29922992+ int skip_initial_sync =29932993+ mdev->state.conn == C_CONNECTED &&29942994+ mdev->agreed_pro_version >= 90 &&29952995+ mdev->ldev->md.uuid[UI_CURRENT] == UUID_JUST_CREATED &&29962996+ (p_uuid[UI_FLAGS] & 8);29972997+ if (skip_initial_sync) {29982998+ dev_info(DEV, "Accepted new current UUID, preparing to skip initial sync\n");29992999+ drbd_bitmap_io(mdev, &drbd_bmio_clear_n_write,30003000+ "clear_n_write from receive_uuids");30013001+ _drbd_uuid_set(mdev, 
UI_CURRENT, p_uuid[UI_CURRENT]);30023002+ _drbd_uuid_set(mdev, UI_BITMAP, 0);30033003+ _drbd_set_state(_NS2(mdev, disk, D_UP_TO_DATE, pdsk, D_UP_TO_DATE),30043004+ CS_VERBOSE, NULL);30053005+ drbd_md_sync(mdev);30063006+ }30073007+ put_ldev(mdev);30083008+ }30093009+30103010+ /* Before we test for the disk state, we should wait until an eventually30113011+ ongoing cluster wide state change is finished. That is important if30123012+ we are primary and are detaching from our disk. We need to see the30133013+ new disk state... */30143014+ wait_event(mdev->misc_wait, !test_bit(CLUSTER_ST_CHANGE, &mdev->flags));30153015+ if (mdev->state.conn >= C_CONNECTED && mdev->state.disk < D_INCONSISTENT)30163016+ drbd_set_ed_uuid(mdev, p_uuid[UI_CURRENT]);30173017+30183018+ return TRUE;30193019+}30203020+30213021+/**30223022+ * convert_state() - Converts the peer's view of the cluster state to our point of view30233023+ * @ps: The state as seen by the peer.30243024+ */30253025+static union drbd_state convert_state(union drbd_state ps)30263026+{30273027+ union drbd_state ms;30283028+30293029+ static enum drbd_conns c_tab[] = {30303030+ [C_CONNECTED] = C_CONNECTED,30313031+30323032+ [C_STARTING_SYNC_S] = C_STARTING_SYNC_T,30333033+ [C_STARTING_SYNC_T] = C_STARTING_SYNC_S,30343034+ [C_DISCONNECTING] = C_TEAR_DOWN, /* C_NETWORK_FAILURE, */30353035+ [C_VERIFY_S] = C_VERIFY_T,30363036+ [C_MASK] = C_MASK,30373037+ };30383038+30393039+ ms.i = ps.i;30403040+30413041+ ms.conn = c_tab[ps.conn];30423042+ ms.peer = ps.role;30433043+ ms.role = ps.peer;30443044+ ms.pdsk = ps.disk;30453045+ ms.disk = ps.pdsk;30463046+ ms.peer_isp = (ps.aftr_isp | ps.user_isp);30473047+30483048+ return ms;30493049+}30503050+30513051+static int receive_req_state(struct drbd_conf *mdev, struct p_header *h)30523052+{30533053+ struct p_req_state *p = (struct p_req_state *)h;30543054+ union drbd_state mask, val;30553055+ int rv;30563056+30573057+ ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;30583058+ if 
(drbd_recv(mdev, h->payload, h->length) != h->length)30593059+ return FALSE;30603060+30613061+ mask.i = be32_to_cpu(p->mask);30623062+ val.i = be32_to_cpu(p->val);30633063+30643064+ if (test_bit(DISCARD_CONCURRENT, &mdev->flags) &&30653065+ test_bit(CLUSTER_ST_CHANGE, &mdev->flags)) {30663066+ drbd_send_sr_reply(mdev, SS_CONCURRENT_ST_CHG);30673067+ return TRUE;30683068+ }30693069+30703070+ mask = convert_state(mask);30713071+ val = convert_state(val);30723072+30733073+ rv = drbd_change_state(mdev, CS_VERBOSE, mask, val);30743074+30753075+ drbd_send_sr_reply(mdev, rv);30763076+ drbd_md_sync(mdev);30773077+30783078+ return TRUE;30793079+}30803080+30813081+static int receive_state(struct drbd_conf *mdev, struct p_header *h)30823082+{30833083+ struct p_state *p = (struct p_state *)h;30843084+ enum drbd_conns nconn, oconn;30853085+ union drbd_state ns, peer_state;30863086+ enum drbd_disk_state real_peer_disk;30873087+ int rv;30883088+30893089+ ERR_IF(h->length != (sizeof(*p)-sizeof(*h)))30903090+ return FALSE;30913091+30923092+ if (drbd_recv(mdev, h->payload, h->length) != h->length)30933093+ return FALSE;30943094+30953095+ peer_state.i = be32_to_cpu(p->state);30963096+30973097+ real_peer_disk = peer_state.disk;30983098+ if (peer_state.disk == D_NEGOTIATING) {30993099+ real_peer_disk = mdev->p_uuid[UI_FLAGS] & 4 ? 
D_INCONSISTENT : D_CONSISTENT;31003100+ dev_info(DEV, "real peer disk state = %s\n", drbd_disk_str(real_peer_disk));31013101+ }31023102+31033103+ spin_lock_irq(&mdev->req_lock);31043104+ retry:31053105+ oconn = nconn = mdev->state.conn;31063106+ spin_unlock_irq(&mdev->req_lock);31073107+31083108+ if (nconn == C_WF_REPORT_PARAMS)31093109+ nconn = C_CONNECTED;31103110+31113111+ if (mdev->p_uuid && peer_state.disk >= D_NEGOTIATING &&31123112+ get_ldev_if_state(mdev, D_NEGOTIATING)) {31133113+ int cr; /* consider resync */31143114+31153115+ /* if we established a new connection */31163116+ cr = (oconn < C_CONNECTED);31173117+ /* if we had an established connection31183118+ * and one of the nodes newly attaches a disk */31193119+ cr |= (oconn == C_CONNECTED &&31203120+ (peer_state.disk == D_NEGOTIATING ||31213121+ mdev->state.disk == D_NEGOTIATING));31223122+ /* if we have both been inconsistent, and the peer has been31233123+ * forced to be UpToDate with --overwrite-data */31243124+ cr |= test_bit(CONSIDER_RESYNC, &mdev->flags);31253125+ /* if we had been plain connected, and the admin requested to31263126+ * start a sync by "invalidate" or "invalidate-remote" */31273127+ cr |= (oconn == C_CONNECTED &&31283128+ (peer_state.conn >= C_STARTING_SYNC_S &&31293129+ peer_state.conn <= C_WF_BITMAP_T));31303130+31313131+ if (cr)31323132+ nconn = drbd_sync_handshake(mdev, peer_state.role, real_peer_disk);31333133+31343134+ put_ldev(mdev);31353135+ if (nconn == C_MASK) {31363136+ if (mdev->state.disk == D_NEGOTIATING) {31373137+ drbd_force_state(mdev, NS(disk, D_DISKLESS));31383138+ nconn = C_CONNECTED;31393139+ } else if (peer_state.disk == D_NEGOTIATING) {31403140+ dev_err(DEV, "Disk attach process on the peer node was aborted.\n");31413141+ peer_state.disk = D_DISKLESS;31423142+ } else {31433143+ D_ASSERT(oconn == C_WF_REPORT_PARAMS);31443144+ drbd_force_state(mdev, NS(conn, C_DISCONNECTING));31453145+ return FALSE;31463146+ }31473147+ }31483148+ }31493149+31503150+ 
spin_lock_irq(&mdev->req_lock);31513151+ if (mdev->state.conn != oconn)31523152+ goto retry;31533153+ clear_bit(CONSIDER_RESYNC, &mdev->flags);31543154+ ns.i = mdev->state.i;31553155+ ns.conn = nconn;31563156+ ns.peer = peer_state.role;31573157+ ns.pdsk = real_peer_disk;31583158+ ns.peer_isp = (peer_state.aftr_isp | peer_state.user_isp);31593159+ if ((nconn == C_CONNECTED || nconn == C_WF_BITMAP_S) && ns.disk == D_NEGOTIATING)31603160+ ns.disk = mdev->new_state_tmp.disk;31613161+31623162+ rv = _drbd_set_state(mdev, ns, CS_VERBOSE | CS_HARD, NULL);31633163+ ns = mdev->state;31643164+ spin_unlock_irq(&mdev->req_lock);31653165+31663166+ if (rv < SS_SUCCESS) {31673167+ drbd_force_state(mdev, NS(conn, C_DISCONNECTING));31683168+ return FALSE;31693169+ }31703170+31713171+ if (oconn > C_WF_REPORT_PARAMS) {31723172+ if (nconn > C_CONNECTED && peer_state.conn <= C_CONNECTED &&31733173+ peer_state.disk != D_NEGOTIATING ) {31743174+ /* we want resync, peer has not yet decided to sync... */31753175+ /* Nowadays only used when forcing a node into primary role and31763176+ setting its disk to UpToDate with that */31773177+ drbd_send_uuids(mdev);31783178+ drbd_send_state(mdev);31793179+ }31803180+ }31813181+31823182+ mdev->net_conf->want_lose = 0;31833183+31843184+ drbd_md_sync(mdev); /* update connected indicator, la_size, ... 
*/31853185+31863186+ return TRUE;31873187+}31883188+31893189+static int receive_sync_uuid(struct drbd_conf *mdev, struct p_header *h)31903190+{31913191+ struct p_rs_uuid *p = (struct p_rs_uuid *)h;31923192+31933193+ wait_event(mdev->misc_wait,31943194+ mdev->state.conn == C_WF_SYNC_UUID ||31953195+ mdev->state.conn < C_CONNECTED ||31963196+ mdev->state.disk < D_NEGOTIATING);31973197+31983198+ /* D_ASSERT( mdev->state.conn == C_WF_SYNC_UUID ); */31993199+32003200+ ERR_IF(h->length != (sizeof(*p)-sizeof(*h))) return FALSE;32013201+ if (drbd_recv(mdev, h->payload, h->length) != h->length)32023202+ return FALSE;32033203+32043204+ /* Here the _drbd_uuid_ functions are right, current should32053205+ _not_ be rotated into the history */32063206+ if (get_ldev_if_state(mdev, D_NEGOTIATING)) {32073207+ _drbd_uuid_set(mdev, UI_CURRENT, be64_to_cpu(p->uuid));32083208+ _drbd_uuid_set(mdev, UI_BITMAP, 0UL);32093209+32103210+ drbd_start_resync(mdev, C_SYNC_TARGET);32113211+32123212+ put_ldev(mdev);32133213+ } else32143214+ dev_err(DEV, "Ignoring SyncUUID packet!\n");32153215+32163216+ return TRUE;32173217+}32183218+32193219+enum receive_bitmap_ret { OK, DONE, FAILED };32203220+32213221+static enum receive_bitmap_ret32223222+receive_bitmap_plain(struct drbd_conf *mdev, struct p_header *h,32233223+ unsigned long *buffer, struct bm_xfer_ctx *c)32243224+{32253225+ unsigned num_words = min_t(size_t, BM_PACKET_WORDS, c->bm_words - c->word_offset);32263226+ unsigned want = num_words * sizeof(long);32273227+32283228+ if (want != h->length) {32293229+ dev_err(DEV, "%s:want (%u) != h->length (%u)\n", __func__, want, h->length);32303230+ return FAILED;32313231+ }32323232+ if (want == 0)32333233+ return DONE;32343234+ if (drbd_recv(mdev, buffer, want) != want)32353235+ return FAILED;32363236+32373237+ drbd_bm_merge_lel(mdev, c->word_offset, num_words, buffer);32383238+32393239+ c->word_offset += num_words;32403240+ c->bit_offset = c->word_offset * BITS_PER_LONG;32413241+ if (c->bit_offset > 
c->bm_bits)32423242+ c->bit_offset = c->bm_bits;32433243+32443244+ return OK;32453245+}32463246+32473247+static enum receive_bitmap_ret32483248+recv_bm_rle_bits(struct drbd_conf *mdev,32493249+ struct p_compressed_bm *p,32503250+ struct bm_xfer_ctx *c)32513251+{32523252+ struct bitstream bs;32533253+ u64 look_ahead;32543254+ u64 rl;32553255+ u64 tmp;32563256+ unsigned long s = c->bit_offset;32573257+ unsigned long e;32583258+ int len = p->head.length - (sizeof(*p) - sizeof(p->head));32593259+ int toggle = DCBP_get_start(p);32603260+ int have;32613261+ int bits;32623262+32633263+ bitstream_init(&bs, p->code, len, DCBP_get_pad_bits(p));32643264+32653265+ bits = bitstream_get_bits(&bs, &look_ahead, 64);32663266+ if (bits < 0)32673267+ return FAILED;32683268+32693269+ for (have = bits; have > 0; s += rl, toggle = !toggle) {32703270+ bits = vli_decode_bits(&rl, look_ahead);32713271+ if (bits <= 0)32723272+ return FAILED;32733273+32743274+ if (toggle) {32753275+ e = s + rl -1;32763276+ if (e >= c->bm_bits) {32773277+ dev_err(DEV, "bitmap overflow (e:%lu) while decoding bm RLE packet\n", e);32783278+ return FAILED;32793279+ }32803280+ _drbd_bm_set_bits(mdev, s, e);32813281+ }32823282+32833283+ if (have < bits) {32843284+ dev_err(DEV, "bitmap decoding error: h:%d b:%d la:0x%08llx l:%u/%u\n",32853285+ have, bits, look_ahead,32863286+ (unsigned int)(bs.cur.b - p->code),32873287+ (unsigned int)bs.buf_len);32883288+ return FAILED;32893289+ }32903290+ look_ahead >>= bits;32913291+ have -= bits;32923292+32933293+ bits = bitstream_get_bits(&bs, &tmp, 64 - have);32943294+ if (bits < 0)32953295+ return FAILED;32963296+ look_ahead |= tmp << have;32973297+ have += bits;32983298+ }32993299+33003300+ c->bit_offset = s;33013301+ bm_xfer_ctx_bit_to_word_offset(c);33023302+33033303+ return (s == c->bm_bits) ? 
DONE : OK;33043304+}33053305+33063306+static enum receive_bitmap_ret33073307+decode_bitmap_c(struct drbd_conf *mdev,33083308+ struct p_compressed_bm *p,33093309+ struct bm_xfer_ctx *c)33103310+{33113311+ if (DCBP_get_code(p) == RLE_VLI_Bits)33123312+ return recv_bm_rle_bits(mdev, p, c);33133313+33143314+ /* other variants had been implemented for evaluation,33153315+ * but have been dropped as this one turned out to be "best"33163316+ * during all our tests. */33173317+33183318+ dev_err(DEV, "receive_bitmap_c: unknown encoding %u\n", p->encoding);33193319+ drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));33203320+ return FAILED;33213321+}33223322+33233323+void INFO_bm_xfer_stats(struct drbd_conf *mdev,33243324+ const char *direction, struct bm_xfer_ctx *c)33253325+{33263326+ /* what would it take to transfer it "plaintext" */33273327+ unsigned plain = sizeof(struct p_header) *33283328+ ((c->bm_words+BM_PACKET_WORDS-1)/BM_PACKET_WORDS+1)33293329+ + c->bm_words * sizeof(long);33303330+ unsigned total = c->bytes[0] + c->bytes[1];33313331+ unsigned r;33323332+33333333+ /* total can not be zero. but just in case: */33343334+ if (total == 0)33353335+ return;33363336+33373337+ /* don't report if not compressed */33383338+ if (total >= plain)33393339+ return;33403340+33413341+ /* total < plain. check for overflow, still */33423342+ r = (total > UINT_MAX/1000) ? (total / (plain/1000))33433343+ : (1000 * total / plain);33443344+33453345+ if (r > 1000)33463346+ r = 1000;33473347+33483348+ r = 1000 - r;33493349+ dev_info(DEV, "%s bitmap stats [Bytes(packets)]: plain %u(%u), RLE %u(%u), "33503350+ "total %u; compression: %u.%u%%\n",33513351+ direction,33523352+ c->bytes[1], c->packets[1],33533353+ c->bytes[0], c->packets[0],33543354+ total, r/10, r % 10);33553355+}33563356+33573357+/* Since we are processing the bitfield from lower addresses to higher,33583358+ it does not matter if the process it in 32 bit chunks or 64 bit33593359+ chunks as long as it is little endian. 
(Understand it as byte stream,33603360+ beginning with the lowest byte...) If we would use big endian33613361+ we would need to process it from the highest address to the lowest,33623362+ in order to be agnostic to the 32 vs 64 bits issue.33633363+33643364+ returns 0 on failure, 1 if we successfully received it. */33653365+static int receive_bitmap(struct drbd_conf *mdev, struct p_header *h)33663366+{33673367+ struct bm_xfer_ctx c;33683368+ void *buffer;33693369+ enum receive_bitmap_ret ret;33703370+ int ok = FALSE;33713371+33723372+ wait_event(mdev->misc_wait, !atomic_read(&mdev->ap_bio_cnt));33733373+33743374+ drbd_bm_lock(mdev, "receive bitmap");33753375+33763376+ /* maybe we should use some per thread scratch page,33773377+ * and allocate that during initial device creation? */33783378+ buffer = (unsigned long *) __get_free_page(GFP_NOIO);33793379+ if (!buffer) {33803380+ dev_err(DEV, "failed to allocate one page buffer in %s\n", __func__);33813381+ goto out;33823382+ }33833383+33843384+ c = (struct bm_xfer_ctx) {33853385+ .bm_bits = drbd_bm_bits(mdev),33863386+ .bm_words = drbd_bm_words(mdev),33873387+ };33883388+33893389+ do {33903390+ if (h->command == P_BITMAP) {33913391+ ret = receive_bitmap_plain(mdev, h, buffer, &c);33923392+ } else if (h->command == P_COMPRESSED_BITMAP) {33933393+ /* MAYBE: sanity check that we speak proto >= 90,33943394+ * and the feature is enabled! 
*/33953395+ struct p_compressed_bm *p;33963396+33973397+ if (h->length > BM_PACKET_PAYLOAD_BYTES) {33983398+ dev_err(DEV, "ReportCBitmap packet too large\n");33993399+ goto out;34003400+ }34013401+ /* use the page buff */34023402+ p = buffer;34033403+ memcpy(p, h, sizeof(*h));34043404+ if (drbd_recv(mdev, p->head.payload, h->length) != h->length)34053405+ goto out;34063406+ if (p->head.length <= (sizeof(*p) - sizeof(p->head))) {34073407+ dev_err(DEV, "ReportCBitmap packet too small (l:%u)\n", p->head.length);34083408+ return FAILED;34093409+ }34103410+ ret = decode_bitmap_c(mdev, p, &c);34113411+ } else {34123412+ dev_warn(DEV, "receive_bitmap: h->command neither ReportBitMap nor ReportCBitMap (is 0x%x)", h->command);34133413+ goto out;34143414+ }34153415+34163416+ c.packets[h->command == P_BITMAP]++;34173417+ c.bytes[h->command == P_BITMAP] += sizeof(struct p_header) + h->length;34183418+34193419+ if (ret != OK)34203420+ break;34213421+34223422+ if (!drbd_recv_header(mdev, h))34233423+ goto out;34243424+ } while (ret == OK);34253425+ if (ret == FAILED)34263426+ goto out;34273427+34283428+ INFO_bm_xfer_stats(mdev, "receive", &c);34293429+34303430+ if (mdev->state.conn == C_WF_BITMAP_T) {34313431+ ok = !drbd_send_bitmap(mdev);34323432+ if (!ok)34333433+ goto out;34343434+ /* Omit CS_ORDERED with this state transition to avoid deadlocks. 
*/34353435+ ok = _drbd_request_state(mdev, NS(conn, C_WF_SYNC_UUID), CS_VERBOSE);34363436+ D_ASSERT(ok == SS_SUCCESS);34373437+ } else if (mdev->state.conn != C_WF_BITMAP_S) {34383438+ /* admin may have requested C_DISCONNECTING,34393439+ * other threads may have noticed network errors */34403440+ dev_info(DEV, "unexpected cstate (%s) in receive_bitmap\n",34413441+ drbd_conn_str(mdev->state.conn));34423442+ }34433443+34443444+ ok = TRUE;34453445+ out:34463446+ drbd_bm_unlock(mdev);34473447+ if (ok && mdev->state.conn == C_WF_BITMAP_S)34483448+ drbd_start_resync(mdev, C_SYNC_SOURCE);34493449+ free_page((unsigned long) buffer);34503450+ return ok;34513451+}34523452+34533453+static int receive_skip(struct drbd_conf *mdev, struct p_header *h)34543454+{34553455+ /* TODO zero copy sink :) */34563456+ static char sink[128];34573457+ int size, want, r;34583458+34593459+ dev_warn(DEV, "skipping unknown optional packet type %d, l: %d!\n",34603460+ h->command, h->length);34613461+34623462+ size = h->length;34633463+ while (size > 0) {34643464+ want = min_t(int, size, sizeof(sink));34653465+ r = drbd_recv(mdev, sink, want);34663466+ ERR_IF(r <= 0) break;34673467+ size -= r;34683468+ }34693469+ return size == 0;34703470+}34713471+34723472+static int receive_UnplugRemote(struct drbd_conf *mdev, struct p_header *h)34733473+{34743474+ if (mdev->state.disk >= D_INCONSISTENT)34753475+ drbd_kick_lo(mdev);34763476+34773477+ /* Make sure we've acked all the TCP data associated34783478+ * with the data requests being unplugged */34793479+ drbd_tcp_quickack(mdev->data.socket);34803480+34813481+ return TRUE;34823482+}34833483+34843484+typedef int (*drbd_cmd_handler_f)(struct drbd_conf *, struct p_header *);34853485+34863486+static drbd_cmd_handler_f drbd_default_handler[] = {34873487+ [P_DATA] = receive_Data,34883488+ [P_DATA_REPLY] = receive_DataReply,34893489+ [P_RS_DATA_REPLY] = receive_RSDataReply,34903490+ [P_BARRIER] = receive_Barrier,34913491+ [P_BITMAP] = receive_bitmap,34923492+ 
[P_COMPRESSED_BITMAP] = receive_bitmap,34933493+ [P_UNPLUG_REMOTE] = receive_UnplugRemote,34943494+ [P_DATA_REQUEST] = receive_DataRequest,34953495+ [P_RS_DATA_REQUEST] = receive_DataRequest,34963496+ [P_SYNC_PARAM] = receive_SyncParam,34973497+ [P_SYNC_PARAM89] = receive_SyncParam,34983498+ [P_PROTOCOL] = receive_protocol,34993499+ [P_UUIDS] = receive_uuids,35003500+ [P_SIZES] = receive_sizes,35013501+ [P_STATE] = receive_state,35023502+ [P_STATE_CHG_REQ] = receive_req_state,35033503+ [P_SYNC_UUID] = receive_sync_uuid,35043504+ [P_OV_REQUEST] = receive_DataRequest,35053505+ [P_OV_REPLY] = receive_DataRequest,35063506+ [P_CSUM_RS_REQUEST] = receive_DataRequest,35073507+ /* anything missing from this table is in35083508+ * the asender_tbl, see get_asender_cmd */35093509+ [P_MAX_CMD] = NULL,35103510+};35113511+35123512+static drbd_cmd_handler_f *drbd_cmd_handler = drbd_default_handler;35133513+static drbd_cmd_handler_f *drbd_opt_cmd_handler;35143514+35153515+static void drbdd(struct drbd_conf *mdev)35163516+{35173517+ drbd_cmd_handler_f handler;35183518+ struct p_header *header = &mdev->data.rbuf.header;35193519+35203520+ while (get_t_state(&mdev->receiver) == Running) {35213521+ drbd_thread_current_set_cpu(mdev);35223522+ if (!drbd_recv_header(mdev, header))35233523+ break;35243524+35253525+ if (header->command < P_MAX_CMD)35263526+ handler = drbd_cmd_handler[header->command];35273527+ else if (P_MAY_IGNORE < header->command35283528+ && header->command < P_MAX_OPT_CMD)35293529+ handler = drbd_opt_cmd_handler[header->command-P_MAY_IGNORE];35303530+ else if (header->command > P_MAX_OPT_CMD)35313531+ handler = receive_skip;35323532+ else35333533+ handler = NULL;35343534+35353535+ if (unlikely(!handler)) {35363536+ dev_err(DEV, "unknown packet type %d, l: %d!\n",35373537+ header->command, header->length);35383538+ drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));35393539+ break;35403540+ }35413541+ if (unlikely(!handler(mdev, header))) {35423542+ dev_err(DEV, "error 
receiving %s, l: %d!\n",35433543+ cmdname(header->command), header->length);35443544+ drbd_force_state(mdev, NS(conn, C_PROTOCOL_ERROR));35453545+ break;35463546+ }35473547+35483548+ trace_drbd_packet(mdev, mdev->data.socket, 2, &mdev->data.rbuf,35493549+ __FILE__, __LINE__);35503550+ }35513551+}35523552+35533553+static void drbd_fail_pending_reads(struct drbd_conf *mdev)35543554+{35553555+ struct hlist_head *slot;35563556+ struct hlist_node *pos;35573557+ struct hlist_node *tmp;35583558+ struct drbd_request *req;35593559+ int i;35603560+35613561+ /*35623562+ * Application READ requests35633563+ */35643564+ spin_lock_irq(&mdev->req_lock);35653565+ for (i = 0; i < APP_R_HSIZE; i++) {35663566+ slot = mdev->app_reads_hash+i;35673567+ hlist_for_each_entry_safe(req, pos, tmp, slot, colision) {35683568+ /* it may (but should not any longer!)35693569+ * be on the work queue; if that assert triggers,35703570+ * we need to also grab the35713571+ * spin_lock_irq(&mdev->data.work.q_lock);35723572+ * and list_del_init here. */35733573+ D_ASSERT(list_empty(&req->w.list));35743574+ /* It would be nice to complete outside of spinlock.35753575+ * But this is easier for now. 
*/35763576+ _req_mod(req, connection_lost_while_pending);35773577+ }35783578+ }35793579+ for (i = 0; i < APP_R_HSIZE; i++)35803580+ if (!hlist_empty(mdev->app_reads_hash+i))35813581+ dev_warn(DEV, "ASSERT FAILED: app_reads_hash[%d].first: "35823582+ "%p, should be NULL\n", i, mdev->app_reads_hash[i].first);35833583+35843584+ memset(mdev->app_reads_hash, 0, APP_R_HSIZE*sizeof(void *));35853585+ spin_unlock_irq(&mdev->req_lock);35863586+}35873587+35883588+void drbd_flush_workqueue(struct drbd_conf *mdev)35893589+{35903590+ struct drbd_wq_barrier barr;35913591+35923592+ barr.w.cb = w_prev_work_done;35933593+ init_completion(&barr.done);35943594+ drbd_queue_work(&mdev->data.work, &barr.w);35953595+ wait_for_completion(&barr.done);35963596+}35973597+35983598+static void drbd_disconnect(struct drbd_conf *mdev)35993599+{36003600+ enum drbd_fencing_p fp;36013601+ union drbd_state os, ns;36023602+ int rv = SS_UNKNOWN_ERROR;36033603+ unsigned int i;36043604+36053605+ if (mdev->state.conn == C_STANDALONE)36063606+ return;36073607+ if (mdev->state.conn >= C_WF_CONNECTION)36083608+ dev_err(DEV, "ASSERT FAILED cstate = %s, expected < WFConnection\n",36093609+ drbd_conn_str(mdev->state.conn));36103610+36113611+ /* asender does not clean up anything. 
it must not interfere, either */36123612+ drbd_thread_stop(&mdev->asender);36133613+36143614+ mutex_lock(&mdev->data.mutex);36153615+ drbd_free_sock(mdev);36163616+ mutex_unlock(&mdev->data.mutex);36173617+36183618+ spin_lock_irq(&mdev->req_lock);36193619+ _drbd_wait_ee_list_empty(mdev, &mdev->active_ee);36203620+ _drbd_wait_ee_list_empty(mdev, &mdev->sync_ee);36213621+ _drbd_wait_ee_list_empty(mdev, &mdev->read_ee);36223622+ spin_unlock_irq(&mdev->req_lock);36233623+36243624+ /* We do not have data structures that would allow us to36253625+ * get the rs_pending_cnt down to 0 again.36263626+ * * On C_SYNC_TARGET we do not have any data structures describing36273627+ * the pending RSDataRequest's we have sent.36283628+ * * On C_SYNC_SOURCE there is no data structure that tracks36293629+ * the P_RS_DATA_REPLY blocks that we sent to the SyncTarget.36303630+ * And no, it is not the sum of the reference counts in the36313631+ * resync_LRU. The resync_LRU tracks the whole operation including36323632+ * the disk-IO, while the rs_pending_cnt only tracks the blocks36333633+ * on the fly. */36343634+ drbd_rs_cancel_all(mdev);36353635+ mdev->rs_total = 0;36363636+ mdev->rs_failed = 0;36373637+ atomic_set(&mdev->rs_pending_cnt, 0);36383638+ wake_up(&mdev->misc_wait);36393639+36403640+ /* make sure syncer is stopped and w_resume_next_sg queued */36413641+ del_timer_sync(&mdev->resync_timer);36423642+ set_bit(STOP_SYNC_TIMER, &mdev->flags);36433643+ resync_timer_fn((unsigned long)mdev);36443644+36453645+ /* so we can be sure that all remote or resync reads36463646+ * made it at least to net_ee */36473647+ wait_event(mdev->misc_wait, !atomic_read(&mdev->local_cnt));36483648+36493649+ /* wait for all w_e_end_data_req, w_e_end_rsdata_req, w_send_barrier,36503650+ * w_make_resync_request etc. which may still be on the worker queue36513651+ * to be "canceled" */36523652+ drbd_flush_workqueue(mdev);36533653+36543654+ /* This also does reclaim_net_ee(). 
	 * If we do this too early, we might
	 * miss some resync ee and pages.*/
	drbd_process_done_ee(mdev);

	kfree(mdev->p_uuid);
	mdev->p_uuid = NULL;

	if (!mdev->state.susp)
		tl_clear(mdev);

	drbd_fail_pending_reads(mdev);

	dev_info(DEV, "Connection closed\n");

	drbd_md_sync(mdev);

	fp = FP_DONT_CARE;
	if (get_ldev(mdev)) {
		fp = mdev->ldev->dc.fencing;
		put_ldev(mdev);
	}

	if (mdev->state.role == R_PRIMARY) {
		if (fp >= FP_RESOURCE && mdev->state.pdsk >= D_UNKNOWN) {
			enum drbd_disk_state nps = drbd_try_outdate_peer(mdev);
			drbd_request_state(mdev, NS(pdsk, nps));
		}
	}

	spin_lock_irq(&mdev->req_lock);
	os = mdev->state;
	if (os.conn >= C_UNCONNECTED) {
		/* Do not restart in case we are C_DISCONNECTING */
		ns = os;
		ns.conn = C_UNCONNECTED;
		rv = _drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
	}
	spin_unlock_irq(&mdev->req_lock);

	if (os.conn == C_DISCONNECTING) {
		struct hlist_head *h;
		wait_event(mdev->misc_wait, atomic_read(&mdev->net_cnt) == 0);

		/* we must not free the tl_hash
		 * while application io is still on the fly */
		wait_event(mdev->misc_wait, atomic_read(&mdev->ap_bio_cnt) == 0);

		spin_lock_irq(&mdev->req_lock);
		/* paranoia code */
		for (h = mdev->ee_hash; h < mdev->ee_hash + mdev->ee_hash_s; h++)
			if (h->first)
				dev_err(DEV, "ASSERT FAILED ee_hash[%u].first == %p, expected NULL\n",
					(int)(h - mdev->ee_hash), h->first);
		kfree(mdev->ee_hash);
		mdev->ee_hash = NULL;
		mdev->ee_hash_s = 0;

		/* paranoia code */
		for (h = mdev->tl_hash; h < mdev->tl_hash + mdev->tl_hash_s; h++)
			if (h->first)
				dev_err(DEV, "ASSERT FAILED tl_hash[%u] == %p, expected NULL\n",
					(int)(h - mdev->tl_hash), h->first);
		kfree(mdev->tl_hash);
		mdev->tl_hash = NULL;
		mdev->tl_hash_s = 0;
		spin_unlock_irq(&mdev->req_lock);

		crypto_free_hash(mdev->cram_hmac_tfm);
		mdev->cram_hmac_tfm = NULL;

		kfree(mdev->net_conf);
		mdev->net_conf = NULL;
		drbd_request_state(mdev, NS(conn, C_STANDALONE));
	}

	/* tcp_close and release of sendpage pages can be deferred.  I don't
	 * want to use SO_LINGER, because apparently it can be deferred for
	 * more than 20 seconds (longest time I checked).
	 *
	 * Actually we don't care for exactly when the network stack does its
	 * put_page(), but release our reference on these pages right here.
	 */
	i = drbd_release_ee(mdev, &mdev->net_ee);
	if (i)
		dev_info(DEV, "net_ee not empty, killed %u entries\n", i);
	i = atomic_read(&mdev->pp_in_use);
	if (i)
		dev_info(DEV, "pp_in_use = %u, expected 0\n", i);

	D_ASSERT(list_empty(&mdev->read_ee));
	D_ASSERT(list_empty(&mdev->active_ee));
	D_ASSERT(list_empty(&mdev->sync_ee));
	D_ASSERT(list_empty(&mdev->done_ee));

	/* ok, no more ee's on the fly, it is safe to reset the epoch_size */
	atomic_set(&mdev->current_epoch->epoch_size, 0);
	D_ASSERT(list_empty(&mdev->current_epoch->list));
}

/*
 * We support PRO_VERSION_MIN to PRO_VERSION_MAX.  The protocol version
 * we can agree on is stored in agreed_pro_version.
 *
 * feature flags and the reserved array should be enough room for future
 * enhancements of the handshake protocol, and possible plugins...
 *
 * for now, they are expected to be zero, but ignored.
 */
static int drbd_send_handshake(struct drbd_conf *mdev)
{
	/* ASSERT current == mdev->receiver ... */
	struct p_handshake *p = &mdev->data.sbuf.handshake;
	int ok;

	if (mutex_lock_interruptible(&mdev->data.mutex)) {
		dev_err(DEV, "interrupted during initial handshake\n");
		return 0; /* interrupted. not ok. */
	}

	if (mdev->data.socket == NULL) {
		mutex_unlock(&mdev->data.mutex);
		return 0;
	}

	memset(p, 0, sizeof(*p));
	p->protocol_min = cpu_to_be32(PRO_VERSION_MIN);
	p->protocol_max = cpu_to_be32(PRO_VERSION_MAX);
	ok = _drbd_send_cmd(mdev, mdev->data.socket, P_HAND_SHAKE,
			    (struct p_header *)p, sizeof(*p), 0);
	mutex_unlock(&mdev->data.mutex);
	return ok;
}

/*
 * return values:
 *   1 yes, we have a valid connection
 *   0 oops, did not work out, please try again
 *  -1 peer talks different language,
 *     no point in trying again, please go standalone.
 */
static int drbd_do_handshake(struct drbd_conf *mdev)
{
	/* ASSERT current == mdev->receiver ... */
	struct p_handshake *p = &mdev->data.rbuf.handshake;
	const int expect = sizeof(struct p_handshake)
			 - sizeof(struct p_header);
	int rv;

	rv = drbd_send_handshake(mdev);
	if (!rv)
		return 0;

	rv = drbd_recv_header(mdev, &p->head);
	if (!rv)
		return 0;

	if (p->head.command != P_HAND_SHAKE) {
		dev_err(DEV, "expected HandShake packet, received: %s (0x%04x)\n",
			cmdname(p->head.command), p->head.command);
		return -1;
	}

	if (p->head.length != expect) {
		dev_err(DEV, "expected HandShake length: %u, received: %u\n",
			expect, p->head.length);
		return -1;
	}

	rv = drbd_recv(mdev, &p->head.payload, expect);

	if (rv != expect) {
		dev_err(DEV, "short read receiving handshake packet: l=%u\n", rv);
		return 0;
	}

	trace_drbd_packet(mdev, mdev->data.socket, 2, &mdev->data.rbuf,
			  __FILE__, __LINE__);

	p->protocol_min = be32_to_cpu(p->protocol_min);
	p->protocol_max = be32_to_cpu(p->protocol_max);
	if (p->protocol_max == 0)
		p->protocol_max = p->protocol_min;

	if (PRO_VERSION_MAX < p->protocol_min ||
	    PRO_VERSION_MIN > p->protocol_max)
		goto incompat;

	mdev->agreed_pro_version = min_t(int, PRO_VERSION_MAX, p->protocol_max);

	dev_info(DEV, "Handshake successful: "
		 "Agreed network protocol version %d\n", mdev->agreed_pro_version);

	return 1;

 incompat:
	dev_err(DEV, "incompatible DRBD dialects: "
		"I support %d-%d, peer supports %d-%d\n",
		PRO_VERSION_MIN, PRO_VERSION_MAX,
		p->protocol_min, p->protocol_max);
	return -1;
}

#if !defined(CONFIG_CRYPTO_HMAC) && !defined(CONFIG_CRYPTO_HMAC_MODULE)
static int drbd_do_auth(struct drbd_conf *mdev)
{
	dev_err(DEV, "This kernel was built without CONFIG_CRYPTO_HMAC.\n");
	dev_err(DEV, "You need to disable 'cram-hmac-alg' in drbd.conf.\n");
	return 0;
}
#else
#define CHALLENGE_LEN 64
static int drbd_do_auth(struct drbd_conf *mdev)
{
	char my_challenge[CHALLENGE_LEN];  /* 64 Bytes... */
	struct scatterlist sg;
	char *response = NULL;
	char *right_response = NULL;
	char *peers_ch = NULL;
	struct p_header p;
	unsigned int key_len = strlen(mdev->net_conf->shared_secret);
	unsigned int resp_size;
	struct hash_desc desc;
	int rv;

	desc.tfm = mdev->cram_hmac_tfm;
	desc.flags = 0;

	rv = crypto_hash_setkey(mdev->cram_hmac_tfm,
				(u8 *)mdev->net_conf->shared_secret, key_len);
	if (rv) {
		dev_err(DEV, "crypto_hash_setkey() failed with %d\n", rv);
		rv = 0;
		goto fail;
	}

	get_random_bytes(my_challenge, CHALLENGE_LEN);

	rv = drbd_send_cmd2(mdev, P_AUTH_CHALLENGE, my_challenge, CHALLENGE_LEN);
	if (!rv)
		goto fail;

	rv = drbd_recv_header(mdev, &p);
	if (!rv)
		goto fail;

	if (p.command != P_AUTH_CHALLENGE) {
		dev_err(DEV, "expected AuthChallenge packet, received: %s (0x%04x)\n",
			cmdname(p.command), p.command);
		rv = 0;
		goto fail;
	}

	if (p.length > CHALLENGE_LEN*2) {
		dev_err(DEV, "AuthChallenge payload too big.\n");
		rv = 0;
		goto fail;
	}

	peers_ch = kmalloc(p.length, GFP_NOIO);
	if (peers_ch == NULL) {
		dev_err(DEV, "kmalloc of peers_ch failed\n");
		rv = 0;
		goto fail;
	}

	rv = drbd_recv(mdev, peers_ch, p.length);

	if (rv != p.length) {
		dev_err(DEV, "short read AuthChallenge: l=%u\n", rv);
		rv = 0;
		goto fail;
	}

	resp_size = crypto_hash_digestsize(mdev->cram_hmac_tfm);
	response = kmalloc(resp_size, GFP_NOIO);
	if (response == NULL) {
		dev_err(DEV, "kmalloc of response failed\n");
		rv = 0;
		goto fail;
	}

	sg_init_table(&sg, 1);
	sg_set_buf(&sg, peers_ch, p.length);

	rv = crypto_hash_digest(&desc, &sg, sg.length, response);
	if (rv) {
		dev_err(DEV, "crypto_hash_digest() failed with %d\n", rv);
		rv = 0;
		goto fail;
	}

	rv = drbd_send_cmd2(mdev, P_AUTH_RESPONSE, response, resp_size);
	if (!rv)
		goto fail;

	rv = drbd_recv_header(mdev, &p);
	if (!rv)
		goto fail;

	if (p.command != P_AUTH_RESPONSE) {
		dev_err(DEV, "expected AuthResponse packet, received: %s (0x%04x)\n",
			cmdname(p.command), p.command);
		rv = 0;
		goto fail;
	}

	if (p.length != resp_size) {
		dev_err(DEV, "AuthResponse payload of wrong size\n");
		rv = 0;
		goto fail;
	}

	rv = drbd_recv(mdev, response, resp_size);

	if (rv != resp_size) {
		dev_err(DEV, "short read receiving AuthResponse: l=%u\n", rv);
		rv = 0;
		goto fail;
	}

	right_response = kmalloc(resp_size, GFP_NOIO);
	if (right_response == NULL) {
		dev_err(DEV, "kmalloc of right_response failed\n");
		rv = 0;
		goto fail;
	}

	sg_set_buf(&sg, my_challenge, CHALLENGE_LEN);

	rv = crypto_hash_digest(&desc, &sg, sg.length, right_response);
	if (rv) {
		dev_err(DEV, "crypto_hash_digest() failed with %d\n", rv);
		rv = 0;
		goto fail;
	}

	rv = !memcmp(response, right_response, resp_size);

	if (rv)
		dev_info(DEV, "Peer authenticated using %d bytes of '%s' HMAC\n",
			 resp_size, mdev->net_conf->cram_hmac_alg);

 fail:
	kfree(peers_ch);
	kfree(response);
	kfree(right_response);

	return rv;
}
#endif

int drbdd_init(struct drbd_thread *thi)
{
	struct drbd_conf *mdev = thi->mdev;
	unsigned int minor = mdev_to_minor(mdev);
	int h;

	sprintf(current->comm, "drbd%d_receiver", minor);

	dev_info(DEV, "receiver (re)started\n");

	do {
		h = drbd_connect(mdev);
		if (h == 0) {
			drbd_disconnect(mdev);
			__set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(HZ);
		}
		if (h == -1) {
			dev_warn(DEV, "Discarding network configuration.\n");
			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
		}
	} while (h == 0);

	if (h > 0) {
		if (get_net_conf(mdev)) {
			drbdd(mdev);
			put_net_conf(mdev);
		}
	}

	drbd_disconnect(mdev);

	dev_info(DEV, "receiver terminated\n");
	return 0;
}

/* ********* acknowledge sender ******** */

static int got_RqSReply(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_req_state_reply *p = (struct p_req_state_reply *)h;

	int retcode = be32_to_cpu(p->retcode);

	if (retcode >= SS_SUCCESS) {
		set_bit(CL_ST_CHG_SUCCESS, &mdev->flags);
	} else {
		set_bit(CL_ST_CHG_FAIL, &mdev->flags);
		dev_err(DEV, "Requested state change failed by peer: %s (%d)\n",
			drbd_set_st_err_str(retcode), retcode);
	}
	wake_up(&mdev->state_wait);

	return TRUE;
}

static int got_Ping(struct drbd_conf *mdev, struct p_header *h)
{
	return drbd_send_ping_ack(mdev);
}

static int got_PingAck(struct drbd_conf *mdev, struct p_header *h)
{
	/* restore idle timeout */
	mdev->meta.socket->sk->sk_rcvtimeo = mdev->net_conf->ping_int*HZ;

	return TRUE;
}

static int got_IsInSync(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_block_ack *p = (struct p_block_ack *)h;
	sector_t sector = be64_to_cpu(p->sector);
	int blksize = be32_to_cpu(p->blksize);

	D_ASSERT(mdev->agreed_pro_version >= 89);

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));

	drbd_rs_complete_io(mdev, sector);
	drbd_set_in_sync(mdev, sector, blksize);
	/* rs_same_csums is supposed to count in units of BM_BLOCK_SIZE */
	mdev->rs_same_csum += (blksize >> BM_BLOCK_SHIFT);
	dec_rs_pending(mdev);

	return TRUE;
}

/* when we receive the ACK for a write request,
 * verify that we actually know about it */
static struct drbd_request *_ack_id_to_req(struct drbd_conf *mdev,
	u64 id, sector_t sector)
{
	struct hlist_head *slot = tl_hash_slot(mdev, sector);
	struct hlist_node *n;
	struct drbd_request *req;

	hlist_for_each_entry(req, n, slot, colision) {
		if ((unsigned long)req == (unsigned long)id) {
			if (req->sector != sector) {
				dev_err(DEV, "_ack_id_to_req: found req %p but it has "
					"wrong sector (%llus versus %llus)\n", req,
					(unsigned long long)req->sector,
					(unsigned long long)sector);
				break;
			}
			return req;
		}
	}
	dev_err(DEV, "_ack_id_to_req: failed to find req %p, sector %llus in list\n",
		(void *)(unsigned long)id, (unsigned long long)sector);
	return NULL;
}

typedef struct drbd_request *(req_validator_fn)
	(struct drbd_conf *mdev, u64 id, sector_t sector);

static int validate_req_change_req_state(struct drbd_conf *mdev,
	u64 id, sector_t sector, req_validator_fn validator,
	const char *func, enum drbd_req_event what)
{
	struct drbd_request *req;
	struct bio_and_error m;

	spin_lock_irq(&mdev->req_lock);
	req = validator(mdev, id, sector);
	if (unlikely(!req)) {
		spin_unlock_irq(&mdev->req_lock);
		dev_err(DEV, "%s: got a corrupt block_id/sector pair\n", func);
		return FALSE;
	}
	__req_mod(req, what, &m);
	spin_unlock_irq(&mdev->req_lock);

	if (m.bio)
		complete_master_bio(mdev, &m);
	return TRUE;
}

static int got_BlockAck(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_block_ack *p = (struct p_block_ack *)h;
	sector_t sector = be64_to_cpu(p->sector);
	int blksize = be32_to_cpu(p->blksize);
	enum drbd_req_event what;

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));

	if (is_syncer_block_id(p->block_id)) {
		drbd_set_in_sync(mdev, sector, blksize);
		dec_rs_pending(mdev);
		return TRUE;
	}
	switch (be16_to_cpu(h->command)) {
	case P_RS_WRITE_ACK:
		D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
		what = write_acked_by_peer_and_sis;
		break;
	case P_WRITE_ACK:
		D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
		what = write_acked_by_peer;
		break;
	case P_RECV_ACK:
		D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_B);
		what = recv_acked_by_peer;
		break;
	case P_DISCARD_ACK:
		D_ASSERT(mdev->net_conf->wire_protocol == DRBD_PROT_C);
		what = conflict_discarded_by_peer;
		break;
	default:
		D_ASSERT(0);
		return FALSE;
	}

	return validate_req_change_req_state(mdev, p->block_id, sector,
		_ack_id_to_req, __func__, what);
}

static int got_NegAck(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_block_ack *p = (struct p_block_ack *)h;
	sector_t sector = be64_to_cpu(p->sector);

	if (__ratelimit(&drbd_ratelimit_state))
		dev_warn(DEV, "Got NegAck packet. Peer is in trouble?\n");

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));

	if (is_syncer_block_id(p->block_id)) {
		int size = be32_to_cpu(p->blksize);
		dec_rs_pending(mdev);
		drbd_rs_failed_io(mdev, sector, size);
		return TRUE;
	}
	return validate_req_change_req_state(mdev, p->block_id, sector,
		_ack_id_to_req, __func__, neg_acked);
}

static int got_NegDReply(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_block_ack *p = (struct p_block_ack *)h;
	sector_t sector = be64_to_cpu(p->sector);

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));
	dev_err(DEV, "Got NegDReply; Sector %llus, len %u; Fail original request.\n",
		(unsigned long long)sector, be32_to_cpu(p->blksize));

	return validate_req_change_req_state(mdev, p->block_id, sector,
		_ar_id_to_req, __func__, neg_acked);
}

static int got_NegRSDReply(struct drbd_conf *mdev, struct p_header *h)
{
	sector_t sector;
	int size;
	struct p_block_ack *p = (struct p_block_ack *)h;

	sector = be64_to_cpu(p->sector);
	size = be32_to_cpu(p->blksize);
	D_ASSERT(p->block_id == ID_SYNCER);

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));

	dec_rs_pending(mdev);

	if (get_ldev_if_state(mdev, D_FAILED)) {
		drbd_rs_complete_io(mdev, sector);
		drbd_rs_failed_io(mdev, sector, size);
		put_ldev(mdev);
	}

	return TRUE;
}

static int got_BarrierAck(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_barrier_ack *p = (struct p_barrier_ack *)h;

	tl_release(mdev, p->barrier, be32_to_cpu(p->set_size));

	return TRUE;
}

static int got_OVResult(struct drbd_conf *mdev, struct p_header *h)
{
	struct p_block_ack *p = (struct p_block_ack *)h;
	struct drbd_work *w;
	sector_t sector;
	int size;

	sector = be64_to_cpu(p->sector);
	size = be32_to_cpu(p->blksize);

	update_peer_seq(mdev, be32_to_cpu(p->seq_num));

	if (be64_to_cpu(p->block_id) == ID_OUT_OF_SYNC)
		drbd_ov_oos_found(mdev, sector, size);
	else
		ov_oos_print(mdev);

	drbd_rs_complete_io(mdev, sector);
	dec_rs_pending(mdev);

	if (--mdev->ov_left == 0) {
		w = kmalloc(sizeof(*w), GFP_NOIO);
		if (w) {
			w->cb = w_ov_finished;
			drbd_queue_work_front(&mdev->data.work, w);
		} else {
			dev_err(DEV, "kmalloc(w) failed.");
			ov_oos_print(mdev);
			drbd_resync_finished(mdev);
		}
	}
	return TRUE;
}

struct asender_cmd {
	size_t pkt_size;
	int (*process)(struct drbd_conf *mdev, struct p_header *h);
};

static struct asender_cmd *get_asender_cmd(int cmd)
{
	static struct asender_cmd asender_tbl[] = {
		/* anything missing from this table is in
		 * the drbd_cmd_handler (drbd_default_handler) table,
		 * see the beginning of drbdd() */
	[P_PING]	    = { sizeof(struct p_header), got_Ping },
	[P_PING_ACK]	    = { sizeof(struct p_header), got_PingAck },
	[P_RECV_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
	[P_WRITE_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
	[P_RS_WRITE_ACK]    = { sizeof(struct p_block_ack), got_BlockAck },
	[P_DISCARD_ACK]	    = { sizeof(struct p_block_ack), got_BlockAck },
	[P_NEG_ACK]	    = { sizeof(struct p_block_ack), got_NegAck },
	[P_NEG_DREPLY]	    = { sizeof(struct p_block_ack), got_NegDReply },
	[P_NEG_RS_DREPLY]   = { sizeof(struct p_block_ack), got_NegRSDReply },
	[P_OV_RESULT]	    = { sizeof(struct p_block_ack), got_OVResult },
	[P_BARRIER_ACK]	    = { sizeof(struct p_barrier_ack), got_BarrierAck },
	[P_STATE_CHG_REPLY] = { sizeof(struct p_req_state_reply), got_RqSReply },
	[P_RS_IS_IN_SYNC]   = { sizeof(struct p_block_ack), got_IsInSync },
	[P_MAX_CMD]	    = { 0, NULL },
	};
	if (cmd > P_MAX_CMD || asender_tbl[cmd].process == NULL)
		return NULL;
	return &asender_tbl[cmd];
}

int drbd_asender(struct drbd_thread *thi)
{
	struct drbd_conf *mdev = thi->mdev;
	struct p_header *h = &mdev->meta.rbuf.header;
	struct asender_cmd *cmd = NULL;

	int rv, len;
	void *buf    = h;
	int received = 0;
	int expect   = sizeof(struct p_header);
	int empty;

	sprintf(current->comm, "drbd%d_asender", mdev_to_minor(mdev));

	current->policy = SCHED_RR;  /* Make this a realtime task! */
	current->rt_priority = 2;    /* more important than all other tasks */

	while (get_t_state(thi) == Running) {
		drbd_thread_current_set_cpu(mdev);
		if (test_and_clear_bit(SEND_PING, &mdev->flags)) {
			ERR_IF(!drbd_send_ping(mdev)) goto reconnect;
			mdev->meta.socket->sk->sk_rcvtimeo =
				mdev->net_conf->ping_timeo*HZ/10;
		}

		/* conditionally cork;
		 * it may hurt latency if we cork without much to send */
		if (!mdev->net_conf->no_cork &&
		    3 < atomic_read(&mdev->unacked_cnt))
			drbd_tcp_cork(mdev->meta.socket);
		while (1) {
			clear_bit(SIGNAL_ASENDER, &mdev->flags);
			flush_signals(current);
			if (!drbd_process_done_ee(mdev)) {
				dev_err(DEV, "process_done_ee() = NOT_OK\n");
				goto reconnect;
			}
			/* to avoid race with newly queued ACKs */
			set_bit(SIGNAL_ASENDER, &mdev->flags);
			spin_lock_irq(&mdev->req_lock);
			empty = list_empty(&mdev->done_ee);
			spin_unlock_irq(&mdev->req_lock);
			/* new ack may have been queued right here,
			 * but then there is also a signal pending,
			 * and we start over... */
			if (empty)
				break;
		}
		/* but unconditionally uncork unless disabled */
		if (!mdev->net_conf->no_cork)
			drbd_tcp_uncork(mdev->meta.socket);

		/* short circuit, recv_msg would return EINTR anyways. */
		if (signal_pending(current))
			continue;

		rv = drbd_recv_short(mdev, mdev->meta.socket,
				     buf, expect-received, 0);
		clear_bit(SIGNAL_ASENDER, &mdev->flags);

		flush_signals(current);

		/* Note:
		 * -EINTR	 (on meta) we got a signal
		 * -EAGAIN	 (on meta) rcvtimeo expired
		 * -ECONNRESET	 other side closed the connection
		 * -ERESTARTSYS  (on data) we got a signal
		 * rv <  0	 other than above: unexpected error!
		 * rv == expected: full header or command
		 * rv <  expected: "woken" by signal during receive
		 * rv == 0	 : "connection shut down by peer"
		 */
		if (likely(rv > 0)) {
			received += rv;
			buf	 += rv;
		} else if (rv == 0) {
			dev_err(DEV, "meta connection shut down by peer.\n");
			goto reconnect;
		} else if (rv == -EAGAIN) {
			if (mdev->meta.socket->sk->sk_rcvtimeo ==
			    mdev->net_conf->ping_timeo*HZ/10) {
				dev_err(DEV, "PingAck did not arrive in time.\n");
				goto reconnect;
			}
			set_bit(SEND_PING, &mdev->flags);
			continue;
		} else if (rv == -EINTR) {
			continue;
		} else {
			dev_err(DEV, "sock_recvmsg returned %d\n", rv);
			goto reconnect;
		}

		if (received == expect && cmd == NULL) {
			if (unlikely(h->magic != BE_DRBD_MAGIC)) {
				dev_err(DEV, "magic?? on meta m: 0x%lx c: %d l: %d\n",
					(long)be32_to_cpu(h->magic),
					h->command, h->length);
				goto reconnect;
			}
			cmd = get_asender_cmd(be16_to_cpu(h->command));
			len = be16_to_cpu(h->length);
			if (unlikely(cmd == NULL)) {
				dev_err(DEV, "unknown command?? on meta m: 0x%lx c: %d l: %d\n",
					(long)be32_to_cpu(h->magic),
					h->command, h->length);
				goto disconnect;
			}
			expect = cmd->pkt_size;
			ERR_IF(len != expect-sizeof(struct p_header)) {
				trace_drbd_packet(mdev, mdev->meta.socket, 1, (void *)h, __FILE__, __LINE__);
				goto reconnect;
			}
		}
		if (received == expect) {
			D_ASSERT(cmd != NULL);
			trace_drbd_packet(mdev, mdev->meta.socket, 1, (void *)h, __FILE__, __LINE__);
			if (!cmd->process(mdev, h))
				goto reconnect;

			buf	 = h;
			received = 0;
			expect	 = sizeof(struct p_header);
			cmd	 = NULL;
		}
	}

	if (0) {
reconnect:
		drbd_force_state(mdev, NS(conn, C_NETWORK_FAILURE));
	}
	if (0) {
disconnect:
		drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
	}
	clear_bit(SIGNAL_ASENDER, &mdev->flags);

	D_ASSERT(mdev->state.conn < C_CONNECTED);
	dev_info(DEV, "asender terminated\n");

	return 0;
}
+1132
drivers/block/drbd/drbd_req.c
···
/*
   drbd_req.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

 */

#include <linux/autoconf.h>
#include <linux/module.h>

#include <linux/slab.h>
#include <linux/drbd.h>
#include "drbd_int.h"
#include "drbd_tracing.h"
#include "drbd_req.h"


/* Update disk stats at start of I/O request */
static void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio)
{
	const int rw = bio_data_dir(bio);
	int cpu;
	cpu = part_stat_lock();
	part_stat_inc(cpu, &mdev->vdisk->part0, ios[rw]);
	part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], bio_sectors(bio));
	part_stat_unlock();
	mdev->vdisk->part0.in_flight[rw]++;
}

/* Update disk stats when completing request upwards */
static void _drbd_end_io_acct(struct drbd_conf *mdev, struct drbd_request *req)
{
	int rw = bio_data_dir(req->master_bio);
	unsigned long duration = jiffies - req->start_time;
	int cpu;
	cpu = part_stat_lock();
	part_stat_add(cpu, &mdev->vdisk->part0, ticks[rw], duration);
	part_round_stats(cpu, &mdev->vdisk->part0);
	part_stat_unlock();
	mdev->vdisk->part0.in_flight[rw]--;
}

static void _req_is_done(struct drbd_conf *mdev, struct drbd_request *req, const int rw)
{
	const unsigned long s = req->rq_state;
	/* if it was a write, we may have to set the corresponding
	 * bit(s) out-of-sync first. If it had a local part, we need to
	 * release the reference to the activity log. */
	if (rw == WRITE) {
		/* remove it from the transfer log.
		 * well, only if it had been there in the first
		 * place... if it had not (local only or conflicting
		 * and never sent), it should still be "empty" as
		 * initialized in drbd_req_new(), so we can list_del() it
		 * here unconditionally */
		list_del(&req->tl_requests);
		/* Set out-of-sync unless both OK flags are set
		 * (local only or remote failed).
		 * Other places where we set out-of-sync:
		 * READ with local io-error */
		if (!(s & RQ_NET_OK) || !(s & RQ_LOCAL_OK))
			drbd_set_out_of_sync(mdev, req->sector, req->size);

		if ((s & RQ_NET_OK) && (s & RQ_LOCAL_OK) && (s & RQ_NET_SIS))
			drbd_set_in_sync(mdev, req->sector, req->size);

		/* one might be tempted to move the drbd_al_complete_io
		 * to the local io completion callback drbd_endio_pri.
		 * but, if this was a mirror write, we may only
		 * drbd_al_complete_io after this is RQ_NET_DONE,
		 * otherwise the extent could be dropped from the al
		 * before it has actually been written on the peer.
		 * if we crash before our peer knows about the request,
		 * but after the extent has been dropped from the al,
		 * we would forget to resync the corresponding extent.
		 */
		if (s & RQ_LOCAL_MASK) {
			if (get_ldev_if_state(mdev, D_FAILED)) {
				drbd_al_complete_io(mdev, req->sector);
				put_ldev(mdev);
			} else if (__ratelimit(&drbd_ratelimit_state)) {
				dev_warn(DEV, "Should have called drbd_al_complete_io(, %llu), "
					"but my Disk seems to have failed :(\n",
					(unsigned long long) req->sector);
			}
		}
	}

	/* if it was a local io error, we want to notify our
	 * peer about that, and see if we need to
	 * detach the disk and stuff.
	 * to avoid allocating some special work
	 * struct, reuse the request. */

	/* THINK
	 * why do we do this not when we detect the error,
	 * but delay it until it is "done", i.e. possibly
	 * until the next barrier ack? */

	if (rw == WRITE &&
	    ((s & RQ_LOCAL_MASK) && !(s & RQ_LOCAL_OK))) {
		if (!(req->w.list.next == LIST_POISON1 ||
		      list_empty(&req->w.list))) {
			/* DEBUG ASSERT only; if this triggers, we
			 * probably corrupt the worker list here */
			dev_err(DEV, "req->w.list.next = %p\n", req->w.list.next);
			dev_err(DEV, "req->w.list.prev = %p\n", req->w.list.prev);
		}
		req->w.cb = w_io_error;
		drbd_queue_work(&mdev->data.work, &req->w);
		/* drbd_req_free() is done in w_io_error */
	} else {
		drbd_req_free(req);
	}
}

static void queue_barrier(struct drbd_conf *mdev)
{
	struct drbd_tl_epoch *b;

	/* We are within the req_lock. Once we queued the barrier for sending,
	 * we set the CREATE_BARRIER bit. It is cleared as soon as a new
	 * barrier/epoch object is added. This is the only place this bit is
	 * set. It indicates that the barrier for this epoch is already queued,
	 * and no new epoch has been created yet. */
	if (test_bit(CREATE_BARRIER, &mdev->flags))
		return;

	b = mdev->newest_tle;
	b->w.cb = w_send_barrier;
	/* inc_ap_pending done here, so we won't
	 * get imbalanced on connection loss.
	 * dec_ap_pending will be done in got_BarrierAck
	 * or (on connection loss) in tl_clear.  */
	inc_ap_pending(mdev);
	drbd_queue_work(&mdev->data.work, &b->w);
	set_bit(CREATE_BARRIER, &mdev->flags);
}

static void _about_to_complete_local_write(struct drbd_conf *mdev,
	struct drbd_request *req)
{
	const unsigned long s = req->rq_state;
	struct drbd_request *i;
	struct drbd_epoch_entry *e;
	struct hlist_node *n;
	struct hlist_head *slot;

	/* before we can signal completion to the upper layers,
	 * we may need to close the current epoch */
	if (mdev->state.conn >= C_CONNECTED &&
	    req->epoch == mdev->newest_tle->br_number)
		queue_barrier(mdev);

	/* we need to do the conflict detection stuff,
	 * if we have the ee_hash (two_primaries) and
	 * this has been on the network */
	if ((s & RQ_NET_DONE) && mdev->ee_hash != NULL) {
		const sector_t sector = req->sector;
		const int size = req->size;

		/* ASSERT:
		 * there must be no conflicting requests, since
		 * they must have been failed on the spot */
#define OVERLAPS overlaps(sector, size, i->sector, i->size)
		slot = tl_hash_slot(mdev, sector);
		hlist_for_each_entry(i, n, slot, colision) {
			if (OVERLAPS) {
				dev_alert(DEV, "LOGIC BUG: completed: %p %llus +%u; "
					"other: %p %llus +%u\n",
					req, (unsigned long long)sector, size,
					i, (unsigned long long)i->sector, i->size);
			}
		}

		/* maybe "wake" those conflicting epoch entries
		 * that wait for this request to finish.
		 *
		 * currently, there can be only _one_ such ee
		 * (well, or some more, which would be pending
		 * P_DISCARD_ACK not yet sent by the asender...),
		 * since we block the receiver thread upon the
		 * first conflict detection, which will wait on
		 * misc_wait.  maybe we want to assert that?
		 *
		 * anyways, if we found one,
		 * we just have to do a wake_up.  */
#undef OVERLAPS
#define OVERLAPS overlaps(sector, size, e->sector, e->size)
		slot = ee_hash_slot(mdev, req->sector);
		hlist_for_each_entry(e, n, slot, colision) {
			if (OVERLAPS) {
				wake_up(&mdev->misc_wait);
				break;
			}
		}
	}
#undef OVERLAPS
}

void complete_master_bio(struct drbd_conf *mdev,
		struct bio_and_error *m)
{
	trace_drbd_bio(mdev, "Rq", m->bio, 1, NULL);
	bio_endio(m->bio, m->error);
	dec_ap_bio(mdev);
}

/* Helper for __req_mod().
 * Set m->bio to the master bio, if it is fit to be completed,
 * or leave it alone (it is initialized to NULL in __req_mod),
 * if it has already been completed, or cannot be completed yet.
 * If m->bio is set, the error status to be returned is placed in m->error.
 */
void _req_may_be_done(struct drbd_request *req, struct bio_and_error *m)
{
	const unsigned long s = req->rq_state;
	struct drbd_conf *mdev = req->mdev;
	/* only WRITES may end up here without a master bio (on barrier ack) */
	int rw = req->master_bio ? bio_data_dir(req->master_bio) : WRITE;

	trace_drbd_req(req, nothing, "_req_may_be_done");

	/* we must not complete the master bio, while it is
	 *	still being processed by _drbd_send_zc_bio (drbd_send_dblock)
	 *	not yet acknowledged by the peer
	 *	not yet completed by the local io subsystem
	 * these flags may get cleared in any order by
	 *	the worker,
	 *	the receiver,
	 *	the bio_endio completion callbacks.
	 */
	if (s & RQ_NET_QUEUED)
		return;
	if (s & RQ_NET_PENDING)
		return;
	if (s & RQ_LOCAL_PENDING)
		return;

	if (req->master_bio) {
		/* this is data_received (remote read)
		 * or protocol C P_WRITE_ACK
		 * or protocol B P_RECV_ACK
		 * or protocol A "handed_over_to_network" (SendAck)
		 * or canceled or failed,
		 * or killed from the transfer log due to connection loss.
		 */

		/*
		 * figure out whether to report success or failure.
		 *
		 * report success when at least one of the operations succeeded.
		 * or, to put the other way,
		 * only report failure, when both operations failed.
		 *
		 * what to do about the failures is handled elsewhere.
		 * what we need to do here is just: complete the master_bio.
		 *
		 * local completion error, if any, has been stored as ERR_PTR
		 * in private_bio within drbd_endio_pri.
		 */
		int ok = (s & RQ_LOCAL_OK) || (s & RQ_NET_OK);
		int error = PTR_ERR(req->private_bio);

		/* remove the request from the conflict detection
		 * respective block_id verification hash */
		if (!hlist_unhashed(&req->colision))
			hlist_del(&req->colision);
		else
			D_ASSERT((s & RQ_NET_MASK) == 0);

		/* for writes we need to do some extra housekeeping */
		if (rw == WRITE)
			_about_to_complete_local_write(mdev, req);

		/* Update disk stats */
		_drbd_end_io_acct(mdev, req);

		m->error = ok ? 0 : (error ?: -EIO);
		m->bio = req->master_bio;
		req->master_bio = NULL;
	}

	if ((s & RQ_NET_MASK) == 0 || (s & RQ_NET_DONE)) {
		/* this is disconnected (local only) operation,
		 * or protocol C P_WRITE_ACK,
		 * or protocol A or B P_BARRIER_ACK,
		 * or killed from the transfer log due to connection loss. */
		_req_is_done(mdev, req, rw);
	}
	/* else: network part and not DONE yet. that is
	 * protocol A or B, barrier ack still pending... */
}

/*
 * checks whether there was an overlapping request
 * or ee already registered.
 *
 * if so, return 1, in which case this request is completed on the spot,
 * without ever being submitted or sent.
 *
 * return 0 if it is ok to submit this request.
 *
 * NOTE:
 * paranoia: assume something above us is broken, and issues different write
 * requests for the same block simultaneously...
 *
 * To ensure these won't be reordered differently on both nodes, resulting in
 * diverging data sets, we discard the later one(s). Not that this is supposed
 * to happen, but this is the rationale why we also have to check for
 * conflicting requests with local origin, and why we have to do so regardless
 * of whether we allowed multiple primaries.
 *
 * BTW, in case we only have one primary, the ee_hash is empty anyways, and the
 * second hlist_for_each_entry becomes a noop.
This is even simpler than to334334+ * grab a reference on the net_conf, and check for the two_primaries flag...335335+ */336336+static int _req_conflicts(struct drbd_request *req)337337+{338338+ struct drbd_conf *mdev = req->mdev;339339+ const sector_t sector = req->sector;340340+ const int size = req->size;341341+ struct drbd_request *i;342342+ struct drbd_epoch_entry *e;343343+ struct hlist_node *n;344344+ struct hlist_head *slot;345345+346346+ D_ASSERT(hlist_unhashed(&req->colision));347347+348348+ if (!get_net_conf(mdev))349349+ return 0;350350+351351+ /* BUG_ON */352352+ ERR_IF (mdev->tl_hash_s == 0)353353+ goto out_no_conflict;354354+ BUG_ON(mdev->tl_hash == NULL);355355+356356+#define OVERLAPS overlaps(i->sector, i->size, sector, size)357357+ slot = tl_hash_slot(mdev, sector);358358+ hlist_for_each_entry(i, n, slot, colision) {359359+ if (OVERLAPS) {360360+ dev_alert(DEV, "%s[%u] Concurrent local write detected! "361361+ "[DISCARD L] new: %llus +%u; "362362+ "pending: %llus +%u\n",363363+ current->comm, current->pid,364364+ (unsigned long long)sector, size,365365+ (unsigned long long)i->sector, i->size);366366+ goto out_conflict;367367+ }368368+ }369369+370370+ if (mdev->ee_hash_s) {371371+ /* now, check for overlapping requests with remote origin */372372+ BUG_ON(mdev->ee_hash == NULL);373373+#undef OVERLAPS374374+#define OVERLAPS overlaps(e->sector, e->size, sector, size)375375+ slot = ee_hash_slot(mdev, sector);376376+ hlist_for_each_entry(e, n, slot, colision) {377377+ if (OVERLAPS) {378378+ dev_alert(DEV, "%s[%u] Concurrent remote write detected!"379379+ " [DISCARD L] new: %llus +%u; "380380+ "pending: %llus +%u\n",381381+ current->comm, current->pid,382382+ (unsigned long long)sector, size,383383+ (unsigned long long)e->sector, e->size);384384+ goto out_conflict;385385+ }386386+ }387387+ }388388+#undef OVERLAPS389389+390390+out_no_conflict:391391+ /* this is like it should be, and what we expected.392392+ * our users do behave after all... 
*/393393+ put_net_conf(mdev);394394+ return 0;395395+396396+out_conflict:397397+ put_net_conf(mdev);398398+ return 1;399399+}400400+401401+/* obviously this could be coded as many single functions402402+ * instead of one huge switch,403403+ * or by putting the code directly in the respective locations404404+ * (as it has been before).405405+ *406406+ * but having it this way407407+ * enforces that it is all in this one place, where it is easier to audit,408408+ * it makes it obvious that whatever "event" "happens" to a request should409409+ * happen "atomically" within the req_lock,410410+ * and it enforces that we have to think in a very structured manner411411+ * about the "events" that may happen to a request during its life time ...412412+ */413413+void __req_mod(struct drbd_request *req, enum drbd_req_event what,414414+ struct bio_and_error *m)415415+{416416+ struct drbd_conf *mdev = req->mdev;417417+ m->bio = NULL;418418+419419+ trace_drbd_req(req, what, NULL);420420+421421+ switch (what) {422422+ default:423423+ dev_err(DEV, "LOGIC BUG in %s:%u\n", __FILE__ , __LINE__);424424+ break;425425+426426+ /* does not happen...427427+ * initialization done in drbd_req_new428428+ case created:429429+ break;430430+ */431431+432432+ case to_be_send: /* via network */433433+ /* reached via drbd_make_request_common434434+ * and from w_read_retry_remote */435435+ D_ASSERT(!(req->rq_state & RQ_NET_MASK));436436+ req->rq_state |= RQ_NET_PENDING;437437+ inc_ap_pending(mdev);438438+ break;439439+440440+ case to_be_submitted: /* locally */441441+ /* reached via drbd_make_request_common */442442+ D_ASSERT(!(req->rq_state & RQ_LOCAL_MASK));443443+ req->rq_state |= RQ_LOCAL_PENDING;444444+ break;445445+446446+ case completed_ok:447447+ if (bio_data_dir(req->master_bio) == WRITE)448448+ mdev->writ_cnt += req->size>>9;449449+ else450450+ mdev->read_cnt += req->size>>9;451451+452452+ req->rq_state |= (RQ_LOCAL_COMPLETED|RQ_LOCAL_OK);453453+ req->rq_state &= 
~RQ_LOCAL_PENDING;454454+455455+ _req_may_be_done(req, m);456456+ put_ldev(mdev);457457+ break;458458+459459+ case write_completed_with_error:460460+ req->rq_state |= RQ_LOCAL_COMPLETED;461461+ req->rq_state &= ~RQ_LOCAL_PENDING;462462+463463+ dev_alert(DEV, "Local WRITE failed sec=%llus size=%u\n",464464+ (unsigned long long)req->sector, req->size);465465+ /* and now: check how to handle local io error. */466466+ __drbd_chk_io_error(mdev, FALSE);467467+ _req_may_be_done(req, m);468468+ put_ldev(mdev);469469+ break;470470+471471+ case read_ahead_completed_with_error:472472+ /* it is legal to fail READA */473473+ req->rq_state |= RQ_LOCAL_COMPLETED;474474+ req->rq_state &= ~RQ_LOCAL_PENDING;475475+ _req_may_be_done(req, m);476476+ put_ldev(mdev);477477+ break;478478+479479+ case read_completed_with_error:480480+ drbd_set_out_of_sync(mdev, req->sector, req->size);481481+482482+ req->rq_state |= RQ_LOCAL_COMPLETED;483483+ req->rq_state &= ~RQ_LOCAL_PENDING;484484+485485+ dev_alert(DEV, "Local READ failed sec=%llus size=%u\n",486486+ (unsigned long long)req->sector, req->size);487487+ /* _req_mod(req,to_be_send); oops, recursion... */488488+ D_ASSERT(!(req->rq_state & RQ_NET_MASK));489489+ req->rq_state |= RQ_NET_PENDING;490490+ inc_ap_pending(mdev);491491+492492+ __drbd_chk_io_error(mdev, FALSE);493493+ put_ldev(mdev);494494+ /* NOTE: if we have no connection,495495+ * or know the peer has no good data either,496496+ * then we don't actually need to "queue_for_net_read",497497+ * but we do so anyways, since the drbd_io_error()498498+ * and the potential state change to "Diskless"499499+ * needs to be done from process context */500500+501501+ /* fall through: _req_mod(req,queue_for_net_read); */502502+503503+ case queue_for_net_read:504504+ /* READ or READA, and505505+ * no local disk,506506+ * or target area marked as invalid,507507+ * or just got an io-error. 
*/508508+ /* from drbd_make_request_common509509+ * or from bio_endio during read io-error recovery */510510+511511+ /* so we can verify the handle in the answer packet512512+ * corresponding hlist_del is in _req_may_be_done() */513513+ hlist_add_head(&req->colision, ar_hash_slot(mdev, req->sector));514514+515515+ set_bit(UNPLUG_REMOTE, &mdev->flags); /* why? */516516+517517+ D_ASSERT(req->rq_state & RQ_NET_PENDING);518518+ req->rq_state |= RQ_NET_QUEUED;519519+ req->w.cb = (req->rq_state & RQ_LOCAL_MASK)520520+ ? w_read_retry_remote521521+ : w_send_read_req;522522+ drbd_queue_work(&mdev->data.work, &req->w);523523+ break;524524+525525+ case queue_for_net_write:526526+ /* assert something? */527527+ /* from drbd_make_request_common only */528528+529529+ hlist_add_head(&req->colision, tl_hash_slot(mdev, req->sector));530530+ /* corresponding hlist_del is in _req_may_be_done() */531531+532532+ /* NOTE533533+ * In case the req ended up on the transfer log before being534534+ * queued on the worker, it could lead to this request being535535+ * missed during cleanup after connection loss.536536+ * So we have to do both operations here,537537+ * within the same lock that protects the transfer log.538538+ *539539+ * _req_add_to_epoch(req); this has to be after the540540+ * _maybe_start_new_epoch(req); which happened in541541+ * drbd_make_request_common, because we now may set the bit542542+ * again ourselves to close the current epoch.543543+ *544544+ * Add req to the (now) current epoch (barrier). 
*/545545+546546+ /* see drbd_make_request_common,547547+ * just after it grabs the req_lock */548548+ D_ASSERT(test_bit(CREATE_BARRIER, &mdev->flags) == 0);549549+550550+ req->epoch = mdev->newest_tle->br_number;551551+ list_add_tail(&req->tl_requests,552552+ &mdev->newest_tle->requests);553553+554554+ /* increment size of current epoch */555555+ mdev->newest_tle->n_req++;556556+557557+ /* queue work item to send data */558558+ D_ASSERT(req->rq_state & RQ_NET_PENDING);559559+ req->rq_state |= RQ_NET_QUEUED;560560+ req->w.cb = w_send_dblock;561561+ drbd_queue_work(&mdev->data.work, &req->w);562562+563563+ /* close the epoch, in case it outgrew the limit */564564+ if (mdev->newest_tle->n_req >= mdev->net_conf->max_epoch_size)565565+ queue_barrier(mdev);566566+567567+ break;568568+569569+ case send_canceled:570570+ /* treat it the same */571571+ case send_failed:572572+ /* real cleanup will be done from tl_clear. just update flags573573+ * so it is no longer marked as on the worker queue */574574+ req->rq_state &= ~RQ_NET_QUEUED;575575+ /* if we did it right, tl_clear should be scheduled only after576576+ * this, so this should not be necessary! */577577+ _req_may_be_done(req, m);578578+ break;579579+580580+ case handed_over_to_network:581581+ /* assert something? */582582+ if (bio_data_dir(req->master_bio) == WRITE &&583583+ mdev->net_conf->wire_protocol == DRBD_PROT_A) {584584+ /* this is what is dangerous about protocol A:585585+ * pretend it was successfully written on the peer. */586586+ if (req->rq_state & RQ_NET_PENDING) {587587+ dec_ap_pending(mdev);588588+ req->rq_state &= ~RQ_NET_PENDING;589589+ req->rq_state |= RQ_NET_OK;590590+ } /* else: neg-ack was faster... 
*/591591+ /* it is still not yet RQ_NET_DONE until the592592+ * corresponding epoch barrier got acked as well,593593+ * so we know what to dirty on connection loss */594594+ }595595+ req->rq_state &= ~RQ_NET_QUEUED;596596+ req->rq_state |= RQ_NET_SENT;597597+ /* because _drbd_send_zc_bio could sleep, and may want to598598+ * dereference the bio even after the "write_acked_by_peer" and599599+ * "completed_ok" events came in, once we return from600600+ * _drbd_send_zc_bio (drbd_send_dblock), we have to check601601+ * whether it is done already, and end it. */602602+ _req_may_be_done(req, m);603603+ break;604604+605605+ case connection_lost_while_pending:606606+ /* transfer log cleanup after connection loss */607607+ /* assert something? */608608+ if (req->rq_state & RQ_NET_PENDING)609609+ dec_ap_pending(mdev);610610+ req->rq_state &= ~(RQ_NET_OK|RQ_NET_PENDING);611611+ req->rq_state |= RQ_NET_DONE;612612+ /* if it is still queued, we may not complete it here.613613+ * it will be canceled soon. */614614+ if (!(req->rq_state & RQ_NET_QUEUED))615615+ _req_may_be_done(req, m);616616+ break;617617+618618+ case write_acked_by_peer_and_sis:619619+ req->rq_state |= RQ_NET_SIS;620620+ case conflict_discarded_by_peer:621621+ /* for discarded conflicting writes of multiple primaries,622622+ * there is no need to keep anything in the tl, potential623623+ * node crashes are covered by the activity log. 
*/624624+ if (what == conflict_discarded_by_peer)625625+ dev_alert(DEV, "Got DiscardAck packet %llus +%u!"626626+ " DRBD is not a random data generator!\n",627627+ (unsigned long long)req->sector, req->size);628628+ req->rq_state |= RQ_NET_DONE;629629+ /* fall through */630630+ case write_acked_by_peer:631631+ /* protocol C; successfully written on peer.632632+ * Nothing to do here.633633+ * We want to keep the tl in place for all protocols, to cater634634+ * for volatile write-back caches on lower level devices.635635+ *636636+ * A barrier request is expected to have forced all prior637637+ * requests onto stable storage, so completion of a barrier638638+ * request could set NET_DONE right here, and not wait for the639639+ * P_BARRIER_ACK, but that is an unnecessary optimization. */640640+641641+ /* this makes it effectively the same as for: */642642+ case recv_acked_by_peer:643643+ /* protocol B; pretends to be successfully written on peer.644644+ * see also notes above in handed_over_to_network about645645+ * protocol != C */646646+ req->rq_state |= RQ_NET_OK;647647+ D_ASSERT(req->rq_state & RQ_NET_PENDING);648648+ dec_ap_pending(mdev);649649+ req->rq_state &= ~RQ_NET_PENDING;650650+ _req_may_be_done(req, m);651651+ break;652652+653653+ case neg_acked:654654+ /* assert something? */655655+ if (req->rq_state & RQ_NET_PENDING)656656+ dec_ap_pending(mdev);657657+ req->rq_state &= ~(RQ_NET_OK|RQ_NET_PENDING);658658+659659+ req->rq_state |= RQ_NET_DONE;660660+ _req_may_be_done(req, m);661661+ /* else: done by handed_over_to_network */662662+ break;663663+664664+ case barrier_acked:665665+ if (req->rq_state & RQ_NET_PENDING) {666666+ /* barrier came in before all requests have been acked.667667+ * this is bad, because if the connection is lost now,668668+ * we won't be able to clean them up... 
*/669669+ dev_err(DEV, "FIXME (barrier_acked but pending)\n");670670+ trace_drbd_req(req, nothing, "FIXME (barrier_acked but pending)");671671+ list_move(&req->tl_requests, &mdev->out_of_sequence_requests);672672+ }673673+ D_ASSERT(req->rq_state & RQ_NET_SENT);674674+ req->rq_state |= RQ_NET_DONE;675675+ _req_may_be_done(req, m);676676+ break;677677+678678+ case data_received:679679+ D_ASSERT(req->rq_state & RQ_NET_PENDING);680680+ dec_ap_pending(mdev);681681+ req->rq_state &= ~RQ_NET_PENDING;682682+ req->rq_state |= (RQ_NET_OK|RQ_NET_DONE);683683+ _req_may_be_done(req, m);684684+ break;685685+ };686686+}687687+688688+/* we may do a local read if:689689+ * - we are consistent (of course),690690+ * - or we are generally inconsistent,691691+ * BUT we are still/already IN SYNC for this area.692692+ * since size may be bigger than BM_BLOCK_SIZE,693693+ * we may need to check several bits.694694+ */695695+static int drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int size)696696+{697697+ unsigned long sbnr, ebnr;698698+ sector_t esector, nr_sectors;699699+700700+ if (mdev->state.disk == D_UP_TO_DATE)701701+ return 1;702702+ if (mdev->state.disk >= D_OUTDATED)703703+ return 0;704704+ if (mdev->state.disk < D_INCONSISTENT)705705+ return 0;706706+ /* state.disk == D_INCONSISTENT We will have a look at the BitMap */707707+ nr_sectors = drbd_get_capacity(mdev->this_bdev);708708+ esector = sector + (size >> 9) - 1;709709+710710+ D_ASSERT(sector < nr_sectors);711711+ D_ASSERT(esector < nr_sectors);712712+713713+ sbnr = BM_SECT_TO_BIT(sector);714714+ ebnr = BM_SECT_TO_BIT(esector);715715+716716+ return 0 == drbd_bm_count_bits(mdev, sbnr, ebnr);717717+}718718+719719+static int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio)720720+{721721+ const int rw = bio_rw(bio);722722+ const int size = bio->bi_size;723723+ const sector_t sector = bio->bi_sector;724724+ struct drbd_tl_epoch *b = NULL;725725+ struct drbd_request *req;726726+ int local, 
remote;727727+ int err = -EIO;728728+729729+ /* allocate outside of all locks; */730730+ req = drbd_req_new(mdev, bio);731731+ if (!req) {732732+ dec_ap_bio(mdev);733733+ /* only pass the error to the upper layers.734734+ * if user cannot handle io errors, that's not our business. */735735+ dev_err(DEV, "could not kmalloc() req\n");736736+ bio_endio(bio, -ENOMEM);737737+ return 0;738738+ }739739+740740+ trace_drbd_bio(mdev, "Rq", bio, 0, req);741741+742742+ local = get_ldev(mdev);743743+ if (!local) {744744+ bio_put(req->private_bio); /* or we get a bio leak */745745+ req->private_bio = NULL;746746+ }747747+ if (rw == WRITE) {748748+ remote = 1;749749+ } else {750750+ /* READ || READA */751751+ if (local) {752752+ if (!drbd_may_do_local_read(mdev, sector, size)) {753753+ /* we could kick the syncer to754754+ * sync this extent asap, wait for755755+ * it, then continue locally.756756+ * Or just issue the request remotely.757757+ */758758+ local = 0;759759+ bio_put(req->private_bio);760760+ req->private_bio = NULL;761761+ put_ldev(mdev);762762+ }763763+ }764764+ remote = !local && mdev->state.pdsk >= D_UP_TO_DATE;765765+ }766766+767767+ /* If we have a disk, but a READA request is mapped to remote,768768+ * we are R_PRIMARY, D_INCONSISTENT, SyncTarget.769769+ * Just fail that READA request right here.770770+ *771771+ * THINK: maybe fail all READA when not local?772772+ * or make this configurable...773773+ * if network is slow, READA won't do any good.774774+ */775775+ if (rw == READA && mdev->state.disk >= D_INCONSISTENT && !local) {776776+ err = -EWOULDBLOCK;777777+ goto fail_and_free_req;778778+ }779779+780780+ /* For WRITES going to the local disk, grab a reference on the target781781+ * extent. This waits for any resync activity in the corresponding782782+ * resync extent to finish, and, if necessary, pulls in the target783783+ * extent into the activity log, which involves further disk io because784784+ * of transactional on-disk meta data updates. 
*/785785+ if (rw == WRITE && local)786786+ drbd_al_begin_io(mdev, sector);787787+788788+ remote = remote && (mdev->state.pdsk == D_UP_TO_DATE ||789789+ (mdev->state.pdsk == D_INCONSISTENT &&790790+ mdev->state.conn >= C_CONNECTED));791791+792792+ if (!(local || remote)) {793793+ dev_err(DEV, "IO ERROR: neither local nor remote disk\n");794794+ goto fail_free_complete;795795+ }796796+797797+ /* For WRITE request, we have to make sure that we have an798798+ * unused_spare_tle, in case we need to start a new epoch.799799+ * I try to be smart and avoid to pre-allocate always "just in case",800800+ * but there is a race between testing the bit and pointer outside the801801+ * spinlock, and grabbing the spinlock.802802+ * if we lost that race, we retry. */803803+ if (rw == WRITE && remote &&804804+ mdev->unused_spare_tle == NULL &&805805+ test_bit(CREATE_BARRIER, &mdev->flags)) {806806+allocate_barrier:807807+ b = kmalloc(sizeof(struct drbd_tl_epoch), GFP_NOIO);808808+ if (!b) {809809+ dev_err(DEV, "Failed to alloc barrier.\n");810810+ err = -ENOMEM;811811+ goto fail_free_complete;812812+ }813813+ }814814+815815+ /* GOOD, everything prepared, grab the spin_lock */816816+ spin_lock_irq(&mdev->req_lock);817817+818818+ if (remote) {819819+ remote = (mdev->state.pdsk == D_UP_TO_DATE ||820820+ (mdev->state.pdsk == D_INCONSISTENT &&821821+ mdev->state.conn >= C_CONNECTED));822822+ if (!remote)823823+ dev_warn(DEV, "lost connection while grabbing the req_lock!\n");824824+ if (!(local || remote)) {825825+ dev_err(DEV, "IO ERROR: neither local nor remote disk\n");826826+ spin_unlock_irq(&mdev->req_lock);827827+ goto fail_free_complete;828828+ }829829+ }830830+831831+ if (b && mdev->unused_spare_tle == NULL) {832832+ mdev->unused_spare_tle = b;833833+ b = NULL;834834+ }835835+ if (rw == WRITE && remote &&836836+ mdev->unused_spare_tle == NULL &&837837+ test_bit(CREATE_BARRIER, &mdev->flags)) {838838+ /* someone closed the current epoch839839+ * while we were grabbing the spinlock 
*/840840+ spin_unlock_irq(&mdev->req_lock);841841+ goto allocate_barrier;842842+ }843843+844844+845845+ /* Update disk stats */846846+ _drbd_start_io_acct(mdev, req, bio);847847+848848+ /* _maybe_start_new_epoch(mdev);849849+ * If we need to generate a write barrier packet, we have to add the850850+ * new epoch (barrier) object, and queue the barrier packet for sending,851851+ * and queue the req's data after it _within the same lock_, otherwise852852+ * we have race conditions were the reorder domains could be mixed up.853853+ *854854+ * Even read requests may start a new epoch and queue the corresponding855855+ * barrier packet. To get the write ordering right, we only have to856856+ * make sure that, if this is a write request and it triggered a857857+ * barrier packet, this request is queued within the same spinlock. */858858+ if (remote && mdev->unused_spare_tle &&859859+ test_and_clear_bit(CREATE_BARRIER, &mdev->flags)) {860860+ _tl_add_barrier(mdev, mdev->unused_spare_tle);861861+ mdev->unused_spare_tle = NULL;862862+ } else {863863+ D_ASSERT(!(remote && rw == WRITE &&864864+ test_bit(CREATE_BARRIER, &mdev->flags)));865865+ }866866+867867+ /* NOTE868868+ * Actually, 'local' may be wrong here already, since we may have failed869869+ * to write to the meta data, and may become wrong anytime because of870870+ * local io-error for some other request, which would lead to us871871+ * "detaching" the local disk.872872+ *873873+ * 'remote' may become wrong any time because the network could fail.874874+ *875875+ * This is a harmless race condition, though, since it is handled876876+ * correctly at the appropriate places; so it just defers the failure877877+ * of the respective operation.878878+ */879879+880880+ /* mark them early for readability.881881+ * this just sets some state flags. 
*/882882+ if (remote)883883+ _req_mod(req, to_be_send);884884+ if (local)885885+ _req_mod(req, to_be_submitted);886886+887887+ /* check this request on the collision detection hash tables.888888+ * if we have a conflict, just complete it here.889889+ * THINK do we want to check reads, too? (I don't think so...) */890890+ if (rw == WRITE && _req_conflicts(req)) {891891+ /* this is a conflicting request.892892+ * even though it may have been only _partially_893893+ * overlapping with one of the currently pending requests,894894+ * without even submitting or sending it, we will895895+ * pretend that it was successfully served right now.896896+ */897897+ if (local) {898898+ bio_put(req->private_bio);899899+ req->private_bio = NULL;900900+ drbd_al_complete_io(mdev, req->sector);901901+ put_ldev(mdev);902902+ local = 0;903903+ }904904+ if (remote)905905+ dec_ap_pending(mdev);906906+ _drbd_end_io_acct(mdev, req);907907+ /* THINK: do we want to fail it (-EIO), or pretend success? */908908+ bio_endio(req->master_bio, 0);909909+ req->master_bio = NULL;910910+ dec_ap_bio(mdev);911911+ drbd_req_free(req);912912+ remote = 0;913913+ }914914+915915+ /* NOTE remote first: to get the concurrent write detection right,916916+ * we must register the request before start of local IO. */917917+ if (remote) {918918+ /* either WRITE and C_CONNECTED,919919+ * or READ, and no local disk,920920+ * or READ, but not in sync.921921+ */922922+ _req_mod(req, (rw == WRITE)923923+ ? queue_for_net_write924924+ : queue_for_net_read);925925+ }926926+ spin_unlock_irq(&mdev->req_lock);927927+ kfree(b); /* if someone else has beaten us to it... */928928+929929+ if (local) {930930+ req->private_bio->bi_bdev = mdev->ldev->backing_bdev;931931+932932+ trace_drbd_bio(mdev, "Pri", req->private_bio, 0, NULL);933933+934934+ if (FAULT_ACTIVE(mdev, rw == WRITE ? DRBD_FAULT_DT_WR935935+ : rw == READ ? 
DRBD_FAULT_DT_RD936936+ : DRBD_FAULT_DT_RA))937937+ bio_endio(req->private_bio, -EIO);938938+ else939939+ generic_make_request(req->private_bio);940940+ }941941+942942+ /* we need to plug ALWAYS since we possibly need to kick lo_dev.943943+ * we plug after submit, so we won't miss an unplug event */944944+ drbd_plug_device(mdev);945945+946946+ return 0;947947+948948+fail_free_complete:949949+ if (rw == WRITE && local)950950+ drbd_al_complete_io(mdev, sector);951951+fail_and_free_req:952952+ if (local) {953953+ bio_put(req->private_bio);954954+ req->private_bio = NULL;955955+ put_ldev(mdev);956956+ }957957+ bio_endio(bio, err);958958+ drbd_req_free(req);959959+ dec_ap_bio(mdev);960960+ kfree(b);961961+962962+ return 0;963963+}964964+965965+/* helper function for drbd_make_request966966+ * if we can determine just by the mdev (state) that this request will fail,967967+ * return 1968968+ * otherwise return 0969969+ */970970+static int drbd_fail_request_early(struct drbd_conf *mdev, int is_write)971971+{972972+ /* Unconfigured */973973+ if (mdev->state.conn == C_DISCONNECTING &&974974+ mdev->state.disk == D_DISKLESS)975975+ return 1;976976+977977+ if (mdev->state.role != R_PRIMARY &&978978+ (!allow_oos || is_write)) {979979+ if (__ratelimit(&drbd_ratelimit_state)) {980980+ dev_err(DEV, "Process %s[%u] tried to %s; "981981+ "since we are not in Primary state, "982982+ "we cannot allow this\n",983983+ current->comm, current->pid,984984+ is_write ? "WRITE" : "READ");985985+ }986986+ return 1;987987+ }988988+989989+ /*990990+ * Paranoia: we might have been primary, but sync target, or991991+ * even diskless, then lost the connection.992992+ * This should have been handled (panic? suspend?) somewhere993993+ * else. 
But maybe it was not, so check again here.994994+ * Caution: as long as we do not have a read/write lock on mdev,995995+ * to serialize state changes, this is racy, since we may lose996996+ * the connection *after* we test for the cstate.997997+ */998998+ if (mdev->state.disk < D_UP_TO_DATE && mdev->state.pdsk < D_UP_TO_DATE) {999999+ if (__ratelimit(&drbd_ratelimit_state))10001000+ dev_err(DEV, "Sorry, I have no access to good data anymore.\n");10011001+ return 1;10021002+ }10031003+10041004+ return 0;10051005+}10061006+10071007+int drbd_make_request_26(struct request_queue *q, struct bio *bio)10081008+{10091009+ unsigned int s_enr, e_enr;10101010+ struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;10111011+10121012+ if (drbd_fail_request_early(mdev, bio_data_dir(bio) & WRITE)) {10131013+ bio_endio(bio, -EPERM);10141014+ return 0;10151015+ }10161016+10171017+ /* Reject barrier requests if we know the underlying device does10181018+ * not support them.10191019+ * XXX: Need to get this info from peer as well some how so we10201020+ * XXX: reject if EITHER side/data/metadata area does not support them.10211021+ *10221022+ * because of those XXX, this is not yet enabled,10231023+ * i.e. 
in drbd_init_set_defaults we set the NO_BARRIER_SUPP bit.10241024+ */10251025+ if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER) && test_bit(NO_BARRIER_SUPP, &mdev->flags))) {10261026+ /* dev_warn(DEV, "Rejecting barrier request as underlying device does not support\n"); */10271027+ bio_endio(bio, -EOPNOTSUPP);10281028+ return 0;10291029+ }10301030+10311031+ /*10321032+ * what we "blindly" assume:10331033+ */10341034+ D_ASSERT(bio->bi_size > 0);10351035+ D_ASSERT((bio->bi_size & 0x1ff) == 0);10361036+ D_ASSERT(bio->bi_idx == 0);10371037+10381038+ /* to make some things easier, force alignment of requests within the10391039+ * granularity of our hash tables */10401040+ s_enr = bio->bi_sector >> HT_SHIFT;10411041+ e_enr = (bio->bi_sector+(bio->bi_size>>9)-1) >> HT_SHIFT;10421042+10431043+ if (likely(s_enr == e_enr)) {10441044+ inc_ap_bio(mdev, 1);10451045+ return drbd_make_request_common(mdev, bio);10461046+ }10471047+10481048+ /* can this bio be split generically?10491049+ * Maybe add our own split-arbitrary-bios function. */10501050+ if (bio->bi_vcnt != 1 || bio->bi_idx != 0 || bio->bi_size > DRBD_MAX_SEGMENT_SIZE) {10511051+ /* rather error out here than BUG in bio_split */10521052+ dev_err(DEV, "bio would need to, but cannot, be split: "10531053+ "(vcnt=%u,idx=%u,size=%u,sector=%llu)\n",10541054+ bio->bi_vcnt, bio->bi_idx, bio->bi_size,10551055+ (unsigned long long)bio->bi_sector);10561056+ bio_endio(bio, -EINVAL);10571057+ } else {10581058+ /* This bio crosses some boundary, so we have to split it. */10591059+ struct bio_pair *bp;10601060+ /* works for the "do not cross hash slot boundaries" case10611061+ * e.g. 
sector 262269, size 409610621062+ * s_enr = 262269 >> 6 = 409710631063+ * e_enr = (262269+8-1) >> 6 = 409810641064+ * HT_SHIFT = 610651065+ * sps = 64, mask = 6310661066+ * first_sectors = 64 - (262269 & 63) = 310671067+ */10681068+ const sector_t sect = bio->bi_sector;10691069+ const int sps = 1 << HT_SHIFT; /* sectors per slot */10701070+ const int mask = sps - 1;10711071+ const sector_t first_sectors = sps - (sect & mask);10721072+ bp = bio_split(bio,10731073+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28)10741074+ bio_split_pool,10751075+#endif10761076+ first_sectors);10771077+10781078+ /* we need to get a "reference count" (ap_bio_cnt)10791079+ * to avoid races with the disconnect/reconnect/suspend code.10801080+ * In case we need to split the bio here, we need to get two references10811081+ * atomically, otherwise we might deadlock when trying to submit the10821082+ * second one! */10831083+ inc_ap_bio(mdev, 2);10841084+10851085+ D_ASSERT(e_enr == s_enr + 1);10861086+10871087+ drbd_make_request_common(mdev, &bp->bio1);10881088+ drbd_make_request_common(mdev, &bp->bio2);10891089+ bio_pair_release(bp);10901090+ }10911091+ return 0;10921092+}10931093+10941094+/* This is called by bio_add_page(). With this function we reduce10951095+ * the number of BIOs that span over multiple DRBD_MAX_SEGMENT_SIZEs10961096+ * units (was AL_EXTENTs).10971097+ *10981098+ * we do the calculation within the lower 32bit of the byte offsets,10991099+ * since we don't care for actual offset, but only check whether it11001100+ * would cross "activity log extent" boundaries.11011101+ *11021102+ * As long as the BIO is empty we have to allow at least one bvec,11031103+ * regardless of size and offset. so the resulting bio may still11041104+ * cross extent boundaries. 
those are dealt with (bio_split) in11051105+ * drbd_make_request_26.11061106+ */11071107+int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)11081108+{11091109+ struct drbd_conf *mdev = (struct drbd_conf *) q->queuedata;11101110+ unsigned int bio_offset =11111111+ (unsigned int)bvm->bi_sector << 9; /* 32 bit */11121112+ unsigned int bio_size = bvm->bi_size;11131113+ int limit, backing_limit;11141114+11151115+ limit = DRBD_MAX_SEGMENT_SIZE11161116+ - ((bio_offset & (DRBD_MAX_SEGMENT_SIZE-1)) + bio_size);11171117+ if (limit < 0)11181118+ limit = 0;11191119+ if (bio_size == 0) {11201120+ if (limit <= bvec->bv_len)11211121+ limit = bvec->bv_len;11221122+ } else if (limit && get_ldev(mdev)) {11231123+ struct request_queue * const b =11241124+ mdev->ldev->backing_bdev->bd_disk->queue;11251125+ if (b->merge_bvec_fn && mdev->ldev->dc.use_bmbv) {11261126+ backing_limit = b->merge_bvec_fn(b, bvm, bvec);11271127+ limit = min(limit, backing_limit);11281128+ }11291129+ put_ldev(mdev);11301130+ }11311131+ return limit;11321132+}
+327
drivers/block/drbd/drbd_req.h
···11+/*22+ drbd_req.h33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2006-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 2006-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.88+ Copyright (C) 2006-2008, Philipp Reisner <philipp.reisner@linbit.com>.99+1010+ DRBD is free software; you can redistribute it and/or modify1111+ it under the terms of the GNU General Public License as published by1212+ the Free Software Foundation; either version 2, or (at your option)1313+ any later version.1414+1515+ DRBD is distributed in the hope that it will be useful,1616+ but WITHOUT ANY WARRANTY; without even the implied warranty of1717+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the1818+ GNU General Public License for more details.1919+2020+ You should have received a copy of the GNU General Public License2121+ along with drbd; see the file COPYING. If not, write to2222+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2323+ */2424+2525+#ifndef _DRBD_REQ_H2626+#define _DRBD_REQ_H2727+2828+#include <linux/autoconf.h>2929+#include <linux/module.h>3030+3131+#include <linux/slab.h>3232+#include <linux/drbd.h>3333+#include "drbd_int.h"3434+#include "drbd_wrappers.h"3535+3636+/* The request callbacks will be called in irq context by the IDE drivers,3737+ and in Softirqs/Tasklets/BH context by the SCSI drivers,3838+ and by the receiver and worker in kernel-thread context.3939+ Try to get the locking right :) */4040+4141+/*4242+ * Objects of type struct drbd_request only exist on an R_PRIMARY node, and are4343+ * associated with IO requests originating from the block layer above us.4444+ *4545+ * There are quite a few things that may happen to a drbd request4646+ * during its lifetime.4747+ *4848+ * It will be created.4949+ * It will be marked with the intention to be5050+ * submitted to local disk and/or5151+ * sent via the network.5252+ *5353+ * It has to be placed on the transfer log and
other housekeeping lists,5454+ * in case we have a network connection.5555+ *5656+ * It may be identified as a concurrent (write) request5757+ * and be handled accordingly.5858+ *5959+ * It may be handed over to the local disk subsystem.6060+ * It may be completed by the local disk subsystem,6161+ * either successfully or with io-error.6262+ * In case it is a READ request, and it failed locally,6363+ * it may be retried remotely.6464+ *6565+ * It may be queued for sending.6666+ * It may be handed over to the network stack,6767+ * which may fail.6868+ * It may be acknowledged by the "peer" according to the wire_protocol in use.6969+ * this may be a negative ack.7070+ * It may receive a faked ack when the network connection is lost and the7171+ * transfer log is cleaned up.7272+ * Sending may be canceled due to network connection loss.7373+ * When it finally has outlived its time,7474+ * corresponding dirty bits in the resync-bitmap may be cleared or set,7575+ * it will be destroyed,7676+ * and completion will be signalled to the originator,7777+ * with or without "success".7878+ */7979+8080+enum drbd_req_event {8181+ created,8282+ to_be_send,8383+ to_be_submitted,8484+8585+ /* XXX yes, now I am inconsistent...8686+ * these two are not "events" but "actions"8787+ * oh, well... */8888+ queue_for_net_write,8989+ queue_for_net_read,9090+9191+ send_canceled,9292+ send_failed,9393+ handed_over_to_network,9494+ connection_lost_while_pending,9595+ recv_acked_by_peer,9696+ write_acked_by_peer,9797+ write_acked_by_peer_and_sis, /* and set_in_sync */9898+ conflict_discarded_by_peer,9999+ neg_acked,100100+ barrier_acked, /* in protocol A and B */101101+ data_received, /* (remote read) */102102+103103+ read_completed_with_error,104104+ read_ahead_completed_with_error,105105+ write_completed_with_error,106106+ completed_ok,107107+ nothing, /* for tracing only */108108+};109109+110110+/* encoding of request states for now.
we don't actually need that many bits.111111+ * we don't need to do atomic bit operations either, since most of the time we112112+ * need to look at the connection state and/or manipulate some lists at the113113+ * same time, so we should hold the request lock anyways.114114+ */115115+enum drbd_req_state_bits {116116+ /* 210117117+ * 000: no local possible118118+ * 001: to be submitted119119+ * UNUSED, we could map: 011: submitted, completion still pending120120+ * 110: completed ok121121+ * 010: completed with error122122+ */123123+ __RQ_LOCAL_PENDING,124124+ __RQ_LOCAL_COMPLETED,125125+ __RQ_LOCAL_OK,126126+127127+ /* 76543128128+ * 00000: no network possible129129+ * 00001: to be send130130+ * 00011: to be send, on worker queue131131+ * 00101: sent, expecting recv_ack (B) or write_ack (C)132132+ * 11101: sent,133133+ * recv_ack (B) or implicit "ack" (A),134134+ * still waiting for the barrier ack.135135+ * master_bio may already be completed and invalidated.136136+ * 11100: write_acked (C),137137+ * data_received (for remote read, any protocol)138138+ * or finally the barrier ack has arrived (B,A)...139139+ * request can be freed140140+ * 01100: neg-acked (write, protocol C)141141+ * or neg-d-acked (read, any protocol)142142+ * or killed from the transfer log143143+ * during cleanup after connection loss144144+ * request can be freed145145+ * 01000: canceled or send failed...146146+ * request can be freed147147+ */148148+149149+ /* if "SENT" is not set, yet, this can still fail or be canceled.150150+ * if "SENT" is set already, we still wait for an Ack packet.151151+ * when cleared, the master_bio may be completed.152152+ * in (B,A) the request object may still linger on the transaction log153153+ * until the corresponding barrier ack comes in */154154+ __RQ_NET_PENDING,155155+156156+ /* If it is QUEUED, and it is a WRITE, it is also registered in the157157+ * transfer log. 
Currently we need this flag to avoid conflicts between158158+ * worker canceling the request and tl_clear_barrier killing it from159159+ * transfer log. We should restructure the code so that this conflict160160+ * no longer occurs. */161161+ __RQ_NET_QUEUED,162162+163163+ /* well, actually only "handed over to the network stack".164164+ *165165+ * TODO can potentially be dropped because of the similar meaning166166+ * of RQ_NET_SENT and ~RQ_NET_QUEUED.167167+ * however it is not exactly the same. before we drop it168168+ * we must ensure that we can tell a request with network part169169+ * from a request without, regardless of what happens to it. */170170+ __RQ_NET_SENT,171171+172172+ /* when set, the request may be freed (if RQ_NET_QUEUED is clear).173173+ * basically this means the corresponding P_BARRIER_ACK was received */174174+ __RQ_NET_DONE,175175+176176+ /* whether or not we know (C) or pretend (B,A) that the write177177+ * was successfully written on the peer.178178+ */179179+ __RQ_NET_OK,180180+181181+ /* peer called drbd_set_in_sync() for this write */182182+ __RQ_NET_SIS,183183+184184+ /* keep this last, it's for the RQ_NET_MASK */185185+ __RQ_NET_MAX,186186+};187187+188188+#define RQ_LOCAL_PENDING (1UL << __RQ_LOCAL_PENDING)189189+#define RQ_LOCAL_COMPLETED (1UL << __RQ_LOCAL_COMPLETED)190190+#define RQ_LOCAL_OK (1UL << __RQ_LOCAL_OK)191191+192192+#define RQ_LOCAL_MASK ((RQ_LOCAL_OK << 1)-1) /* 0x07 */193193+194194+#define RQ_NET_PENDING (1UL << __RQ_NET_PENDING)195195+#define RQ_NET_QUEUED (1UL << __RQ_NET_QUEUED)196196+#define RQ_NET_SENT (1UL << __RQ_NET_SENT)197197+#define RQ_NET_DONE (1UL << __RQ_NET_DONE)198198+#define RQ_NET_OK (1UL << __RQ_NET_OK)199199+#define RQ_NET_SIS (1UL << __RQ_NET_SIS)200200+201201+/* 0x1f8 */202202+#define RQ_NET_MASK (((1UL << __RQ_NET_MAX)-1) & ~RQ_LOCAL_MASK)203203+204204+/* epoch entries */205205+static inline206206+struct hlist_head *ee_hash_slot(struct drbd_conf *mdev, sector_t sector)207207+{208208+
BUG_ON(mdev->ee_hash_s == 0);209209+ return mdev->ee_hash +210210+ ((unsigned int)(sector>>HT_SHIFT) % mdev->ee_hash_s);211211+}212212+213213+/* transfer log (drbd_request objects) */214214+static inline215215+struct hlist_head *tl_hash_slot(struct drbd_conf *mdev, sector_t sector)216216+{217217+ BUG_ON(mdev->tl_hash_s == 0);218218+ return mdev->tl_hash +219219+ ((unsigned int)(sector>>HT_SHIFT) % mdev->tl_hash_s);220220+}221221+222222+/* application reads (drbd_request objects) */223223+static struct hlist_head *ar_hash_slot(struct drbd_conf *mdev, sector_t sector)224224+{225225+ return mdev->app_reads_hash226226+ + ((unsigned int)(sector) % APP_R_HSIZE);227227+}228228+229229+/* when we receive the answer for a read request,230230+ * verify that we actually know about it */231231+static inline struct drbd_request *_ar_id_to_req(struct drbd_conf *mdev,232232+ u64 id, sector_t sector)233233+{234234+ struct hlist_head *slot = ar_hash_slot(mdev, sector);235235+ struct hlist_node *n;236236+ struct drbd_request *req;237237+238238+ hlist_for_each_entry(req, n, slot, colision) {239239+ if ((unsigned long)req == (unsigned long)id) {240240+ D_ASSERT(req->sector == sector);241241+ return req;242242+ }243243+ }244244+ return NULL;245245+}246246+247247+static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev,248248+ struct bio *bio_src)249249+{250250+ struct bio *bio;251251+ struct drbd_request *req =252252+ mempool_alloc(drbd_request_mempool, GFP_NOIO);253253+ if (likely(req)) {254254+ bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? 
*/255255+256256+ req->rq_state = 0;257257+ req->mdev = mdev;258258+ req->master_bio = bio_src;259259+ req->private_bio = bio;260260+ req->epoch = 0;261261+ req->sector = bio->bi_sector;262262+ req->size = bio->bi_size;263263+ req->start_time = jiffies;264264+ INIT_HLIST_NODE(&req->colision);265265+ INIT_LIST_HEAD(&req->tl_requests);266266+ INIT_LIST_HEAD(&req->w.list);267267+268268+ bio->bi_private = req;269269+ bio->bi_end_io = drbd_endio_pri;270270+ bio->bi_next = NULL;271271+ }272272+ return req;273273+}274274+275275+static inline void drbd_req_free(struct drbd_request *req)276276+{277277+ mempool_free(req, drbd_request_mempool);278278+}279279+280280+static inline int overlaps(sector_t s1, int l1, sector_t s2, int l2)281281+{282282+ return !((s1 + (l1>>9) <= s2) || (s1 >= s2 + (l2>>9)));283283+}284284+285285+/* Short lived temporary struct on the stack.286286+ * We could squirrel the error to be returned into287287+ * bio->bi_size, or similar. But that would be too ugly. */288288+struct bio_and_error {289289+ struct bio *bio;290290+ int error;291291+};292292+293293+extern void _req_may_be_done(struct drbd_request *req,294294+ struct bio_and_error *m);295295+extern void __req_mod(struct drbd_request *req, enum drbd_req_event what,296296+ struct bio_and_error *m);297297+extern void complete_master_bio(struct drbd_conf *mdev,298298+ struct bio_and_error *m);299299+300300+/* use this if you don't want to deal with calling complete_master_bio()301301+ * outside the spinlock, e.g. when walking some list on cleanup. */302302+static inline void _req_mod(struct drbd_request *req, enum drbd_req_event what)303303+{304304+ struct drbd_conf *mdev = req->mdev;305305+ struct bio_and_error m;306306+307307+ /* __req_mod possibly frees req, do not touch req after that! */308308+ __req_mod(req, what, &m);309309+ if (m.bio)310310+ complete_master_bio(mdev, &m);311311+}312312+313313+/* completion of master bio is outside of spinlock.314314+ * If you need it irqsave, do it yourself!
*/315315+static inline void req_mod(struct drbd_request *req,316316+ enum drbd_req_event what)317317+{318318+ struct drbd_conf *mdev = req->mdev;319319+ struct bio_and_error m;320320+ spin_lock_irq(&mdev->req_lock);321321+ __req_mod(req, what, &m);322322+ spin_unlock_irq(&mdev->req_lock);323323+324324+ if (m.bio)325325+ complete_master_bio(mdev, &m);326326+}327327+#endif
+113
drivers/block/drbd/drbd_strings.c
···11+/*22+ drbd.h33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.88+ Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.99+1010+ drbd is free software; you can redistribute it and/or modify1111+ it under the terms of the GNU General Public License as published by1212+ the Free Software Foundation; either version 2, or (at your option)1313+ any later version.1414+1515+ drbd is distributed in the hope that it will be useful,1616+ but WITHOUT ANY WARRANTY; without even the implied warranty of1717+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the1818+ GNU General Public License for more details.1919+2020+ You should have received a copy of the GNU General Public License2121+ along with drbd; see the file COPYING. If not, write to2222+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2323+2424+*/2525+2626+#include <linux/drbd.h>2727+2828+static const char *drbd_conn_s_names[] = {2929+ [C_STANDALONE] = "StandAlone",3030+ [C_DISCONNECTING] = "Disconnecting",3131+ [C_UNCONNECTED] = "Unconnected",3232+ [C_TIMEOUT] = "Timeout",3333+ [C_BROKEN_PIPE] = "BrokenPipe",3434+ [C_NETWORK_FAILURE] = "NetworkFailure",3535+ [C_PROTOCOL_ERROR] = "ProtocolError",3636+ [C_WF_CONNECTION] = "WFConnection",3737+ [C_WF_REPORT_PARAMS] = "WFReportParams",3838+ [C_TEAR_DOWN] = "TearDown",3939+ [C_CONNECTED] = "Connected",4040+ [C_STARTING_SYNC_S] = "StartingSyncS",4141+ [C_STARTING_SYNC_T] = "StartingSyncT",4242+ [C_WF_BITMAP_S] = "WFBitMapS",4343+ [C_WF_BITMAP_T] = "WFBitMapT",4444+ [C_WF_SYNC_UUID] = "WFSyncUUID",4545+ [C_SYNC_SOURCE] = "SyncSource",4646+ [C_SYNC_TARGET] = "SyncTarget",4747+ [C_PAUSED_SYNC_S] = "PausedSyncS",4848+ [C_PAUSED_SYNC_T] = "PausedSyncT",4949+ [C_VERIFY_S] = "VerifyS",5050+ [C_VERIFY_T] = "VerifyT",5151+};5252+5353+static const char *drbd_role_s_names[] = 
{5454+ [R_PRIMARY] = "Primary",5555+ [R_SECONDARY] = "Secondary",5656+ [R_UNKNOWN] = "Unknown"5757+};5858+5959+static const char *drbd_disk_s_names[] = {6060+ [D_DISKLESS] = "Diskless",6161+ [D_ATTACHING] = "Attaching",6262+ [D_FAILED] = "Failed",6363+ [D_NEGOTIATING] = "Negotiating",6464+ [D_INCONSISTENT] = "Inconsistent",6565+ [D_OUTDATED] = "Outdated",6666+ [D_UNKNOWN] = "DUnknown",6767+ [D_CONSISTENT] = "Consistent",6868+ [D_UP_TO_DATE] = "UpToDate",6969+};7070+7171+static const char *drbd_state_sw_errors[] = {7272+ [-SS_TWO_PRIMARIES] = "Multiple primaries not allowed by config",7373+ [-SS_NO_UP_TO_DATE_DISK] = "Refusing to be Primary without at least one UpToDate disk",7474+ [-SS_NO_LOCAL_DISK] = "Can not resync without local disk",7575+ [-SS_NO_REMOTE_DISK] = "Can not resync without remote disk",7676+ [-SS_CONNECTED_OUTDATES] = "Refusing to be Outdated while Connected",7777+ [-SS_PRIMARY_NOP] = "Refusing to be Primary while peer is not outdated",7878+ [-SS_RESYNC_RUNNING] = "Can not start OV/resync since it is already active",7979+ [-SS_ALREADY_STANDALONE] = "Can not disconnect a StandAlone device",8080+ [-SS_CW_FAILED_BY_PEER] = "State change was refused by peer node",8181+ [-SS_IS_DISKLESS] = "Device is diskless, the requested operation requires a disk",8282+ [-SS_DEVICE_IN_USE] = "Device is held open by someone",8383+ [-SS_NO_NET_CONFIG] = "Have no net/connection configuration",8484+ [-SS_NO_VERIFY_ALG] = "Need a verify algorithm to start online verify",8585+ [-SS_NEED_CONNECTION] = "Need a connection to start verify or resync",8686+ [-SS_NOT_SUPPORTED] = "Peer does not support protocol",8787+ [-SS_LOWER_THAN_OUTDATED] = "Disk state is lower than outdated",8888+ [-SS_IN_TRANSIENT_STATE] = "In transient state, retry after next state change",8989+ [-SS_CONCURRENT_ST_CHG] = "Concurrent state changes detected and aborted",9090+};9191+9292+const char *drbd_conn_str(enum drbd_conns s)9393+{9494+ /* enums are unsigned... */9595+ return s > C_PAUSED_SYNC_T ? 
"TOO_LARGE" : drbd_conn_s_names[s];9696+}9797+9898+const char *drbd_role_str(enum drbd_role s)9999+{100100+ return s > R_SECONDARY ? "TOO_LARGE" : drbd_role_s_names[s];101101+}102102+103103+const char *drbd_disk_str(enum drbd_disk_state s)104104+{105105+ return s > D_UP_TO_DATE ? "TOO_LARGE" : drbd_disk_s_names[s];106106+}107107+108108+const char *drbd_set_st_err_str(enum drbd_state_ret_codes err)109109+{110110+ return err <= SS_AFTER_LAST_ERROR ? "TOO_SMALL" :111111+ err > SS_TWO_PRIMARIES ? "TOO_LARGE"112112+ : drbd_state_sw_errors[-err];113113+}
+752
drivers/block/drbd/drbd_tracing.c
···11+/*22+ drbd_tracing.c33+44+ This file is part of DRBD by Philipp Reisner and Lars Ellenberg.55+66+ Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.77+ Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.88+ Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.99+1010+ drbd is free software; you can redistribute it and/or modify1111+ it under the terms of the GNU General Public License as published by1212+ the Free Software Foundation; either version 2, or (at your option)1313+ any later version.1414+1515+ drbd is distributed in the hope that it will be useful,1616+ but WITHOUT ANY WARRANTY; without even the implied warranty of1717+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the1818+ GNU General Public License for more details.1919+2020+ You should have received a copy of the GNU General Public License2121+ along with drbd; see the file COPYING. If not, write to2222+ the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.2323+2424+ */2525+2626+#include <linux/module.h>2727+#include <linux/drbd.h>2828+#include <linux/ctype.h>2929+#include "drbd_int.h"3030+#include "drbd_tracing.h"3131+#include <linux/drbd_tag_magic.h>3232+3333+MODULE_LICENSE("GPL");3434+MODULE_AUTHOR("Philipp Reisner, Lars Ellenberg");3535+MODULE_DESCRIPTION("DRBD tracepoint probes");3636+MODULE_PARM_DESC(trace_mask, "Bitmap of events to trace see drbd_tracing.c");3737+MODULE_PARM_DESC(trace_level, "Current tracing level (changeable in /sys)");3838+MODULE_PARM_DESC(trace_devs, "Bitmap of devices to trace (changeable in /sys)");3939+4040+unsigned int trace_mask = 0; /* Bitmap of events to trace */4141+int trace_level; /* Current trace level */4242+int trace_devs; /* Bitmap of devices to trace */4343+4444+module_param(trace_mask, uint, 0444);4545+module_param(trace_level, int, 0644);4646+module_param(trace_devs, int, 0644);4747+4848+enum {4949+ TRACE_PACKET = 0x0001,5050+ TRACE_RQ = 0x0002,5151+ TRACE_UUID = 0x0004,5252+ 
TRACE_RESYNC = 0x0008,5353+ TRACE_EE = 0x0010,5454+ TRACE_UNPLUG = 0x0020,5555+ TRACE_NL = 0x0040,5656+ TRACE_AL_EXT = 0x0080,5757+ TRACE_INT_RQ = 0x0100,5858+ TRACE_MD_IO = 0x0200,5959+ TRACE_EPOCH = 0x0400,6060+};6161+6262+/* Buffer printing support6363+ * dbg_print_flags: used for Flags arg to drbd_print_buffer6464+ * - DBGPRINT_BUFFADDR; if set, each line starts with the6565+ * virtual address of the line being output. If clear,6666+ * each line starts with the offset from the beginning6767+ * of the buffer. */6868+enum dbg_print_flags {6969+ DBGPRINT_BUFFADDR = 0x0001,7070+};7171+7272+/* Macro stuff */7373+static char *nl_packet_name(int packet_type)7474+{7575+/* Generate packet type strings */7676+#define NL_PACKET(name, number, fields) \7777+ [P_ ## name] = # name,7878+#define NL_INTEGER Argh!7979+#define NL_BIT Argh!8080+#define NL_INT64 Argh!8181+#define NL_STRING Argh!8282+8383+ static char *nl_tag_name[P_nl_after_last_packet] = {8484+#include "linux/drbd_nl.h"8585+ };8686+8787+ return (packet_type < sizeof(nl_tag_name)/sizeof(nl_tag_name[0])) ?8888+ nl_tag_name[packet_type] : "*Unknown*";8989+}9090+/* /Macro stuff */9191+9292+static inline int is_mdev_trace(struct drbd_conf *mdev, unsigned int level)9393+{9494+ return trace_level >= level && ((1 << mdev_to_minor(mdev)) & trace_devs);9595+}9696+9797+static void probe_drbd_unplug(struct drbd_conf *mdev, char *msg)9898+{9999+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))100100+ return;101101+102102+ dev_info(DEV, "%s, ap_bio_count=%d\n", msg, atomic_read(&mdev->ap_bio_cnt));103103+}104104+105105+static void probe_drbd_uuid(struct drbd_conf *mdev, enum drbd_uuid_index index)106106+{107107+ static char *uuid_str[UI_EXTENDED_SIZE] = {108108+ [UI_CURRENT] = "CURRENT",109109+ [UI_BITMAP] = "BITMAP",110110+ [UI_HISTORY_START] = "HISTORY_START",111111+ [UI_HISTORY_END] = "HISTORY_END",112112+ [UI_SIZE] = "SIZE",113113+ [UI_FLAGS] = "FLAGS",114114+ };115115+116116+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))117117+ 
return;118118+119119+ if (index >= UI_EXTENDED_SIZE) {120120+ dev_warn(DEV, " uuid_index >= EXTENDED_SIZE\n");121121+ return;122122+ }123123+124124+ dev_info(DEV, " uuid[%s] now %016llX\n",125125+ uuid_str[index],126126+ (unsigned long long)mdev->ldev->md.uuid[index]);127127+}128128+129129+static void probe_drbd_md_io(struct drbd_conf *mdev, int rw,130130+ struct drbd_backing_dev *bdev)131131+{132132+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))133133+ return;134134+135135+ dev_info(DEV, " %s metadata superblock now\n",136136+ rw == READ ? "Reading" : "Writing");137137+}138138+139139+static void probe_drbd_ee(struct drbd_conf *mdev, struct drbd_epoch_entry *e, char* msg)140140+{141141+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))142142+ return;143143+144144+ dev_info(DEV, "EE %s sec=%llus size=%u e=%p\n",145145+ msg, (unsigned long long)e->sector, e->size, e);146146+}147147+148148+static void probe_drbd_epoch(struct drbd_conf *mdev, struct drbd_epoch *epoch,149149+ enum epoch_event ev)150150+{151151+ static char *epoch_event_str[] = {152152+ [EV_PUT] = "put",153153+ [EV_GOT_BARRIER_NR] = "got_barrier_nr",154154+ [EV_BARRIER_DONE] = "barrier_done",155155+ [EV_BECAME_LAST] = "became_last",156156+ [EV_TRACE_FLUSH] = "issuing_flush",157157+ [EV_TRACE_ADD_BARRIER] = "added_barrier",158158+ [EV_TRACE_SETTING_BI] = "just set barrier_in_next_epoch",159159+ };160160+161161+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))162162+ return;163163+164164+ ev &= ~EV_CLEANUP;165165+166166+ switch (ev) {167167+ case EV_TRACE_ALLOC:168168+ dev_info(DEV, "Allocate epoch %p/xxxx { } nr_epochs=%d\n", epoch, mdev->epochs);169169+ break;170170+ case EV_TRACE_FREE:171171+ dev_info(DEV, "Freeing epoch %p/%d { size=%d } nr_epochs=%d\n",172172+ epoch, epoch->barrier_nr, atomic_read(&epoch->epoch_size),173173+ mdev->epochs);174174+ break;175175+ default:176176+ dev_info(DEV, "Update epoch %p/%d { size=%d active=%d %c%c n%c%c } ev=%s\n",177177+ epoch, epoch->barrier_nr, 
atomic_read(&epoch->epoch_size),178178+ atomic_read(&epoch->active),179179+ test_bit(DE_HAVE_BARRIER_NUMBER, &epoch->flags) ? 'n' : '-',180180+ test_bit(DE_CONTAINS_A_BARRIER, &epoch->flags) ? 'b' : '-',181181+ test_bit(DE_BARRIER_IN_NEXT_EPOCH_ISSUED, &epoch->flags) ? 'i' : '-',182182+ test_bit(DE_BARRIER_IN_NEXT_EPOCH_DONE, &epoch->flags) ? 'd' : '-',183183+ epoch_event_str[ev]);184184+ }185185+}186186+187187+static void probe_drbd_netlink(void *data, int is_req)188188+{189189+ struct cn_msg *msg = data;190190+191191+ if (is_req) {192192+ struct drbd_nl_cfg_req *nlp = (struct drbd_nl_cfg_req *)msg->data;193193+194194+ printk(KERN_INFO "drbd%d: "195195+ "Netlink: << %s (%d) - seq: %x, ack: %x, len: %x\n",196196+ nlp->drbd_minor,197197+ nl_packet_name(nlp->packet_type),198198+ nlp->packet_type,199199+ msg->seq, msg->ack, msg->len);200200+ } else {201201+ struct drbd_nl_cfg_reply *nlp = (struct drbd_nl_cfg_reply *)msg->data;202202+203203+ printk(KERN_INFO "drbd%d: "204204+ "Netlink: >> %s (%d) - seq: %x, ack: %x, len: %x\n",205205+ nlp->minor,206206+ nlp->packet_type == P_nl_after_last_packet ?207207+ "Empty-Reply" : nl_packet_name(nlp->packet_type),208208+ nlp->packet_type,209209+ msg->seq, msg->ack, msg->len);210210+ }211211+}212212+213213+static void probe_drbd_actlog(struct drbd_conf *mdev, sector_t sector, char* msg)214214+{215215+ unsigned int enr = (sector >> (AL_EXTENT_SHIFT-9));216216+217217+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))218218+ return;219219+220220+ dev_info(DEV, "%s (sec=%llus, al_enr=%u, rs_enr=%d)\n",221221+ msg, (unsigned long long) sector, enr,222222+ (int)BM_SECT_TO_EXT(sector));223223+}224224+225225+/**226226+ * drbd_print_buffer() - Hexdump arbitrary binary data into a buffer227227+ * @prefix: String is output at the beginning of each line output.228228+ * @flags: Currently only defined flag: DBGPRINT_BUFFADDR; if set, each229229+ * line starts with the virtual address of the line being230230+ * output. 
If clear, each line starts with the offset from the231231+ * beginning of the buffer.232232+ * @size: Indicates the size of each entry in the buffer. Supported233233+ * values are sizeof(char), sizeof(short) and sizeof(int)234234+ * @buffer: Start address of buffer235235+ * @buffer_va: Virtual address of start of buffer (normally the same236236+ * as Buffer, but having it separate allows it to hold237237+ * file address for example)238238+ * @length: length of buffer239239+ */240240+static void drbd_print_buffer(const char *prefix, unsigned int flags, int size,241241+ const void *buffer, const void *buffer_va,242242+ unsigned int length)243243+244244+#define LINE_SIZE 16245245+#define LINE_ENTRIES (int)(LINE_SIZE/size)246246+{247247+ const unsigned char *pstart;248248+ const unsigned char *pstart_va;249249+ const unsigned char *pend;250250+ char bytes_str[LINE_SIZE*3+8], ascii_str[LINE_SIZE+8];251251+ char *pbytes = bytes_str, *pascii = ascii_str;252252+ int offset = 0;253253+ long sizemask;254254+ int field_width;255255+ int index;256256+ const unsigned char *pend_str;257257+ const unsigned char *p;258258+ int count;259259+260260+ /* verify size parameter */261261+ if (size != sizeof(char) &&262262+ size != sizeof(short) &&263263+ size != sizeof(int)) {264264+ printk(KERN_DEBUG "drbd_print_buffer: "265265+ "ERROR invalid size %d\n", size);266266+ return;267267+ }268268+269269+ sizemask = size-1;270270+ field_width = size*2;271271+272272+ /* Adjust start/end to be on appropriate boundary for size */273273+ buffer = (const char *)((long)buffer & ~sizemask);274274+ pend = (const unsigned char *)275275+ (((long)buffer + length + sizemask) & ~sizemask);276276+277277+ if (flags & DBGPRINT_BUFFADDR) {278278+ /* Move start back to nearest multiple of line size,279279+ * if printing address. 
This results in nicely formatted output280280+ * with addresses being on line size (16) byte boundaries */281281+ pstart = (const unsigned char *)((long)buffer & ~(LINE_SIZE-1));282282+ } else {283283+ pstart = (const unsigned char *)buffer;284284+ }285285+286286+ /* Set value of start VA to print if addresses asked for */287287+ pstart_va = (const unsigned char *)buffer_va288288+ - ((const unsigned char *)buffer-pstart);289289+290290+ /* Calculate end position to nicely align right hand side */291291+ pend_str = pstart + (((pend-pstart) + LINE_SIZE-1) & ~(LINE_SIZE-1));292292+293293+ /* Init strings */294294+ *pbytes = *pascii = '\0';295295+296296+ /* Start at beginning of first line */297297+ p = pstart;298298+ count = 0;299299+300300+ while (p < pend_str) {301301+ if (p < (const unsigned char *)buffer || p >= pend) {302302+ /* Before start of buffer or after end- print spaces */303303+ pbytes += sprintf(pbytes, "%*c ", field_width, ' ');304304+ pascii += sprintf(pascii, "%*c", size, ' ');305305+ p += size;306306+ } else {307307+ /* Add hex and ascii to strings */308308+ int val;309309+ switch (size) {310310+ default:311311+ case 1:312312+ val = *(unsigned char *)p;313313+ break;314314+ case 2:315315+ val = *(unsigned short *)p;316316+ break;317317+ case 4:318318+ val = *(unsigned int *)p;319319+ break;320320+ }321321+322322+ pbytes += sprintf(pbytes, "%0*x ", field_width, val);323323+324324+ for (index = size; index; index--) {325325+ *pascii++ = isprint(*p) ? *p : '.';326326+ p++;327327+ }328328+ }329329+330330+ count++;331331+332332+ if (count == LINE_ENTRIES || p >= pend_str) {333333+ /* Null terminate and print record */334334+ *pascii = '\0';335335+ printk(KERN_DEBUG "%s%8.8lx: %*s|%*s|\n",336336+ prefix,337337+ (flags & DBGPRINT_BUFFADDR)338338+ ? 
(long)pstart_va:(long)offset,339339+ LINE_ENTRIES*(field_width+1), bytes_str,340340+ LINE_SIZE, ascii_str);341341+342342+ /* Move onto next line */343343+ pstart_va += (p-pstart);344344+ pstart = p;345345+ count = 0;346346+ offset += LINE_SIZE;347347+348348+ /* Re-init strings */349349+ pbytes = bytes_str;350350+ pascii = ascii_str;351351+ *pbytes = *pascii = '\0';352352+ }353353+ }354354+}355355+356356+static void probe_drbd_resync(struct drbd_conf *mdev, int level, const char *fmt, va_list args)357357+{358358+ char str[256];359359+360360+ if (!is_mdev_trace(mdev, level))361361+ return;362362+363363+ if (vsnprintf(str, 256, fmt, args) >= 256)364364+ str[255] = 0;365365+366366+ printk(KERN_INFO "%s %s: %s", dev_driver_string(disk_to_dev(mdev->vdisk)),367367+ dev_name(disk_to_dev(mdev->vdisk)), str);368368+}369369+370370+static void probe_drbd_bio(struct drbd_conf *mdev, const char *pfx, struct bio *bio, int complete,371371+ struct drbd_request *r)372372+{373373+#if defined(CONFIG_LBDAF) || defined(CONFIG_LBD)374374+#define SECTOR_FORMAT "%Lx"375375+#else376376+#define SECTOR_FORMAT "%lx"377377+#endif378378+#define SECTOR_SHIFT 9379379+380380+ unsigned long lowaddr = (unsigned long)(bio->bi_sector << SECTOR_SHIFT);381381+ char *faddr = (char *)(lowaddr);382382+ char rb[sizeof(void *)*2+6] = { 0, };383383+ struct bio_vec *bvec;384384+ int segno;385385+386386+ const int rw = bio->bi_rw;387387+ const int biorw = (rw & (RW_MASK|RWA_MASK));388388+ const int biobarrier = (rw & (1<<BIO_RW_BARRIER));389389+ const int biosync = (rw & ((1<<BIO_RW_UNPLUG) | (1<<BIO_RW_SYNCIO)));390390+391391+ if (!is_mdev_trace(mdev, TRACE_LVL_ALWAYS))392392+ return;393393+394394+ if (r)395395+ sprintf(rb, "Req:%p ", r);396396+397397+ dev_info(DEV, "%s %s:%s%s%s Bio:%p %s- %soffset " SECTOR_FORMAT ", size %x\n",398398+ complete ? "<<<" : ">>>",399399+ pfx,400400+ biorw == WRITE ? "Write" : "Read",401401+ biobarrier ? " : B" : "",402402+ biosync ? 
" : S" : "",403403+ bio,404404+ rb,405405+ complete ? (bio_flagged(bio, BIO_UPTODATE) ? "Success, " : "Failed, ") : "",406406+ bio->bi_sector << SECTOR_SHIFT,407407+ bio->bi_size);408408+409409+ if (trace_level >= TRACE_LVL_METRICS &&410410+ ((biorw == WRITE) ^ complete)) {411411+ printk(KERN_DEBUG " ind page offset length\n");412412+ __bio_for_each_segment(bvec, bio, segno, 0) {413413+ printk(KERN_DEBUG " [%d] %p %8.8x %8.8x\n", segno,414414+ bvec->bv_page, bvec->bv_offset, bvec->bv_len);415415+416416+ if (trace_level >= TRACE_LVL_ALL) {417417+ char *bvec_buf;418418+ unsigned long flags;419419+420420+ bvec_buf = bvec_kmap_irq(bvec, &flags);421421+422422+ drbd_print_buffer(" ", DBGPRINT_BUFFADDR, 1,423423+ bvec_buf,424424+ faddr,425425+ (bvec->bv_len <= 0x80)426426+ ? bvec->bv_len : 0x80);427427+428428+ bvec_kunmap_irq(bvec_buf, &flags);429429+430430+ if (bvec->bv_len > 0x40)431431+ printk(KERN_DEBUG " ....\n");432432+433433+ faddr += bvec->bv_len;434434+ }435435+ }436436+ }437437+}438438+439439+static void probe_drbd_req(struct drbd_request *req, enum drbd_req_event what, char *msg)440440+{441441+ static const char *rq_event_names[] = {442442+ [created] = "created",443443+ [to_be_send] = "to_be_send",444444+ [to_be_submitted] = "to_be_submitted",445445+ [queue_for_net_write] = "queue_for_net_write",446446+ [queue_for_net_read] = "queue_for_net_read",447447+ [send_canceled] = "send_canceled",448448+ [send_failed] = "send_failed",449449+ [handed_over_to_network] = "handed_over_to_network",450450+ [connection_lost_while_pending] =451451+ "connection_lost_while_pending",452452+ [recv_acked_by_peer] = "recv_acked_by_peer",453453+ [write_acked_by_peer] = "write_acked_by_peer",454454+ [neg_acked] = "neg_acked",455455+ [conflict_discarded_by_peer] = "conflict_discarded_by_peer",456456+ [barrier_acked] = "barrier_acked",457457+ [data_received] = "data_received",458458+ [read_completed_with_error] = "read_completed_with_error",459459+ [read_ahead_completed_with_error] = 
"reada_completed_with_error",460460+ [write_completed_with_error] = "write_completed_with_error",461461+ [completed_ok] = "completed_ok",462462+ };463463+464464+ struct drbd_conf *mdev = req->mdev;465465+466466+ const int rw = (req->master_bio == NULL ||467467+ bio_data_dir(req->master_bio) == WRITE) ?468468+ 'W' : 'R';469469+ const unsigned long s = req->rq_state;470470+471471+ if (what != nothing) {472472+ dev_info(DEV, "__req_mod(%p %c ,%s)\n", req, rw, rq_event_names[what]);473473+ } else {474474+ dev_info(DEV, "%s %p %c L%c%c%cN%c%c%c%c%c %u (%llus +%u) %s\n",475475+ msg, req, rw,476476+ s & RQ_LOCAL_PENDING ? 'p' : '-',477477+ s & RQ_LOCAL_COMPLETED ? 'c' : '-',478478+ s & RQ_LOCAL_OK ? 'o' : '-',479479+ s & RQ_NET_PENDING ? 'p' : '-',480480+ s & RQ_NET_QUEUED ? 'q' : '-',481481+ s & RQ_NET_SENT ? 's' : '-',482482+ s & RQ_NET_DONE ? 'd' : '-',483483+ s & RQ_NET_OK ? 'o' : '-',484484+ req->epoch,485485+ (unsigned long long)req->sector,486486+ req->size,487487+ drbd_conn_str(mdev->state.conn));488488+ }489489+}490490+491491+492492+#define drbd_peer_str drbd_role_str493493+#define drbd_pdsk_str drbd_disk_str494494+495495+#define PSM(A) \496496+do { \497497+ if (mask.A) { \498498+ int i = snprintf(p, len, " " #A "( %s )", \499499+ drbd_##A##_str(val.A)); \500500+ if (i >= len) \501501+ return op; \502502+ p += i; \503503+ len -= i; \504504+ } \505505+} while (0)506506+507507+static char *dump_st(char *p, int len, union drbd_state mask, union drbd_state val)508508+{509509+ char *op = p;510510+ *p = '\0';511511+ PSM(role);512512+ PSM(peer);513513+ PSM(conn);514514+ PSM(disk);515515+ PSM(pdsk);516516+517517+ return op;518518+}519519+520520+#define INFOP(fmt, args...) \521521+do { \522522+ if (trace_level >= TRACE_LVL_ALL) { \523523+ dev_info(DEV, "%s:%d: %s [%d] %s %s " fmt , \524524+ file, line, current->comm, current->pid, \525525+ sockname, recv ? "<<<" : ">>>" , \526526+ ## args); \527527+ } else { \528528+ dev_info(DEV, "%s %s " fmt, sockname, \529529+ recv ? 
"<<<" : ">>>" , \530530+ ## args); \531531+ } \532532+} while (0)533533+534534+static char *_dump_block_id(u64 block_id, char *buff)535535+{536536+ if (is_syncer_block_id(block_id))537537+ strcpy(buff, "SyncerId");538538+ else539539+ sprintf(buff, "%llx", (unsigned long long)block_id);540540+541541+ return buff;542542+}543543+544544+static void probe_drbd_packet(struct drbd_conf *mdev, struct socket *sock,545545+ int recv, union p_polymorph *p, char *file, int line)546546+{547547+ char *sockname = sock == mdev->meta.socket ? "meta" : "data";548548+ int cmd = (recv == 2) ? p->header.command : be16_to_cpu(p->header.command);549549+ char tmp[300];550550+ union drbd_state m, v;551551+552552+ switch (cmd) {553553+ case P_HAND_SHAKE:554554+ INFOP("%s (protocol %u-%u)\n", cmdname(cmd),555555+ be32_to_cpu(p->handshake.protocol_min),556556+ be32_to_cpu(p->handshake.protocol_max));557557+ break;558558+559559+ case P_BITMAP: /* don't report this */560560+ case P_COMPRESSED_BITMAP: /* don't report this */561561+ break;562562+563563+ case P_DATA:564564+ INFOP("%s (sector %llus, id %s, seq %u, f %x)\n", cmdname(cmd),565565+ (unsigned long long)be64_to_cpu(p->data.sector),566566+ _dump_block_id(p->data.block_id, tmp),567567+ be32_to_cpu(p->data.seq_num),568568+ be32_to_cpu(p->data.dp_flags)569569+ );570570+ break;571571+572572+ case P_DATA_REPLY:573573+ case P_RS_DATA_REPLY:574574+ INFOP("%s (sector %llus, id %s)\n", cmdname(cmd),575575+ (unsigned long long)be64_to_cpu(p->data.sector),576576+ _dump_block_id(p->data.block_id, tmp)577577+ );578578+ break;579579+580580+ case P_RECV_ACK:581581+ case P_WRITE_ACK:582582+ case P_RS_WRITE_ACK:583583+ case P_DISCARD_ACK:584584+ case P_NEG_ACK:585585+ case P_NEG_RS_DREPLY:586586+ INFOP("%s (sector %llus, size %u, id %s, seq %u)\n",587587+ cmdname(cmd),588588+ (long long)be64_to_cpu(p->block_ack.sector),589589+ be32_to_cpu(p->block_ack.blksize),590590+ _dump_block_id(p->block_ack.block_id, tmp),591591+ 
be32_to_cpu(p->block_ack.seq_num)592592+ );593593+ break;594594+595595+ case P_DATA_REQUEST:596596+ case P_RS_DATA_REQUEST:597597+ INFOP("%s (sector %llus, size %u, id %s)\n", cmdname(cmd),598598+ (long long)be64_to_cpu(p->block_req.sector),599599+ be32_to_cpu(p->block_req.blksize),600600+ _dump_block_id(p->block_req.block_id, tmp)601601+ );602602+ break;603603+604604+ case P_BARRIER:605605+ case P_BARRIER_ACK:606606+ INFOP("%s (barrier %u)\n", cmdname(cmd), p->barrier.barrier);607607+ break;608608+609609+ case P_SYNC_PARAM:610610+ case P_SYNC_PARAM89:611611+ INFOP("%s (rate %u, verify-alg \"%.64s\", csums-alg \"%.64s\")\n",612612+ cmdname(cmd), be32_to_cpu(p->rs_param_89.rate),613613+ p->rs_param_89.verify_alg, p->rs_param_89.csums_alg);614614+ break;615615+616616+ case P_UUIDS:617617+ INFOP("%s Curr:%016llX, Bitmap:%016llX, "618618+ "HisSt:%016llX, HisEnd:%016llX\n",619619+ cmdname(cmd),620620+ (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_CURRENT]),621621+ (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_BITMAP]),622622+ (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_HISTORY_START]),623623+ (unsigned long long)be64_to_cpu(p->uuids.uuid[UI_HISTORY_END]));624624+ break;625625+626626+ case P_SIZES:627627+ INFOP("%s (d %lluMiB, u %lluMiB, c %lldMiB, "628628+ "max bio %x, q order %x)\n",629629+ cmdname(cmd),630630+ (long long)(be64_to_cpu(p->sizes.d_size)>>(20-9)),631631+ (long long)(be64_to_cpu(p->sizes.u_size)>>(20-9)),632632+ (long long)(be64_to_cpu(p->sizes.c_size)>>(20-9)),633633+ be32_to_cpu(p->sizes.max_segment_size),634634+ be32_to_cpu(p->sizes.queue_order_type));635635+ break;636636+637637+ case P_STATE:638638+ v.i = be32_to_cpu(p->state.state);639639+ m.i = 0xffffffff;640640+ dump_st(tmp, sizeof(tmp), m, v);641641+ INFOP("%s (s %x {%s})\n", cmdname(cmd), v.i, tmp);642642+ break;643643+644644+ case P_STATE_CHG_REQ:645645+ m.i = be32_to_cpu(p->req_state.mask);646646+ v.i = be32_to_cpu(p->req_state.val);647647+ dump_st(tmp, sizeof(tmp), m, v);648648+ 
INFOP("%s (m %x v %x {%s})\n", cmdname(cmd), m.i, v.i, tmp);649649+ break;650650+651651+ case P_STATE_CHG_REPLY:652652+ INFOP("%s (ret %x)\n", cmdname(cmd),653653+ be32_to_cpu(p->req_state_reply.retcode));654654+ break;655655+656656+ case P_PING:657657+ case P_PING_ACK:658658+ /*659659+ * Dont trace pings at summary level660660+ */661661+ if (trace_level < TRACE_LVL_ALL)662662+ break;663663+ /* fall through... */664664+ default:665665+ INFOP("%s (%u)\n", cmdname(cmd), cmd);666666+ break;667667+ }668668+}669669+670670+671671+static int __init drbd_trace_init(void)672672+{673673+ int ret;674674+675675+ if (trace_mask & TRACE_UNPLUG) {676676+ ret = register_trace_drbd_unplug(probe_drbd_unplug);677677+ WARN_ON(ret);678678+ }679679+ if (trace_mask & TRACE_UUID) {680680+ ret = register_trace_drbd_uuid(probe_drbd_uuid);681681+ WARN_ON(ret);682682+ }683683+ if (trace_mask & TRACE_EE) {684684+ ret = register_trace_drbd_ee(probe_drbd_ee);685685+ WARN_ON(ret);686686+ }687687+ if (trace_mask & TRACE_PACKET) {688688+ ret = register_trace_drbd_packet(probe_drbd_packet);689689+ WARN_ON(ret);690690+ }691691+ if (trace_mask & TRACE_MD_IO) {692692+ ret = register_trace_drbd_md_io(probe_drbd_md_io);693693+ WARN_ON(ret);694694+ }695695+ if (trace_mask & TRACE_EPOCH) {696696+ ret = register_trace_drbd_epoch(probe_drbd_epoch);697697+ WARN_ON(ret);698698+ }699699+ if (trace_mask & TRACE_NL) {700700+ ret = register_trace_drbd_netlink(probe_drbd_netlink);701701+ WARN_ON(ret);702702+ }703703+ if (trace_mask & TRACE_AL_EXT) {704704+ ret = register_trace_drbd_actlog(probe_drbd_actlog);705705+ WARN_ON(ret);706706+ }707707+ if (trace_mask & TRACE_RQ) {708708+ ret = register_trace_drbd_bio(probe_drbd_bio);709709+ WARN_ON(ret);710710+ }711711+ if (trace_mask & TRACE_INT_RQ) {712712+ ret = register_trace_drbd_req(probe_drbd_req);713713+ WARN_ON(ret);714714+ }715715+ if (trace_mask & TRACE_RESYNC) {716716+ ret = register_trace__drbd_resync(probe_drbd_resync);717717+ WARN_ON(ret);718718+ }719719+ 
return 0;720720+}721721+722722+module_init(drbd_trace_init);723723+724724+static void __exit drbd_trace_exit(void)725725+{726726+ if (trace_mask & TRACE_UNPLUG)727727+ unregister_trace_drbd_unplug(probe_drbd_unplug);728728+ if (trace_mask & TRACE_UUID)729729+ unregister_trace_drbd_uuid(probe_drbd_uuid);730730+ if (trace_mask & TRACE_EE)731731+ unregister_trace_drbd_ee(probe_drbd_ee);732732+ if (trace_mask & TRACE_PACKET)733733+ unregister_trace_drbd_packet(probe_drbd_packet);734734+ if (trace_mask & TRACE_MD_IO)735735+ unregister_trace_drbd_md_io(probe_drbd_md_io);736736+ if (trace_mask & TRACE_EPOCH)737737+ unregister_trace_drbd_epoch(probe_drbd_epoch);738738+ if (trace_mask & TRACE_NL)739739+ unregister_trace_drbd_netlink(probe_drbd_netlink);740740+ if (trace_mask & TRACE_AL_EXT)741741+ unregister_trace_drbd_actlog(probe_drbd_actlog);742742+ if (trace_mask & TRACE_RQ)743743+ unregister_trace_drbd_bio(probe_drbd_bio);744744+ if (trace_mask & TRACE_INT_RQ)745745+ unregister_trace_drbd_req(probe_drbd_req);746746+ if (trace_mask & TRACE_RESYNC)747747+ unregister_trace__drbd_resync(probe_drbd_resync);748748+749749+ tracepoint_synchronize_unregister();750750+}751751+752752+module_exit(drbd_trace_exit);
+87
drivers/block/drbd/drbd_tracing.h
/*
   drbd_tracing.h

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

 */

#ifndef DRBD_TRACING_H
#define DRBD_TRACING_H

#include <linux/tracepoint.h>
#include "drbd_int.h"
#include "drbd_req.h"

enum {
	TRACE_LVL_ALWAYS = 0,
	TRACE_LVL_SUMMARY,
	TRACE_LVL_METRICS,
	TRACE_LVL_ALL,
	TRACE_LVL_MAX
};

DECLARE_TRACE(drbd_unplug,
	TP_PROTO(struct drbd_conf *mdev, char *msg),
	TP_ARGS(mdev, msg));

DECLARE_TRACE(drbd_uuid,
	TP_PROTO(struct drbd_conf *mdev, enum drbd_uuid_index index),
	TP_ARGS(mdev, index));

DECLARE_TRACE(drbd_ee,
	TP_PROTO(struct drbd_conf *mdev, struct drbd_epoch_entry *e, char *msg),
	TP_ARGS(mdev, e, msg));

DECLARE_TRACE(drbd_md_io,
	TP_PROTO(struct drbd_conf *mdev, int rw, struct drbd_backing_dev *bdev),
	TP_ARGS(mdev, rw, bdev));

DECLARE_TRACE(drbd_epoch,
	TP_PROTO(struct drbd_conf *mdev, struct drbd_epoch *epoch, enum epoch_event ev),
	TP_ARGS(mdev, epoch, ev));

DECLARE_TRACE(drbd_netlink,
	TP_PROTO(void *data, int is_req),
	TP_ARGS(data, is_req));

DECLARE_TRACE(drbd_actlog,
	TP_PROTO(struct drbd_conf *mdev, sector_t sector, char *msg),
	TP_ARGS(mdev, sector, msg));

DECLARE_TRACE(drbd_bio,
	TP_PROTO(struct drbd_conf *mdev, const char *pfx, struct bio *bio, int complete,
		 struct drbd_request *r),
	TP_ARGS(mdev, pfx, bio, complete, r));

DECLARE_TRACE(drbd_req,
	TP_PROTO(struct drbd_request *req, enum drbd_req_event what, char *msg),
	TP_ARGS(req, what, msg));

DECLARE_TRACE(drbd_packet,
	TP_PROTO(struct drbd_conf *mdev, struct socket *sock,
		 int recv, union p_polymorph *p, char *file, int line),
	TP_ARGS(mdev, sock, recv, p, file, line));

DECLARE_TRACE(_drbd_resync,
	TP_PROTO(struct drbd_conf *mdev, int level, const char *fmt, va_list args),
	TP_ARGS(mdev, level, fmt, args));

#endif
+351
drivers/block/drbd/drbd_vli.h
/*
   -*- linux-c -*-
   drbd_vli.h

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#ifndef _DRBD_VLI_H
#define _DRBD_VLI_H

/*
 * At a granularity of 4KiB storage represented per bit,
 * and storage sizes of several TiB,
 * and possibly small-bandwidth replication,
 * the bitmap transfer time can take much too long,
 * if transmitted in plain text.
 *
 * We try to reduce the transferred bitmap information
 * by encoding runlengths of bit polarity.
 *
 * We never actually need to encode a "zero" (runlengths are positive).
 * But then we have to store the value of the first bit.
 * The first bit of information thus shall encode if the first runlength
 * gives the number of set or unset bits.
 *
 * We assume that large areas are either completely set or unset,
 * which gives good compression with any runlength method,
 * even when encoding the runlength as fixed size 32bit/64bit integers.
 *
 * Still, there may be areas where the polarity flips every few bits,
 * and encoding the runlength sequence of those areas with fixed-size
 * integers would be much worse than plaintext.
 *
 * We want to encode small runlength values with minimum code length,
 * while still being able to encode a huge run of all zeros.
 *
 * Thus we need a Variable Length Integer encoding, VLI.
 *
 * For some cases, we produce more code bits than plaintext input.
 * We need to send incompressible chunks as plaintext, skip over them
 * and then see if the next chunk compresses better.
 *
 * We don't care too much about "excellent" compression ratio for large
 * runlengths (all set/all clear): whether we achieve a factor of 100
 * or 1000 is not that much of an issue.
 * We do not want to waste too much on short runlengths in the "noisy"
 * parts of the bitmap, though.
 *
 * There are endless variants of VLI, we experimented with:
 *  * simple byte-based
 *  * various bit based with different code word length.
 *
 * To avoid yet another configuration parameter (choice of bitmap compression
 * algorithm) which was difficult to explain and tune, we just chose the one
 * variant that turned out best in all test cases.
 * Based on real world usage patterns, with device sizes ranging from a few GiB
 * to several TiB, file server/mailserver/webserver/mysql/postgres,
 * mostly idle to really busy, the all time winner (though sometimes only
 * marginally better) is:
 */

/*
 * encoding is "visualised" as
 * __little endian__ bitstream, least significant bit first (left most)
 *
 * this particular encoding is chosen so that the prefix code
 * starts as unary encoding the level, then modified so that
 * 10 levels can be described in 8bit, with minimal overhead
 * for the smaller levels.
 *
 * Number of data bits follow fibonacci sequence, with the exception of the
 * last level (+1 data bit, so it makes 64bit total).  The only worse code when
 * encoding bit polarity runlength is 1 plain bits => 2 code bits.
prefix    data bits                                    max val  Nº data bits
0         x                                                0x2             1
10        x                                                0x4             1
110       xx                                               0x8             2
1110      xxx                                             0x10             3
11110     xxx xx                                          0x30             5
111110    xx xxxxxx                                      0x130             8
11111100  xxxxxxxx xxxxx                                0x2130            13
11111110  xxxxxxxx xxxxxxxx xxxxx                     0x202130            21
11111101  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xx   0x400202130            34
11111111  xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  56
 * maximum encodable value: 0x100000400202130 == 2**56 + some */

/* compression "table":
 transmitted  x                                       0.29
 as plaintext x                                       ........................
              x                                       ........................
              x                                       ........................
              x 0.59                              0.21........................
              x ........................................................
              x  ..  c ...................................................
              x 0.44..  o ...................................................
              x ..........  d ...................................................
              x ..........  e ...................................................
              X.............   ...................................................
              x..............  b ...................................................
           2.0x...............  i ...................................................
            #X................  t ...................................................
            #.................  s ...........................  plain bits  ..........
 -+-----------------------------------------------------------------------
  1             16              32                              64
*/

/* LEVEL: (total bits, prefix bits, prefix value),
 * sorted ascending by number of total bits.
 * The rest of the code table is calculated at compiletime from this. */

/* fibonacci data 1, 1, ... */
#define VLI_L_1_1() do { \
	LEVEL( 2, 1, 0x00); \
	LEVEL( 3, 2, 0x01); \
	LEVEL( 5, 3, 0x03); \
	LEVEL( 7, 4, 0x07); \
	LEVEL(10, 5, 0x0f); \
	LEVEL(14, 6, 0x1f); \
	LEVEL(21, 8, 0x3f); \
	LEVEL(29, 8, 0x7f); \
	LEVEL(42, 8, 0xbf); \
	LEVEL(64, 8, 0xff); \
	} while (0)

/* finds a suitable level to decode the least significant part of in.
 * returns number of bits consumed.
 *
 * BUG() for bad input, as that would mean a buggy code table. */
static inline int vli_decode_bits(u64 *out, const u64 in)
{
	u64 adj = 1;

#define LEVEL(t, b, v)					\
	do {						\
		if ((in & ((1 << b) - 1)) == v) {	\
			*out = ((in & ((~0ULL) >> (64 - t))) >> b) + adj; \
			return t;			\
		}					\
		adj += 1ULL << (t - b);			\
	} while (0)

	VLI_L_1_1();

	/* NOT REACHED, if VLI_LEVELS code table is defined properly */
	BUG();
#undef LEVEL
}

/* return number of code bits needed,
 * or negative error number */
static inline int __vli_encode_bits(u64 *out, const u64 in)
{
	u64 max = 0;
	u64 adj = 1;

	if (in == 0)
		return -EINVAL;

#define LEVEL(t, b, v) do {			\
		max += 1ULL << (t - b);		\
		if (in <= max) {		\
			if (out)		\
				*out = ((in - adj) << b) | v; \
			return t;		\
		}				\
		adj = max + 1;			\
	} while (0)

	VLI_L_1_1();

	return -EOVERFLOW;
#undef LEVEL
}

#undef VLI_L_1_1

/* code from here down is independent of the actually used bit code */

/*
 * Code length is determined by some unique (e.g. unary) prefix.
 * This encodes arbitrary bit length, not whole bytes: we have a bit-stream,
 * not a byte stream.
 */

/* for the bitstream, we need a cursor */
struct bitstream_cursor {
	/* the current byte */
	u8 *b;
	/* the current bit within *b, normalized: 0..7 */
	unsigned int bit;
};

/* initialize cursor to point to first bit of stream */
static inline void bitstream_cursor_reset(struct bitstream_cursor *cur, void *s)
{
	cur->b = s;
	cur->bit = 0;
}

/* advance cursor by that many bits; maximum expected input value: 64,
 * but depending on VLI implementation, it may be more. */
static inline void bitstream_cursor_advance(struct bitstream_cursor *cur, unsigned int bits)
{
	bits += cur->bit;
	cur->b = cur->b + (bits >> 3);
	cur->bit = bits & 7;
}

/* the bitstream itself knows its length */
struct bitstream {
	struct bitstream_cursor cur;
	unsigned char *buf;
	size_t buf_len;		/* in bytes */

	/* for input stream:
	 * number of trailing 0 bits for padding
	 * total number of valid bits in stream: buf_len * 8 - pad_bits */
	unsigned int pad_bits;
};

static inline void bitstream_init(struct bitstream *bs, void *s, size_t len, unsigned int pad_bits)
{
	bs->buf = s;
	bs->buf_len = len;
	bs->pad_bits = pad_bits;
	bitstream_cursor_reset(&bs->cur, bs->buf);
}

static inline void bitstream_rewind(struct bitstream *bs)
{
	bitstream_cursor_reset(&bs->cur, bs->buf);
	memset(bs->buf, 0, bs->buf_len);
}

/* Put (at most 64) least significant bits of val into bitstream, and advance cursor.
 * Ignores "pad_bits".
 * Returns zero if bits == 0 (nothing to do).
 * Returns number of bits used if successful.
 *
 * If there is not enough room left in bitstream,
 * leaves bitstream unchanged and returns -ENOBUFS.
 */
static inline int bitstream_put_bits(struct bitstream *bs, u64 val, const unsigned int bits)
{
	unsigned char *b = bs->cur.b;
	unsigned int tmp;

	if (bits == 0)
		return 0;

	if ((bs->cur.b + ((bs->cur.bit + bits - 1) >> 3)) - bs->buf >= bs->buf_len)
		return -ENOBUFS;

	/* paranoia: strip off hi bits; they should not be set anyways. */
	if (bits < 64)
		val &= ~0ULL >> (64 - bits);

	*b++ |= (val & 0xff) << bs->cur.bit;

	for (tmp = 8 - bs->cur.bit; tmp < bits; tmp += 8)
		*b++ |= (val >> tmp) & 0xff;

	bitstream_cursor_advance(&bs->cur, bits);
	return bits;
}

/* Fetch (at most 64) bits from bitstream into *out, and advance cursor.
 *
 * If more than 64 bits are requested, returns -EINVAL and leaves *out unchanged.
 *
 * If there are fewer than the requested number of valid bits left in the
 * bitstream, still fetches all available bits.
 *
 * Returns number of actually fetched bits.
 */
static inline int bitstream_get_bits(struct bitstream *bs, u64 *out, int bits)
{
	u64 val;
	unsigned int n;

	if (bits > 64)
		return -EINVAL;

	if (bs->cur.b + ((bs->cur.bit + bs->pad_bits + bits - 1) >> 3) - bs->buf >= bs->buf_len)
		bits = ((bs->buf_len - (bs->cur.b - bs->buf)) << 3)
			- bs->cur.bit - bs->pad_bits;

	if (bits == 0) {
		*out = 0;
		return 0;
	}

	/* get the high bits */
	val = 0;
	n = (bs->cur.bit + bits + 7) >> 3;
	/* n may be at most 9, if cur.bit + bits > 64 */
	/* which means this copies at most 8 byte */
	if (n) {
		memcpy(&val, bs->cur.b + 1, n - 1);
		val = le64_to_cpu(val) << (8 - bs->cur.bit);
	}

	/* we still need the low bits */
	val |= bs->cur.b[0] >> bs->cur.bit;

	/* and mask out bits we don't want */
	val &= ~0ULL >> (64 - bits);

	bitstream_cursor_advance(&bs->cur, bits);
	*out = val;

	return bits;
}

/* encodes @in as vli into @bs;
 *
 * return values
 *  > 0: number of bits successfully stored in bitstream
 * -ENOBUFS	@bs is full
 * -EINVAL	input zero (invalid)
 * -EOVERFLOW	input too large for this vli code (invalid)
 */
static inline int vli_encode_bits(struct bitstream *bs, u64 in)
{
	u64 code = code; /* self-init: quiet a spurious "may be used uninitialized" */
	int bits = __vli_encode_bits(&code, in);

	if (bits <= 0)
		return bits;

	return bitstream_put_bits(bs, code, bits);
}

#endif
+1529
drivers/block/drbd/drbd_worker.c
/*
   drbd_worker.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 1999-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2002-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

 */

#include <linux/autoconf.h>
#include <linux/module.h>
#include <linux/version.h>
#include <linux/drbd.h>
#include <linux/sched.h>
#include <linux/smp_lock.h>
#include <linux/wait.h>
#include <linux/mm.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/string.h>
#include <linux/scatterlist.h>

#include "drbd_int.h"
#include "drbd_req.h"
#include "drbd_tracing.h"

#define SLEEP_TIME (HZ/10)

static int w_make_ov_request(struct drbd_conf *mdev, struct drbd_work *w, int cancel);



/* defined here:
   drbd_md_io_complete
   drbd_endio_write_sec
   drbd_endio_read_sec
   drbd_endio_pri

 * more endio handlers:
   atodb_endio in drbd_actlog.c
   drbd_bm_async_io_complete in drbd_bitmap.c

 * For all these callbacks, note the following:
 * The callbacks will be called in irq context by the IDE drivers,
 * and in Softirqs/Tasklets/BH context by the SCSI drivers.
 * Try to get the locking right :)
 *
 */


/* About the global_state_lock
   Each state transition on a device holds a read lock. In case we have
   to evaluate the sync after dependencies, we grab a write lock, because
   we need stable states on all devices for that.  */
rwlock_t global_state_lock;

/* used for synchronous meta data and bitmap IO
 * submitted by drbd_md_sync_page_io()
 */
void drbd_md_io_complete(struct bio *bio, int error)
{
	struct drbd_md_io *md_io;

	md_io = (struct drbd_md_io *)bio->bi_private;
	md_io->error = error;

	trace_drbd_bio(md_io->mdev, "Md", bio, 1, NULL);

	complete(&md_io->event);
}

/* reads on behalf of the partner,
 * "submitted" by the receiver
 */
void drbd_endio_read_sec(struct bio *bio, int error) __releases(local)
{
	unsigned long flags = 0;
	struct drbd_epoch_entry *e = NULL;
	struct drbd_conf *mdev;
	int uptodate = bio_flagged(bio, BIO_UPTODATE);

	e = bio->bi_private;
	mdev = e->mdev;

	if (error)
		dev_warn(DEV, "read: error=%d s=%llus\n", error,
			 (unsigned long long)e->sector);
	if (!error && !uptodate) {
		dev_warn(DEV, "read: setting error to -EIO s=%llus\n",
			 (unsigned long long)e->sector);
		/* strange behavior of some lower level drivers...
		 * fail the request by clearing the uptodate flag,
		 * but do not return any error?! */
		error = -EIO;
	}

	D_ASSERT(e->block_id != ID_VACANT);

	trace_drbd_bio(mdev, "Sec", bio, 1, NULL);

	spin_lock_irqsave(&mdev->req_lock, flags);
	mdev->read_cnt += e->size >> 9;
	list_del(&e->w.list);
	if (list_empty(&mdev->read_ee))
		wake_up(&mdev->ee_wait);
	spin_unlock_irqrestore(&mdev->req_lock, flags);

	drbd_chk_io_error(mdev, error, FALSE);
	drbd_queue_work(&mdev->data.work, &e->w);
	put_ldev(mdev);

	trace_drbd_ee(mdev, e, "read completed");
}

/* writes on behalf of the partner, or resync writes,
 * "submitted" by the receiver.
 */
void drbd_endio_write_sec(struct bio *bio, int error) __releases(local)
{
	unsigned long flags = 0;
	struct drbd_epoch_entry *e = NULL;
	struct drbd_conf *mdev;
	sector_t e_sector;
	int do_wake;
	int is_syncer_req;
	int do_al_complete_io;
	int uptodate = bio_flagged(bio, BIO_UPTODATE);
	int is_barrier = bio_rw_flagged(bio, BIO_RW_BARRIER);

	e = bio->bi_private;
	mdev = e->mdev;

	if (error)
		dev_warn(DEV, "write: error=%d s=%llus\n", error,
			 (unsigned long long)e->sector);
	if (!error && !uptodate) {
		dev_warn(DEV, "write: setting error to -EIO s=%llus\n",
			 (unsigned long long)e->sector);
		/* strange behavior of some lower level drivers...
		 * fail the request by clearing the uptodate flag,
		 * but do not return any error?! */
		error = -EIO;
	}

	/* error == -ENOTSUPP would be a better test,
	 * alas it is not reliable */
	if (error && is_barrier && e->flags & EE_IS_BARRIER) {
		drbd_bump_write_ordering(mdev, WO_bdev_flush);
		spin_lock_irqsave(&mdev->req_lock, flags);
		list_del(&e->w.list);
		e->w.cb = w_e_reissue;
		/* put_ldev actually happens below, once we come here again. */
		__release(local);
		spin_unlock_irqrestore(&mdev->req_lock, flags);
		drbd_queue_work(&mdev->data.work, &e->w);
		return;
	}

	D_ASSERT(e->block_id != ID_VACANT);

	trace_drbd_bio(mdev, "Sec", bio, 1, NULL);

	spin_lock_irqsave(&mdev->req_lock, flags);
	mdev->writ_cnt += e->size >> 9;
	is_syncer_req = is_syncer_block_id(e->block_id);

	/* after we moved e to done_ee,
	 * we may no longer access it,
	 * it may be freed/reused already!
	 * (as soon as we release the req_lock) */
	e_sector = e->sector;
	do_al_complete_io = e->flags & EE_CALL_AL_COMPLETE_IO;

	list_del(&e->w.list); /* has been on active_ee or sync_ee */
	list_add_tail(&e->w.list, &mdev->done_ee);

	trace_drbd_ee(mdev, e, "write completed");

	/* No hlist_del_init(&e->colision) here, we did not send the Ack yet,
	 * neither did we wake possibly waiting conflicting requests.
	 * done from "drbd_process_done_ee" within the appropriate w.cb
	 * (e_end_block/e_end_resync_block) or from _drbd_clear_done_ee */

	do_wake = is_syncer_req
		? list_empty(&mdev->sync_ee)
		: list_empty(&mdev->active_ee);

	if (error)
		__drbd_chk_io_error(mdev, FALSE);
	spin_unlock_irqrestore(&mdev->req_lock, flags);

	if (is_syncer_req)
		drbd_rs_complete_io(mdev, e_sector);

	if (do_wake)
		wake_up(&mdev->ee_wait);

	if (do_al_complete_io)
		drbd_al_complete_io(mdev, e_sector);

	wake_asender(mdev);
	put_ldev(mdev);

}

/* read, readA or write requests on R_PRIMARY coming from drbd_make_request
 */
void drbd_endio_pri(struct bio *bio, int error)
{
	unsigned long flags;
	struct drbd_request *req = bio->bi_private;
	struct drbd_conf *mdev = req->mdev;
	struct bio_and_error m;
	enum drbd_req_event what;
	int uptodate = bio_flagged(bio, BIO_UPTODATE);

	if (error)
		dev_warn(DEV, "p %s: error=%d\n",
			 bio_data_dir(bio) == WRITE ? "write" : "read", error);
	if (!error && !uptodate) {
		dev_warn(DEV, "p %s: setting error to -EIO\n",
			 bio_data_dir(bio) == WRITE ? "write" : "read");
		/* strange behavior of some lower level drivers...
		 * fail the request by clearing the uptodate flag,
		 * but do not return any error?! */
		error = -EIO;
	}

	trace_drbd_bio(mdev, "Pri", bio, 1, NULL);

	/* to avoid recursion in __req_mod */
	if (unlikely(error)) {
		what = (bio_data_dir(bio) == WRITE)
			? write_completed_with_error
			: (bio_rw(bio) == READA)
			? read_ahead_completed_with_error
			: read_completed_with_error;
	} else
		what = completed_ok;

	bio_put(req->private_bio);
	req->private_bio = ERR_PTR(error);

	spin_lock_irqsave(&mdev->req_lock, flags);
	__req_mod(req, what, &m);
	spin_unlock_irqrestore(&mdev->req_lock, flags);

	if (m.bio)
		complete_master_bio(mdev, &m);
}

int w_io_error(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
{
	struct drbd_request *req = container_of(w, struct drbd_request, w);

	/* NOTE: mdev->ldev can be NULL by the time we get here! */
	/* D_ASSERT(mdev->ldev->dc.on_io_error != EP_PASS_ON); */

	/* the only way this callback is scheduled is from _req_may_be_done,
	 * when it is done and had a local write error, see comments there */
	drbd_req_free(req);

	return TRUE;
}

int w_read_retry_remote(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
{
	struct drbd_request *req = container_of(w, struct drbd_request, w);

	/* We should not detach for read io-error,
	 * but try to WRITE the P_DATA_REPLY to the failed location,
	 * to give the disk the chance to relocate that block */

	spin_lock_irq(&mdev->req_lock);
	if (cancel ||
	    mdev->state.conn < C_CONNECTED ||
	    mdev->state.pdsk <= D_INCONSISTENT) {
		_req_mod(req, send_canceled);
		spin_unlock_irq(&mdev->req_lock);
		dev_alert(DEV, "WE ARE LOST. Local IO failure, no peer.\n");
		return 1;
	}
	spin_unlock_irq(&mdev->req_lock);

	return w_send_read_req(mdev, w, 0);
}

int w_resync_inactive(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
{
	ERR_IF(cancel) return 1;
	dev_err(DEV, "resync inactive, but callback triggered??\n");
	return 1; /* Simply ignore this! */
}

void drbd_csum(struct drbd_conf *mdev, struct crypto_hash *tfm, struct bio *bio, void *digest)
{
	struct hash_desc desc;
	struct scatterlist sg;
	struct bio_vec *bvec;
	int i;

	desc.tfm = tfm;
	desc.flags = 0;

	sg_init_table(&sg, 1);
	crypto_hash_init(&desc);

	__bio_for_each_segment(bvec, bio, i, 0) {
		sg_set_page(&sg, bvec->bv_page, bvec->bv_len, bvec->bv_offset);
		crypto_hash_update(&desc, &sg, sg.length);
	}
	crypto_hash_final(&desc, digest);
}

static int w_e_send_csum(struct drbd_conf *mdev, struct drbd_work *w, int cancel)
{
	struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);
	int digest_size;
	void *digest;
	int ok;

	D_ASSERT(e->block_id == DRBD_MAGIC + 0xbeef);

	if (unlikely(cancel)) {
		drbd_free_ee(mdev, e);
		return 1;
	}

	if (likely(drbd_bio_uptodate(e->private_bio))) {
		digest_size = crypto_hash_digestsize(mdev->csums_tfm);
		digest = kmalloc(digest_size, GFP_NOIO);
		if (digest) {
			drbd_csum(mdev, mdev->csums_tfm, e->private_bio, digest);

			inc_rs_pending(mdev);
			ok = drbd_send_drequest_csum(mdev,
						     e->sector,
						     e->size,
						     digest,
						     digest_size,
						     P_CSUM_RS_REQUEST);
			kfree(digest);
		} else {
			dev_err(DEV, "kmalloc() of digest failed.\n");
			ok = 0;
		}
	} else
		ok = 1;

	drbd_free_ee(mdev, e);

	if (unlikely(!ok))
		dev_err(DEV, "drbd_send_drequest(..., csum) failed\n");
	return ok;
}

#define GFP_TRY (__GFP_HIGHMEM | __GFP_NOWARN)

static int read_for_csum(struct drbd_conf *mdev, sector_t sector, int size)
{
	struct drbd_epoch_entry *e;

	if (!get_ldev(mdev))
		return 0;

	/* GFP_TRY, because
if there is no memory available right now, this may386386+ * be rescheduled for later. It is "only" background resync, after all. */387387+ e = drbd_alloc_ee(mdev, DRBD_MAGIC+0xbeef, sector, size, GFP_TRY);388388+ if (!e) {389389+ put_ldev(mdev);390390+ return 2;391391+ }392392+393393+ spin_lock_irq(&mdev->req_lock);394394+ list_add(&e->w.list, &mdev->read_ee);395395+ spin_unlock_irq(&mdev->req_lock);396396+397397+ e->private_bio->bi_end_io = drbd_endio_read_sec;398398+ e->private_bio->bi_rw = READ;399399+ e->w.cb = w_e_send_csum;400400+401401+ mdev->read_cnt += size >> 9;402402+ drbd_generic_make_request(mdev, DRBD_FAULT_RS_RD, e->private_bio);403403+404404+ return 1;405405+}406406+407407+void resync_timer_fn(unsigned long data)408408+{409409+ unsigned long flags;410410+ struct drbd_conf *mdev = (struct drbd_conf *) data;411411+ int queue;412412+413413+ spin_lock_irqsave(&mdev->req_lock, flags);414414+415415+ if (likely(!test_and_clear_bit(STOP_SYNC_TIMER, &mdev->flags))) {416416+ queue = 1;417417+ if (mdev->state.conn == C_VERIFY_S)418418+ mdev->resync_work.cb = w_make_ov_request;419419+ else420420+ mdev->resync_work.cb = w_make_resync_request;421421+ } else {422422+ queue = 0;423423+ mdev->resync_work.cb = w_resync_inactive;424424+ }425425+426426+ spin_unlock_irqrestore(&mdev->req_lock, flags);427427+428428+ /* harmless race: list_empty outside data.work.q_lock */429429+ if (list_empty(&mdev->resync_work.list) && queue)430430+ drbd_queue_work(&mdev->data.work, &mdev->resync_work);431431+}432432+433433+int w_make_resync_request(struct drbd_conf *mdev,434434+ struct drbd_work *w, int cancel)435435+{436436+ unsigned long bit;437437+ sector_t sector;438438+ const sector_t capacity = drbd_get_capacity(mdev->this_bdev);439439+ int max_segment_size = queue_max_segment_size(mdev->rq_queue);440440+ int number, i, size, pe, mx;441441+ int align, queued, sndbuf;442442+443443+ if (unlikely(cancel))444444+ return 1;445445+446446+ if (unlikely(mdev->state.conn < C_CONNECTED)) 
{447447+ dev_err(DEV, "Confused in w_make_resync_request()! cstate < Connected");448448+ return 0;449449+ }450450+451451+ if (mdev->state.conn != C_SYNC_TARGET)452452+ dev_err(DEV, "%s in w_make_resync_request\n",453453+ drbd_conn_str(mdev->state.conn));454454+455455+ if (!get_ldev(mdev)) {456456+ /* Since we only need to access mdev->rsync a457457+ get_ldev_if_state(mdev,D_FAILED) would be sufficient, but458458+ to continue resync with a broken disk makes no sense at459459+ all */460460+ dev_err(DEV, "Disk broke down during resync!\n");461461+ mdev->resync_work.cb = w_resync_inactive;462462+ return 1;463463+ }464464+465465+ number = SLEEP_TIME * mdev->sync_conf.rate / ((BM_BLOCK_SIZE/1024)*HZ);466466+ pe = atomic_read(&mdev->rs_pending_cnt);467467+468468+ mutex_lock(&mdev->data.mutex);469469+ if (mdev->data.socket)470470+ mx = mdev->data.socket->sk->sk_rcvbuf / sizeof(struct p_block_req);471471+ else472472+ mx = 1;473473+ mutex_unlock(&mdev->data.mutex);474474+475475+ /* For resync rates >160MB/sec, allow more pending RS requests */476476+ if (number > mx)477477+ mx = number;478478+479479+ /* Limit the number of pending RS requests to no more than the peer's receive buffer */480480+ if ((pe + number) > mx) {481481+ number = mx - pe;482482+ }483483+484484+ for (i = 0; i < number; i++) {485485+ /* Stop generating RS requests, when half of the send buffer is filled */486486+ mutex_lock(&mdev->data.mutex);487487+ if (mdev->data.socket) {488488+ queued = mdev->data.socket->sk->sk_wmem_queued;489489+ sndbuf = mdev->data.socket->sk->sk_sndbuf;490490+ } else {491491+ queued = 1;492492+ sndbuf = 0;493493+ }494494+ mutex_unlock(&mdev->data.mutex);495495+ if (queued > sndbuf / 2)496496+ goto requeue;497497+498498+next_sector:499499+ size = BM_BLOCK_SIZE;500500+ bit = drbd_bm_find_next(mdev, mdev->bm_resync_fo);501501+502502+ if (bit == -1UL) {503503+ mdev->bm_resync_fo = drbd_bm_bits(mdev);504504+ mdev->resync_work.cb = w_resync_inactive;505505+ put_ldev(mdev);506506+ return 
1;507507+ }508508+509509+ sector = BM_BIT_TO_SECT(bit);510510+511511+ if (drbd_try_rs_begin_io(mdev, sector)) {512512+ mdev->bm_resync_fo = bit;513513+ goto requeue;514514+ }515515+ mdev->bm_resync_fo = bit + 1;516516+517517+ if (unlikely(drbd_bm_test_bit(mdev, bit) == 0)) {518518+ drbd_rs_complete_io(mdev, sector);519519+ goto next_sector;520520+ }521521+522522+#if DRBD_MAX_SEGMENT_SIZE > BM_BLOCK_SIZE523523+ /* try to find some adjacent bits.524524+ * we stop if we have already the maximum req size.525525+ *526526+ * Additionally always align bigger requests, in order to527527+ * be prepared for all stripe sizes of software RAIDs.528528+ *529529+ * we _do_ care about the agreed-upon q->max_segment_size530530+ * here, as splitting up the requests on the other side is more531531+ * difficult. the consequence is, that on lvm and md and other532532+ * "indirect" devices, this is dead code, since533533+ * q->max_segment_size will be PAGE_SIZE.534534+ */535535+ align = 1;536536+ for (;;) {537537+ if (size + BM_BLOCK_SIZE > max_segment_size)538538+ break;539539+540540+ /* Be always aligned */541541+ if (sector & ((1<<(align+3))-1))542542+ break;543543+544544+ /* do not cross extent boundaries */545545+ if (((bit+1) & BM_BLOCKS_PER_BM_EXT_MASK) == 0)546546+ break;547547+ /* now, is it actually dirty, after all?548548+ * caution, drbd_bm_test_bit is tri-state for some549549+ * obscure reason; ( b == 0 ) would get the out-of-band550550+ * only accidentally right because of the "oddly sized"551551+ * adjustment below */552552+ if (drbd_bm_test_bit(mdev, bit+1) != 1)553553+ break;554554+ bit++;555555+ size += BM_BLOCK_SIZE;556556+ if ((BM_BLOCK_SIZE << align) <= size)557557+ align++;558558+ i++;559559+ }560560+ /* if we merged some,561561+ * reset the offset to start the next drbd_bm_find_next from */562562+ if (size > BM_BLOCK_SIZE)563563+ mdev->bm_resync_fo = bit + 1;564564+#endif565565+566566+ /* adjust very last sectors, in case we are oddly sized */567567+ if (sector + 
(size>>9) > capacity)568568+ size = (capacity-sector)<<9;569569+ if (mdev->agreed_pro_version >= 89 && mdev->csums_tfm) {570570+ switch (read_for_csum(mdev, sector, size)) {571571+ case 0: /* Disk failure*/572572+ put_ldev(mdev);573573+ return 0;574574+ case 2: /* Allocation failed */575575+ drbd_rs_complete_io(mdev, sector);576576+ mdev->bm_resync_fo = BM_SECT_TO_BIT(sector);577577+ goto requeue;578578+ /* case 1: everything ok */579579+ }580580+ } else {581581+ inc_rs_pending(mdev);582582+ if (!drbd_send_drequest(mdev, P_RS_DATA_REQUEST,583583+ sector, size, ID_SYNCER)) {584584+ dev_err(DEV, "drbd_send_drequest() failed, aborting...\n");585585+ dec_rs_pending(mdev);586586+ put_ldev(mdev);587587+ return 0;588588+ }589589+ }590590+ }591591+592592+ if (mdev->bm_resync_fo >= drbd_bm_bits(mdev)) {593593+ /* last syncer _request_ was sent,594594+ * but the P_RS_DATA_REPLY not yet received. sync will end (and595595+ * next sync group will resume), as soon as we receive the last596596+ * resync data block, and the last bit is cleared.597597+ * until then resync "work" is "inactive" ...598598+ */599599+ mdev->resync_work.cb = w_resync_inactive;600600+ put_ldev(mdev);601601+ return 1;602602+ }603603+604604+ requeue:605605+ mod_timer(&mdev->resync_timer, jiffies + SLEEP_TIME);606606+ put_ldev(mdev);607607+ return 1;608608+}609609+610610+static int w_make_ov_request(struct drbd_conf *mdev, struct drbd_work *w, int cancel)611611+{612612+ int number, i, size;613613+ sector_t sector;614614+ const sector_t capacity = drbd_get_capacity(mdev->this_bdev);615615+616616+ if (unlikely(cancel))617617+ return 1;618618+619619+ if (unlikely(mdev->state.conn < C_CONNECTED)) {620620+ dev_err(DEV, "Confused in w_make_ov_request()! 
cstate < Connected");621621+ return 0;622622+ }623623+624624+ number = SLEEP_TIME*mdev->sync_conf.rate / ((BM_BLOCK_SIZE/1024)*HZ);625625+ if (atomic_read(&mdev->rs_pending_cnt) > number)626626+ goto requeue;627627+628628+ number -= atomic_read(&mdev->rs_pending_cnt);629629+630630+ sector = mdev->ov_position;631631+ for (i = 0; i < number; i++) {632632+ if (sector >= capacity) {633633+ mdev->resync_work.cb = w_resync_inactive;634634+ return 1;635635+ }636636+637637+ size = BM_BLOCK_SIZE;638638+639639+ if (drbd_try_rs_begin_io(mdev, sector)) {640640+ mdev->ov_position = sector;641641+ goto requeue;642642+ }643643+644644+ if (sector + (size>>9) > capacity)645645+ size = (capacity-sector)<<9;646646+647647+ inc_rs_pending(mdev);648648+ if (!drbd_send_ov_request(mdev, sector, size)) {649649+ dec_rs_pending(mdev);650650+ return 0;651651+ }652652+ sector += BM_SECT_PER_BIT;653653+ }654654+ mdev->ov_position = sector;655655+656656+ requeue:657657+ mod_timer(&mdev->resync_timer, jiffies + SLEEP_TIME);658658+ return 1;659659+}660660+661661+662662+int w_ov_finished(struct drbd_conf *mdev, struct drbd_work *w, int cancel)663663+{664664+ kfree(w);665665+ ov_oos_print(mdev);666666+ drbd_resync_finished(mdev);667667+668668+ return 1;669669+}670670+671671+static int w_resync_finished(struct drbd_conf *mdev, struct drbd_work *w, int cancel)672672+{673673+ kfree(w);674674+675675+ drbd_resync_finished(mdev);676676+677677+ return 1;678678+}679679+680680+int drbd_resync_finished(struct drbd_conf *mdev)681681+{682682+ unsigned long db, dt, dbdt;683683+ unsigned long n_oos;684684+ union drbd_state os, ns;685685+ struct drbd_work *w;686686+ char *khelper_cmd = NULL;687687+688688+ /* Remove all elements from the resync LRU. Since future actions689689+ * might set bits in the (main) bitmap, then the entries in the690690+ * resync LRU would be wrong. 
*/691691+ if (drbd_rs_del_all(mdev)) {692692+ /* In case this is not possible now, most probably because693693+ * there are P_RS_DATA_REPLY Packets lingering on the worker's694694+ * queue (or even the read operations for those packets695695+ * is not finished by now). Retry in 100ms. */696696+697697+ drbd_kick_lo(mdev);698698+ __set_current_state(TASK_INTERRUPTIBLE);699699+ schedule_timeout(HZ / 10);700700+ w = kmalloc(sizeof(struct drbd_work), GFP_ATOMIC);701701+ if (w) {702702+ w->cb = w_resync_finished;703703+ drbd_queue_work(&mdev->data.work, w);704704+ return 1;705705+ }706706+ dev_err(DEV, "Warn failed to drbd_rs_del_all() and to kmalloc(w).\n");707707+ }708708+709709+ dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;710710+ if (dt <= 0)711711+ dt = 1;712712+ db = mdev->rs_total;713713+ dbdt = Bit2KB(db/dt);714714+ mdev->rs_paused /= HZ;715715+716716+ if (!get_ldev(mdev))717717+ goto out;718718+719719+ spin_lock_irq(&mdev->req_lock);720720+ os = mdev->state;721721+722722+ /* This protects us against multiple calls (that can happen in the presence723723+ of application IO), and against connectivity loss just before we arrive here. 
*/724724+ if (os.conn <= C_CONNECTED)725725+ goto out_unlock;726726+727727+ ns = os;728728+ ns.conn = C_CONNECTED;729729+730730+ dev_info(DEV, "%s done (total %lu sec; paused %lu sec; %lu K/sec)\n",731731+ (os.conn == C_VERIFY_S || os.conn == C_VERIFY_T) ?732732+ "Online verify " : "Resync",733733+ dt + mdev->rs_paused, mdev->rs_paused, dbdt);734734+735735+ n_oos = drbd_bm_total_weight(mdev);736736+737737+ if (os.conn == C_VERIFY_S || os.conn == C_VERIFY_T) {738738+ if (n_oos) {739739+ dev_alert(DEV, "Online verify found %lu %dk block out of sync!\n",740740+ n_oos, Bit2KB(1));741741+ khelper_cmd = "out-of-sync";742742+ }743743+ } else {744744+ D_ASSERT((n_oos - mdev->rs_failed) == 0);745745+746746+ if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T)747747+ khelper_cmd = "after-resync-target";748748+749749+ if (mdev->csums_tfm && mdev->rs_total) {750750+ const unsigned long s = mdev->rs_same_csum;751751+ const unsigned long t = mdev->rs_total;752752+ const int ratio =753753+ (t == 0) ? 0 :754754+ (t < 100000) ? 
((s*100)/t) : (s/(t/100));755755+ dev_info(DEV, "%u %% had equal check sums, eliminated: %luK; "756756+ "transferred %luK total %luK\n",757757+ ratio,758758+ Bit2KB(mdev->rs_same_csum),759759+ Bit2KB(mdev->rs_total - mdev->rs_same_csum),760760+ Bit2KB(mdev->rs_total));761761+ }762762+ }763763+764764+ if (mdev->rs_failed) {765765+ dev_info(DEV, " %lu failed blocks\n", mdev->rs_failed);766766+767767+ if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T) {768768+ ns.disk = D_INCONSISTENT;769769+ ns.pdsk = D_UP_TO_DATE;770770+ } else {771771+ ns.disk = D_UP_TO_DATE;772772+ ns.pdsk = D_INCONSISTENT;773773+ }774774+ } else {775775+ ns.disk = D_UP_TO_DATE;776776+ ns.pdsk = D_UP_TO_DATE;777777+778778+ if (os.conn == C_SYNC_TARGET || os.conn == C_PAUSED_SYNC_T) {779779+ if (mdev->p_uuid) {780780+ int i;781781+ for (i = UI_BITMAP ; i <= UI_HISTORY_END ; i++)782782+ _drbd_uuid_set(mdev, i, mdev->p_uuid[i]);783783+ drbd_uuid_set(mdev, UI_BITMAP, mdev->ldev->md.uuid[UI_CURRENT]);784784+ _drbd_uuid_set(mdev, UI_CURRENT, mdev->p_uuid[UI_CURRENT]);785785+ } else {786786+ dev_err(DEV, "mdev->p_uuid is NULL! BUG\n");787787+ }788788+ }789789+790790+ drbd_uuid_set_bm(mdev, 0UL);791791+792792+ if (mdev->p_uuid) {793793+ /* Now the two UUID sets are equal, update what we794794+ * know of the peer. 
*/795795+ int i;796796+ for (i = UI_CURRENT ; i <= UI_HISTORY_END ; i++)797797+ mdev->p_uuid[i] = mdev->ldev->md.uuid[i];798798+ }799799+ }800800+801801+ _drbd_set_state(mdev, ns, CS_VERBOSE, NULL);802802+out_unlock:803803+ spin_unlock_irq(&mdev->req_lock);804804+ put_ldev(mdev);805805+out:806806+ mdev->rs_total = 0;807807+ mdev->rs_failed = 0;808808+ mdev->rs_paused = 0;809809+ mdev->ov_start_sector = 0;810810+811811+ if (test_and_clear_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags)) {812812+ dev_warn(DEV, "Writing the whole bitmap, due to failed kmalloc\n");813813+ drbd_queue_bitmap_io(mdev, &drbd_bm_write, NULL, "write from resync_finished");814814+ }815815+816816+ if (khelper_cmd)817817+ drbd_khelper(mdev, khelper_cmd);818818+819819+ return 1;820820+}821821+822822+/* helper */823823+static void move_to_net_ee_or_free(struct drbd_conf *mdev, struct drbd_epoch_entry *e)824824+{825825+ if (drbd_bio_has_active_page(e->private_bio)) {826826+ /* This might happen if sendpage() has not finished */827827+ spin_lock_irq(&mdev->req_lock);828828+ list_add_tail(&e->w.list, &mdev->net_ee);829829+ spin_unlock_irq(&mdev->req_lock);830830+ } else831831+ drbd_free_ee(mdev, e);832832+}833833+834834+/**835835+ * w_e_end_data_req() - Worker callback, to send a P_DATA_REPLY packet in response to a P_DATA_REQUEST836836+ * @mdev: DRBD device.837837+ * @w: work object.838838+ * @cancel: The connection will be closed anyways839839+ */840840+int w_e_end_data_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)841841+{842842+ struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);843843+ int ok;844844+845845+ if (unlikely(cancel)) {846846+ drbd_free_ee(mdev, e);847847+ dec_unacked(mdev);848848+ return 1;849849+ }850850+851851+ if (likely(drbd_bio_uptodate(e->private_bio))) {852852+ ok = drbd_send_block(mdev, P_DATA_REPLY, e);853853+ } else {854854+ if (__ratelimit(&drbd_ratelimit_state))855855+ dev_err(DEV, "Sending NegDReply. 
sector=%llus.\n",856856+ (unsigned long long)e->sector);857857+858858+ ok = drbd_send_ack(mdev, P_NEG_DREPLY, e);859859+ }860860+861861+ dec_unacked(mdev);862862+863863+ move_to_net_ee_or_free(mdev, e);864864+865865+ if (unlikely(!ok))866866+ dev_err(DEV, "drbd_send_block() failed\n");867867+ return ok;868868+}869869+870870+/**871871+ * w_e_end_rsdata_req() - Worker callback to send a P_RS_DATA_REPLY packet in response to a P_RS_DATA_REQUESTRS872872+ * @mdev: DRBD device.873873+ * @w: work object.874874+ * @cancel: The connection will be closed anyways875875+ */876876+int w_e_end_rsdata_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)877877+{878878+ struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);879879+ int ok;880880+881881+ if (unlikely(cancel)) {882882+ drbd_free_ee(mdev, e);883883+ dec_unacked(mdev);884884+ return 1;885885+ }886886+887887+ if (get_ldev_if_state(mdev, D_FAILED)) {888888+ drbd_rs_complete_io(mdev, e->sector);889889+ put_ldev(mdev);890890+ }891891+892892+ if (likely(drbd_bio_uptodate(e->private_bio))) {893893+ if (likely(mdev->state.pdsk >= D_INCONSISTENT)) {894894+ inc_rs_pending(mdev);895895+ ok = drbd_send_block(mdev, P_RS_DATA_REPLY, e);896896+ } else {897897+ if (__ratelimit(&drbd_ratelimit_state))898898+ dev_err(DEV, "Not sending RSDataReply, "899899+ "partner DISKLESS!\n");900900+ ok = 1;901901+ }902902+ } else {903903+ if (__ratelimit(&drbd_ratelimit_state))904904+ dev_err(DEV, "Sending NegRSDReply. 
sector %llus.\n",905905+ (unsigned long long)e->sector);906906+907907+ ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);908908+909909+ /* update resync data with failure */910910+ drbd_rs_failed_io(mdev, e->sector, e->size);911911+ }912912+913913+ dec_unacked(mdev);914914+915915+ move_to_net_ee_or_free(mdev, e);916916+917917+ if (unlikely(!ok))918918+ dev_err(DEV, "drbd_send_block() failed\n");919919+ return ok;920920+}921921+922922+int w_e_end_csum_rs_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)923923+{924924+ struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);925925+ struct digest_info *di;926926+ int digest_size;927927+ void *digest = NULL;928928+ int ok, eq = 0;929929+930930+ if (unlikely(cancel)) {931931+ drbd_free_ee(mdev, e);932932+ dec_unacked(mdev);933933+ return 1;934934+ }935935+936936+ drbd_rs_complete_io(mdev, e->sector);937937+938938+ di = (struct digest_info *)(unsigned long)e->block_id;939939+940940+ if (likely(drbd_bio_uptodate(e->private_bio))) {941941+ /* quick hack to try to avoid a race against reconfiguration.942942+ * a real fix would be much more involved,943943+ * introducing more locking mechanisms */944944+ if (mdev->csums_tfm) {945945+ digest_size = crypto_hash_digestsize(mdev->csums_tfm);946946+ D_ASSERT(digest_size == di->digest_size);947947+ digest = kmalloc(digest_size, GFP_NOIO);948948+ }949949+ if (digest) {950950+ drbd_csum(mdev, mdev->csums_tfm, e->private_bio, digest);951951+ eq = !memcmp(digest, di->digest, digest_size);952952+ kfree(digest);953953+ }954954+955955+ if (eq) {956956+ drbd_set_in_sync(mdev, e->sector, e->size);957957+ mdev->rs_same_csum++;958958+ ok = drbd_send_ack(mdev, P_RS_IS_IN_SYNC, e);959959+ } else {960960+ inc_rs_pending(mdev);961961+ e->block_id = ID_SYNCER;962962+ ok = drbd_send_block(mdev, P_RS_DATA_REPLY, e);963963+ }964964+ } else {965965+ ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);966966+ if (__ratelimit(&drbd_ratelimit_state))967967+ dev_err(DEV, "Sending 
NegDReply. I guess it gets messy.\n");968968+ }969969+970970+ dec_unacked(mdev);971971+972972+ kfree(di);973973+974974+ move_to_net_ee_or_free(mdev, e);975975+976976+ if (unlikely(!ok))977977+ dev_err(DEV, "drbd_send_block/ack() failed\n");978978+ return ok;979979+}980980+981981+int w_e_end_ov_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)982982+{983983+ struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);984984+ int digest_size;985985+ void *digest;986986+ int ok = 1;987987+988988+ if (unlikely(cancel))989989+ goto out;990990+991991+ if (unlikely(!drbd_bio_uptodate(e->private_bio)))992992+ goto out;993993+994994+ digest_size = crypto_hash_digestsize(mdev->verify_tfm);995995+ /* FIXME if this allocation fails, online verify will not terminate! */996996+ digest = kmalloc(digest_size, GFP_NOIO);997997+ if (digest) {998998+ drbd_csum(mdev, mdev->verify_tfm, e->private_bio, digest);999999+ inc_rs_pending(mdev);10001000+ ok = drbd_send_drequest_csum(mdev, e->sector, e->size,10011001+ digest, digest_size, P_OV_REPLY);10021002+ if (!ok)10031003+ dec_rs_pending(mdev);10041004+ kfree(digest);10051005+ }10061006+10071007+out:10081008+ drbd_free_ee(mdev, e);10091009+10101010+ dec_unacked(mdev);10111011+10121012+ return ok;10131013+}10141014+10151015+void drbd_ov_oos_found(struct drbd_conf *mdev, sector_t sector, int size)10161016+{10171017+ if (mdev->ov_last_oos_start + mdev->ov_last_oos_size == sector) {10181018+ mdev->ov_last_oos_size += size>>9;10191019+ } else {10201020+ mdev->ov_last_oos_start = sector;10211021+ mdev->ov_last_oos_size = size>>9;10221022+ }10231023+ drbd_set_out_of_sync(mdev, sector, size);10241024+ set_bit(WRITE_BM_AFTER_RESYNC, &mdev->flags);10251025+}10261026+10271027+int w_e_end_ov_reply(struct drbd_conf *mdev, struct drbd_work *w, int cancel)10281028+{10291029+ struct drbd_epoch_entry *e = container_of(w, struct drbd_epoch_entry, w);10301030+ struct digest_info *di;10311031+ int digest_size;10321032+ void 
*digest;10331033+ int ok, eq = 0;10341034+10351035+ if (unlikely(cancel)) {10361036+ drbd_free_ee(mdev, e);10371037+ dec_unacked(mdev);10381038+ return 1;10391039+ }10401040+10411041+ /* after "cancel", because after drbd_disconnect/drbd_rs_cancel_all10421042+ * the resync lru has been cleaned up already */10431043+ drbd_rs_complete_io(mdev, e->sector);10441044+10451045+ di = (struct digest_info *)(unsigned long)e->block_id;10461046+10471047+ if (likely(drbd_bio_uptodate(e->private_bio))) {10481048+ digest_size = crypto_hash_digestsize(mdev->verify_tfm);10491049+ digest = kmalloc(digest_size, GFP_NOIO);10501050+ if (digest) {10511051+ drbd_csum(mdev, mdev->verify_tfm, e->private_bio, digest);10521052+10531053+ D_ASSERT(digest_size == di->digest_size);10541054+ eq = !memcmp(digest, di->digest, digest_size);10551055+ kfree(digest);10561056+ }10571057+ } else {10581058+ ok = drbd_send_ack(mdev, P_NEG_RS_DREPLY, e);10591059+ if (__ratelimit(&drbd_ratelimit_state))10601060+ dev_err(DEV, "Sending NegDReply. I guess it gets messy.\n");10611061+ }10621062+10631063+ dec_unacked(mdev);10641064+10651065+ kfree(di);10661066+10671067+ if (!eq)10681068+ drbd_ov_oos_found(mdev, e->sector, e->size);10691069+ else10701070+ ov_oos_print(mdev);10711071+10721072+ ok = drbd_send_ack_ex(mdev, P_OV_RESULT, e->sector, e->size,10731073+ eq ? 
ID_IN_SYNC : ID_OUT_OF_SYNC);10741074+10751075+ drbd_free_ee(mdev, e);10761076+10771077+ if (--mdev->ov_left == 0) {10781078+ ov_oos_print(mdev);10791079+ drbd_resync_finished(mdev);10801080+ }10811081+10821082+ return ok;10831083+}10841084+10851085+int w_prev_work_done(struct drbd_conf *mdev, struct drbd_work *w, int cancel)10861086+{10871087+ struct drbd_wq_barrier *b = container_of(w, struct drbd_wq_barrier, w);10881088+ complete(&b->done);10891089+ return 1;10901090+}10911091+10921092+int w_send_barrier(struct drbd_conf *mdev, struct drbd_work *w, int cancel)10931093+{10941094+ struct drbd_tl_epoch *b = container_of(w, struct drbd_tl_epoch, w);10951095+ struct p_barrier *p = &mdev->data.sbuf.barrier;10961096+ int ok = 1;10971097+10981098+ /* really avoid racing with tl_clear. w.cb may have been referenced10991099+ * just before it was reassigned and re-queued, so double check that.11001100+ * actually, this race was harmless, since we only try to send the11011101+ * barrier packet here, and otherwise do nothing with the object.11021102+ * but compare with the head of w_clear_epoch */11031103+ spin_lock_irq(&mdev->req_lock);11041104+ if (w->cb != w_send_barrier || mdev->state.conn < C_CONNECTED)11051105+ cancel = 1;11061106+ spin_unlock_irq(&mdev->req_lock);11071107+ if (cancel)11081108+ return 1;11091109+11101110+ if (!drbd_get_data_sock(mdev))11111111+ return 0;11121112+ p->barrier = b->br_number;11131113+ /* inc_ap_pending was done where this was queued.11141114+ * dec_ap_pending will be done in got_BarrierAck11151115+ * or (on connection loss) in w_clear_epoch. 
*/11161116+ ok = _drbd_send_cmd(mdev, mdev->data.socket, P_BARRIER,11171117+ (struct p_header *)p, sizeof(*p), 0);11181118+ drbd_put_data_sock(mdev);11191119+11201120+ return ok;11211121+}11221122+11231123+int w_send_write_hint(struct drbd_conf *mdev, struct drbd_work *w, int cancel)11241124+{11251125+ if (cancel)11261126+ return 1;11271127+ return drbd_send_short_cmd(mdev, P_UNPLUG_REMOTE);11281128+}11291129+11301130+/**11311131+ * w_send_dblock() - Worker callback to send a P_DATA packet in order to mirror a write request11321132+ * @mdev: DRBD device.11331133+ * @w: work object.11341134+ * @cancel: The connection will be closed anyways11351135+ */11361136+int w_send_dblock(struct drbd_conf *mdev, struct drbd_work *w, int cancel)11371137+{11381138+ struct drbd_request *req = container_of(w, struct drbd_request, w);11391139+ int ok;11401140+11411141+ if (unlikely(cancel)) {11421142+ req_mod(req, send_canceled);11431143+ return 1;11441144+ }11451145+11461146+ ok = drbd_send_dblock(mdev, req);11471147+ req_mod(req, ok ? handed_over_to_network : send_failed);11481148+11491149+ return ok;11501150+}11511151+11521152+/**11531153+ * w_send_read_req() - Worker callback to send a read request (P_DATA_REQUEST) packet11541154+ * @mdev: DRBD device.11551155+ * @w: work object.11561156+ * @cancel: The connection will be closed anyways11571157+ */11581158+int w_send_read_req(struct drbd_conf *mdev, struct drbd_work *w, int cancel)11591159+{11601160+ struct drbd_request *req = container_of(w, struct drbd_request, w);11611161+ int ok;11621162+11631163+ if (unlikely(cancel)) {11641164+ req_mod(req, send_canceled);11651165+ return 1;11661166+ }11671167+11681168+ ok = drbd_send_drequest(mdev, P_DATA_REQUEST, req->sector, req->size,11691169+ (unsigned long)req);11701170+11711171+ if (!ok) {11721172+ /* ?? 
we set C_TIMEOUT or C_BROKEN_PIPE in drbd_send();11731173+ * so this is probably redundant */11741174+ if (mdev->state.conn >= C_CONNECTED)11751175+ drbd_force_state(mdev, NS(conn, C_NETWORK_FAILURE));11761176+ }11771177+ req_mod(req, ok ? handed_over_to_network : send_failed);11781178+11791179+ return ok;11801180+}11811181+11821182+static int _drbd_may_sync_now(struct drbd_conf *mdev)11831183+{11841184+ struct drbd_conf *odev = mdev;11851185+11861186+ while (1) {11871187+ if (odev->sync_conf.after == -1)11881188+ return 1;11891189+ odev = minor_to_mdev(odev->sync_conf.after);11901190+ ERR_IF(!odev) return 1;11911191+ if ((odev->state.conn >= C_SYNC_SOURCE &&11921192+ odev->state.conn <= C_PAUSED_SYNC_T) ||11931193+ odev->state.aftr_isp || odev->state.peer_isp ||11941194+ odev->state.user_isp)11951195+ return 0;11961196+ }11971197+}11981198+11991199+/**12001200+ * _drbd_pause_after() - Pause resync on all devices that may not resync now12011201+ * @mdev: DRBD device.12021202+ *12031203+ * Called from process context only (admin command and after_state_ch).12041204+ */12051205+static int _drbd_pause_after(struct drbd_conf *mdev)12061206+{12071207+ struct drbd_conf *odev;12081208+ int i, rv = 0;12091209+12101210+ for (i = 0; i < minor_count; i++) {12111211+ odev = minor_to_mdev(i);12121212+ if (!odev)12131213+ continue;12141214+ if (odev->state.conn == C_STANDALONE && odev->state.disk == D_DISKLESS)12151215+ continue;12161216+ if (!_drbd_may_sync_now(odev))12171217+ rv |= (__drbd_set_state(_NS(odev, aftr_isp, 1), CS_HARD, NULL)12181218+ != SS_NOTHING_TO_DO);12191219+ }12201220+12211221+ return rv;12221222+}12231223+12241224+/**12251225+ * _drbd_resume_next() - Resume resync on all devices that may resync now12261226+ * @mdev: DRBD device.12271227+ *12281228+ * Called from process context only (admin command and worker).12291229+ */12301230+static int _drbd_resume_next(struct drbd_conf *mdev)12311231+{12321232+ struct drbd_conf *odev;12331233+ int i, rv = 
	for (i = 0; i < minor_count; i++) {
		odev = minor_to_mdev(i);
		if (!odev)
			continue;
		if (odev->state.conn == C_STANDALONE && odev->state.disk == D_DISKLESS)
			continue;
		if (odev->state.aftr_isp) {
			if (_drbd_may_sync_now(odev))
				rv |= (__drbd_set_state(_NS(odev, aftr_isp, 0),
							CS_HARD, NULL)
				       != SS_NOTHING_TO_DO) ;
		}
	}
	return rv;
}

void resume_next_sg(struct drbd_conf *mdev)
{
	write_lock_irq(&global_state_lock);
	_drbd_resume_next(mdev);
	write_unlock_irq(&global_state_lock);
}

void suspend_other_sg(struct drbd_conf *mdev)
{
	write_lock_irq(&global_state_lock);
	_drbd_pause_after(mdev);
	write_unlock_irq(&global_state_lock);
}

static int sync_after_error(struct drbd_conf *mdev, int o_minor)
{
	struct drbd_conf *odev;

	if (o_minor == -1)
		return NO_ERROR;
	if (o_minor < -1 || minor_to_mdev(o_minor) == NULL)
		return ERR_SYNC_AFTER;

	/* check for loops */
	odev = minor_to_mdev(o_minor);
	while (1) {
		if (odev == mdev)
			return ERR_SYNC_AFTER_CYCLE;

		/* dependency chain ends here, no cycles. */
		if (odev->sync_conf.after == -1)
			return NO_ERROR;

		/* follow the dependency chain */
		odev = minor_to_mdev(odev->sync_conf.after);
	}
}

int drbd_alter_sa(struct drbd_conf *mdev, int na)
{
	int changes;
	int retcode;

	write_lock_irq(&global_state_lock);
	retcode = sync_after_error(mdev, na);
	if (retcode == NO_ERROR) {
		mdev->sync_conf.after = na;
		do {
			changes  = _drbd_pause_after(mdev);
			changes |= _drbd_resume_next(mdev);
		} while (changes);
	}
	write_unlock_irq(&global_state_lock);
	return retcode;
}

/**
 * drbd_start_resync() - Start the resync process
 * @mdev:	DRBD device.
 * @side:	Either C_SYNC_SOURCE or C_SYNC_TARGET
 *
 * This function might bring you directly into one of the
 * C_PAUSED_SYNC_* states.
 */
void drbd_start_resync(struct drbd_conf *mdev, enum drbd_conns side)
{
	union drbd_state ns;
	int r;

	if (mdev->state.conn >= C_SYNC_SOURCE) {
		dev_err(DEV, "Resync already running!\n");
		return;
	}

	trace_drbd_resync(mdev, TRACE_LVL_SUMMARY, "Resync starting: side=%s\n",
			  side == C_SYNC_TARGET ? "SyncTarget" : "SyncSource");

	/* In case a previous resync run was aborted by an IO error/detach on the peer. */
	drbd_rs_cancel_all(mdev);

	if (side == C_SYNC_TARGET) {
		/* Since application IO was locked out during C_WF_BITMAP_T and
		   C_WF_SYNC_UUID we are still unmodified. Before going to C_SYNC_TARGET
		   we check that we might make the data inconsistent. */
		r = drbd_khelper(mdev, "before-resync-target");
		r = (r >> 8) & 0xff;
		if (r > 0) {
			dev_info(DEV, "before-resync-target handler returned %d, "
				 "dropping connection.\n", r);
			drbd_force_state(mdev, NS(conn, C_DISCONNECTING));
			return;
		}
	}

	drbd_state_lock(mdev);

	if (!get_ldev_if_state(mdev, D_NEGOTIATING)) {
		drbd_state_unlock(mdev);
		return;
	}

	if (side == C_SYNC_TARGET) {
		mdev->bm_resync_fo = 0;
	} else /* side == C_SYNC_SOURCE */ {
		u64 uuid;

		get_random_bytes(&uuid, sizeof(u64));
		drbd_uuid_set(mdev, UI_BITMAP, uuid);
		drbd_send_sync_uuid(mdev, uuid);

		D_ASSERT(mdev->state.disk == D_UP_TO_DATE);
	}

	write_lock_irq(&global_state_lock);
	ns = mdev->state;

	ns.aftr_isp = !_drbd_may_sync_now(mdev);

	ns.conn = side;

	if (side == C_SYNC_TARGET)
		ns.disk = D_INCONSISTENT;
	else /* side == C_SYNC_SOURCE */
		ns.pdsk = D_INCONSISTENT;

	r = __drbd_set_state(mdev, ns, CS_VERBOSE, NULL);
	ns = mdev->state;

	if (ns.conn < C_CONNECTED)
		r = SS_UNKNOWN_ERROR;

	if (r == SS_SUCCESS) {
		mdev->rs_total     =
		mdev->rs_mark_left = drbd_bm_total_weight(mdev);
		mdev->rs_failed    = 0;
		mdev->rs_paused    = 0;
		mdev->rs_start     =
		mdev->rs_mark_time = jiffies;
		mdev->rs_same_csum = 0;
		_drbd_pause_after(mdev);
	}
	write_unlock_irq(&global_state_lock);
	drbd_state_unlock(mdev);
	put_ldev(mdev);

	if (r == SS_SUCCESS) {
		dev_info(DEV, "Began resync as %s (will sync %lu KB [%lu bits set]).\n",
			 drbd_conn_str(ns.conn),
			 (unsigned long) mdev->rs_total << (BM_BLOCK_SHIFT-10),
			 (unsigned long) mdev->rs_total);

		if (mdev->rs_total == 0) {
			/* Peer still reachable? Beware of failing before-resync-target handlers! */
			request_ping(mdev);
			__set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(mdev->net_conf->ping_timeo*HZ/9); /* 9 instead of 10 */
			drbd_resync_finished(mdev);
			return;
		}

		/* ns.conn may already be != mdev->state.conn,
		 * we may have been paused in between, or become paused until
		 * the timer triggers.
		 * No matter, that is handled in resync_timer_fn() */
		if (ns.conn == C_SYNC_TARGET)
			mod_timer(&mdev->resync_timer, jiffies);

		drbd_md_sync(mdev);
	}
}

int drbd_worker(struct drbd_thread *thi)
{
	struct drbd_conf *mdev = thi->mdev;
	struct drbd_work *w = NULL;
	LIST_HEAD(work_list);
	int intr = 0, i;

	sprintf(current->comm, "drbd%d_worker", mdev_to_minor(mdev));

	while (get_t_state(thi) == Running) {
		drbd_thread_current_set_cpu(mdev);

		if (down_trylock(&mdev->data.work.s)) {
			mutex_lock(&mdev->data.mutex);
			if (mdev->data.socket && !mdev->net_conf->no_cork)
				drbd_tcp_uncork(mdev->data.socket);
			mutex_unlock(&mdev->data.mutex);

			intr = down_interruptible(&mdev->data.work.s);

			mutex_lock(&mdev->data.mutex);
			if (mdev->data.socket && !mdev->net_conf->no_cork)
				drbd_tcp_cork(mdev->data.socket);
			mutex_unlock(&mdev->data.mutex);
		}

		if (intr) {
			D_ASSERT(intr == -EINTR);
			flush_signals(current);
			ERR_IF (get_t_state(thi) == Running)
				continue;
			break;
		}

		if (get_t_state(thi) != Running)
			break;
		/* With this break, we have done a down() but not consumed
		   the entry from the list. The cleanup code takes care of
		   this...   */

		w = NULL;
		spin_lock_irq(&mdev->data.work.q_lock);
		ERR_IF(list_empty(&mdev->data.work.q)) {
			/* something terribly wrong in our logic.
			 * we were able to down() the semaphore,
			 * but the list is empty... doh.
			 *
			 * what is the best thing to do now?
			 * try again from scratch, restarting the receiver,
			 * asender, whatnot? could break even more ugly,
			 * e.g. when we are primary, but no good local data.
			 *
			 * I'll try to get away just starting over this loop.
			 */
			spin_unlock_irq(&mdev->data.work.q_lock);
			continue;
		}
		w = list_entry(mdev->data.work.q.next, struct drbd_work, list);
		list_del_init(&w->list);
		spin_unlock_irq(&mdev->data.work.q_lock);

		if (!w->cb(mdev, w, mdev->state.conn < C_CONNECTED)) {
			/* dev_warn(DEV, "worker: a callback failed!\n"); */
			if (mdev->state.conn >= C_CONNECTED)
				drbd_force_state(mdev,
						NS(conn, C_NETWORK_FAILURE));
		}
	}
	D_ASSERT(test_bit(DEVICE_DYING, &mdev->flags));
	D_ASSERT(test_bit(CONFIG_PENDING, &mdev->flags));

	spin_lock_irq(&mdev->data.work.q_lock);
	i = 0;
	while (!list_empty(&mdev->data.work.q)) {
		list_splice_init(&mdev->data.work.q, &work_list);
		spin_unlock_irq(&mdev->data.work.q_lock);

		while (!list_empty(&work_list)) {
			w = list_entry(work_list.next, struct drbd_work, list);
			list_del_init(&w->list);
			w->cb(mdev, w, 1);
			i++; /* dead debugging code */
		}

		spin_lock_irq(&mdev->data.work.q_lock);
	}
	sema_init(&mdev->data.work.s, 0);
	/* DANGEROUS race: if someone did queue his work within the spinlock,
	 * but up() ed outside the spinlock, we could get an up() on the
	 * semaphore without corresponding list entry.
	 * So don't do that.
	 */
	spin_unlock_irq(&mdev->data.work.q_lock);

	D_ASSERT(mdev->state.disk == D_DISKLESS && mdev->state.conn == C_STANDALONE);
	/* _drbd_set_state only uses stop_nowait.
	 * wait here for the Exiting receiver. */
	drbd_thread_stop(&mdev->receiver);
	drbd_mdev_cleanup(mdev);

	dev_info(DEV, "worker terminated\n");

	clear_bit(DEVICE_DYING, &mdev->flags);
	clear_bit(CONFIG_PENDING, &mdev->flags);
	wake_up(&mdev->state_wait);

	return 0;
}
+91
drivers/block/drbd/drbd_wrappers.h
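The cycle check in sync_after_error() above is the interesting bit: sync-after dependencies form chains of minors, and a new dependency is only accepted if walking the chain from the proposed target never arrives back at the device itself. A minimal user-space sketch of that walk, with hypothetical simplified stand-ins for minor_to_mdev() and sync_conf.after (names and MINOR_COUNT are illustrative, not DRBD's):

```c
#include <assert.h>

#define MINOR_COUNT 4

/* each device may name another minor it syncs "after", or -1 for none */
struct dev { int after; };
static struct dev devs[MINOR_COUNT];

enum { NO_ERROR = 101, ERR_SYNC_AFTER = 132, ERR_SYNC_AFTER_CYCLE = 133 };

/* Same walk as sync_after_error(): follow the chain starting at o_minor.
 * Reaching "self" means the new edge would close a cycle; reaching a
 * device with after == -1 means the chain ends cleanly. */
static int check_sync_after(int self, int o_minor)
{
	int cur;

	if (o_minor == -1)
		return NO_ERROR;
	if (o_minor < -1 || o_minor >= MINOR_COUNT)
		return ERR_SYNC_AFTER;

	for (cur = o_minor; ; cur = devs[cur].after) {
		if (cur == self)
			return ERR_SYNC_AFTER_CYCLE;
		if (devs[cur].after == -1)
			return NO_ERROR;
	}
}
```

As in the kernel version, no pre-existing cycle can be present (every edge was validated this way before being installed), so the walk always terminates.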
+91
drivers/block/drbd/drbd_wrappers.h
···
#ifndef _DRBD_WRAPPERS_H
#define _DRBD_WRAPPERS_H

#include <linux/ctype.h>
#include <linux/mm.h>

/* see get_sb_bdev and bd_claim */
extern char *drbd_sec_holder;

/* sets the number of 512 byte sectors of our virtual device */
static inline void drbd_set_my_capacity(struct drbd_conf *mdev,
					sector_t size)
{
	/* set_capacity(mdev->this_bdev->bd_disk, size); */
	set_capacity(mdev->vdisk, size);
	mdev->this_bdev->bd_inode->i_size = (loff_t)size << 9;
}

#define drbd_bio_uptodate(bio) bio_flagged(bio, BIO_UPTODATE)

static inline int drbd_bio_has_active_page(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	__bio_for_each_segment(bvec, bio, i, 0) {
		if (page_count(bvec->bv_page) > 1)
			return 1;
	}

	return 0;
}

/* bi_end_io handlers */
extern void drbd_md_io_complete(struct bio *bio, int error);
extern void drbd_endio_read_sec(struct bio *bio, int error);
extern void drbd_endio_write_sec(struct bio *bio, int error);
extern void drbd_endio_pri(struct bio *bio, int error);

/*
 * used to submit our private bio
 */
static inline void drbd_generic_make_request(struct drbd_conf *mdev,
					     int fault_type, struct bio *bio)
{
	__release(local);
	if (!bio->bi_bdev) {
		printk(KERN_ERR "drbd%d: drbd_generic_make_request: "
				"bio->bi_bdev == NULL\n",
		       mdev_to_minor(mdev));
		dump_stack();
		bio_endio(bio, -ENODEV);
		return;
	}

	if (FAULT_ACTIVE(mdev, fault_type))
		bio_endio(bio, -EIO);
	else
		generic_make_request(bio);
}

static inline void drbd_plug_device(struct drbd_conf *mdev)
{
	struct request_queue *q;
	q = bdev_get_queue(mdev->this_bdev);

	spin_lock_irq(q->queue_lock);

/* XXX the check on !blk_queue_plugged is redundant,
 * implicitly checked in blk_plug_device */

	if (!blk_queue_plugged(q)) {
		blk_plug_device(q);
		del_timer(&q->unplug_timer);
		/* unplugging should not happen automatically... */
	}
	spin_unlock_irq(q->queue_lock);
}

static inline int drbd_crypto_is_hash(struct crypto_tfm *tfm)
{
	return (crypto_tfm_alg_type(tfm) & CRYPTO_ALG_TYPE_HASH_MASK)
		== CRYPTO_ALG_TYPE_HASH;
}

#ifndef __CHECKER__
# undef __cond_lock
# define __cond_lock(x, c) (c)
#endif

#endif
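drbd_generic_make_request() above gates every private bio through FAULT_ACTIVE(): when the configured fault class is active, the bio is completed immediately with -EIO instead of being submitted. A deterministic user-space sketch of that gating idea, with illustrative names (the fault classes, mask, and submit_request() are assumptions for the sketch, not DRBD's actual API):

```c
#include <assert.h>
#include <errno.h>

/* fault classes, analogous to DRBD's fault_type argument */
enum fault_type { FT_META_WRITE, FT_DATA_WRITE, FT_DATA_READ };

/* bitmask of enabled fault classes (DRBD additionally applies a rate) */
static unsigned int fault_enable_mask;

static int fault_active(enum fault_type t)
{
	return (fault_enable_mask >> t) & 1;
}

/* mirror of the submit path: a faulted class completes the request
 * with -EIO; otherwise the real submission would run.
 * Returns 1 if the request was submitted, 0 if it was failed early. */
static int submit_request(enum fault_type t, int *completion_err)
{
	if (fault_active(t)) {
		*completion_err = -EIO;	/* like bio_endio(bio, -EIO) */
		return 0;
	}
	*completion_err = 0;		/* generic_make_request() would run */
	return 1;
}
```

The value of routing all submissions through one wrapper is exactly this: a single test hook covers every IO path.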
+349
include/linux/drbd.h
···
/*
  drbd.h
  Kernel module for 2.6.x Kernels

  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

  Copyright (C) 2001-2008, LINBIT Information Technologies GmbH.
  Copyright (C) 2001-2008, Philipp Reisner <philipp.reisner@linbit.com>.
  Copyright (C) 2001-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

  drbd is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; either version 2, or (at your option)
  any later version.

  drbd is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with drbd; see the file COPYING.  If not, write to
  the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

*/
#ifndef DRBD_H
#define DRBD_H
#include <linux/connector.h>
#include <asm/types.h>

#ifdef __KERNEL__
#include <linux/types.h>
#include <asm/byteorder.h>
#else
#include <sys/types.h>
#include <sys/wait.h>
#include <limits.h>

/* Although the Linux source code makes a difference between
   generic endianness and the bitfields' endianness, there is no
   architecture as of Linux-2.6.24-rc4 where the bitfields' endianness
   does not match the generic endianness. */

#if __BYTE_ORDER == __LITTLE_ENDIAN
#define __LITTLE_ENDIAN_BITFIELD
#elif __BYTE_ORDER == __BIG_ENDIAN
#define __BIG_ENDIAN_BITFIELD
#else
# error "sorry, weird endianness on this box"
#endif

#endif


extern const char *drbd_buildtag(void);
#define REL_VERSION "8.3.3rc2"
#define API_VERSION 88
#define PRO_VERSION_MIN 86
#define PRO_VERSION_MAX 91


enum drbd_io_error_p {
	EP_PASS_ON, /* FIXME should this better be named "Ignore"? */
	EP_CALL_HELPER,
	EP_DETACH
};

enum drbd_fencing_p {
	FP_DONT_CARE,
	FP_RESOURCE,
	FP_STONITH
};

enum drbd_disconnect_p {
	DP_RECONNECT,
	DP_DROP_NET_CONF,
	DP_FREEZE_IO
};

enum drbd_after_sb_p {
	ASB_DISCONNECT,
	ASB_DISCARD_YOUNGER_PRI,
	ASB_DISCARD_OLDER_PRI,
	ASB_DISCARD_ZERO_CHG,
	ASB_DISCARD_LEAST_CHG,
	ASB_DISCARD_LOCAL,
	ASB_DISCARD_REMOTE,
	ASB_CONSENSUS,
	ASB_DISCARD_SECONDARY,
	ASB_CALL_HELPER,
	ASB_VIOLENTLY
};

/* KEEP the order, do not delete or insert. Only append. */
enum drbd_ret_codes {
	ERR_CODE_BASE		= 100,
	NO_ERROR		= 101,
	ERR_LOCAL_ADDR		= 102,
	ERR_PEER_ADDR		= 103,
	ERR_OPEN_DISK		= 104,
	ERR_OPEN_MD_DISK	= 105,
	ERR_DISK_NOT_BDEV	= 107,
	ERR_MD_NOT_BDEV		= 108,
	ERR_DISK_TO_SMALL	= 111,
	ERR_MD_DISK_TO_SMALL	= 112,
	ERR_BDCLAIM_DISK	= 114,
	ERR_BDCLAIM_MD_DISK	= 115,
	ERR_MD_IDX_INVALID	= 116,
	ERR_IO_MD_DISK		= 118,
	ERR_MD_INVALID		= 119,
	ERR_AUTH_ALG		= 120,
	ERR_AUTH_ALG_ND		= 121,
	ERR_NOMEM		= 122,
	ERR_DISCARD		= 123,
	ERR_DISK_CONFIGURED	= 124,
	ERR_NET_CONFIGURED	= 125,
	ERR_MANDATORY_TAG	= 126,
	ERR_MINOR_INVALID	= 127,
	ERR_INTR		= 129, /* EINTR */
	ERR_RESIZE_RESYNC	= 130,
	ERR_NO_PRIMARY		= 131,
	ERR_SYNC_AFTER		= 132,
	ERR_SYNC_AFTER_CYCLE	= 133,
	ERR_PAUSE_IS_SET	= 134,
	ERR_PAUSE_IS_CLEAR	= 135,
	ERR_PACKET_NR		= 137,
	ERR_NO_DISK		= 138,
	ERR_NOT_PROTO_C		= 139,
	ERR_NOMEM_BITMAP	= 140,
	ERR_INTEGRITY_ALG	= 141, /* DRBD 8.2 only */
	ERR_INTEGRITY_ALG_ND	= 142, /* DRBD 8.2 only */
	ERR_CPU_MASK_PARSE	= 143, /* DRBD 8.2 only */
	ERR_CSUMS_ALG		= 144, /* DRBD 8.2 only */
	ERR_CSUMS_ALG_ND	= 145, /* DRBD 8.2 only */
	ERR_VERIFY_ALG		= 146, /* DRBD 8.2 only */
	ERR_VERIFY_ALG_ND	= 147, /* DRBD 8.2 only */
	ERR_CSUMS_RESYNC_RUNNING= 148, /* DRBD 8.2 only */
	ERR_VERIFY_RUNNING	= 149, /* DRBD 8.2 only */
	ERR_DATA_NOT_CURRENT	= 150,
	ERR_CONNECTED		= 151, /* DRBD 8.3 only */

	/* insert new ones above this line */
	AFTER_LAST_ERR_CODE
};

#define DRBD_PROT_A   1
#define DRBD_PROT_B   2
#define DRBD_PROT_C   3

enum drbd_role {
	R_UNKNOWN = 0,
	R_PRIMARY = 1,     /* role */
	R_SECONDARY = 2,   /* role */
	R_MASK = 3,
};

/* The order of these constants is important.
 * The lower ones (< C_WF_REPORT_PARAMS) indicate
 * that there is no socket!
 * >= C_WF_REPORT_PARAMS ==> There is a socket
 */
enum drbd_conns {
	C_STANDALONE,
	C_DISCONNECTING,  /* Temporary state on the way to StandAlone. */
	C_UNCONNECTED,    /* >= C_UNCONNECTED -> inc_net() succeeds */

	/* These temporary states are all used on the way
	 * from >= C_CONNECTED to Unconnected.
	 * They are the 'disconnect reason' states;
	 * we do not allow transitions between them. */
	C_TIMEOUT,
	C_BROKEN_PIPE,
	C_NETWORK_FAILURE,
	C_PROTOCOL_ERROR,
	C_TEAR_DOWN,

	C_WF_CONNECTION,
	C_WF_REPORT_PARAMS, /* we have a socket */
	C_CONNECTED,      /* we have introduced each other */
	C_STARTING_SYNC_S,  /* starting full sync by admin request. */
	C_STARTING_SYNC_T,  /* starting full sync by admin request. */
	C_WF_BITMAP_S,
	C_WF_BITMAP_T,
	C_WF_SYNC_UUID,

	/* All SyncStates are tested with this comparison
	 * xx >= C_SYNC_SOURCE && xx <= C_PAUSED_SYNC_T */
	C_SYNC_SOURCE,
	C_SYNC_TARGET,
	C_VERIFY_S,
	C_VERIFY_T,
	C_PAUSED_SYNC_S,
	C_PAUSED_SYNC_T,
	C_MASK = 31
};

enum drbd_disk_state {
	D_DISKLESS,
	D_ATTACHING,      /* In the process of reading the meta-data */
	D_FAILED,         /* Becomes D_DISKLESS as soon as we have told the peer */
			  /* when >= D_FAILED it is legal to access mdev->bc */
	D_NEGOTIATING,    /* Late attaching state, we need to talk to the peer */
	D_INCONSISTENT,
	D_OUTDATED,
	D_UNKNOWN,        /* Only used for the peer, never for myself */
	D_CONSISTENT,     /* Might be D_OUTDATED, might be D_UP_TO_DATE ... */
	D_UP_TO_DATE,     /* Only this disk state allows applications' IO ! */
	D_MASK = 15
};

union drbd_state {
/* According to gcc's docs, the order of allocation of bit-fields
 * within a unit (C90 6.5.2.1, C99 6.7.2.1) is determined by the ABI.
 * Pointed out by Maxim Uvarov <muvarov@ru.mvista.com>:
 * even though we transmit as "cpu_to_be32(state)",
 * the offsets of the bitfields still need to be swapped
 * on different endianness.
 */
	struct {
#if defined(__LITTLE_ENDIAN_BITFIELD)
		unsigned role:2 ;	/* 3/4	 primary/secondary/unknown */
		unsigned peer:2 ;	/* 3/4	 primary/secondary/unknown */
		unsigned conn:5 ;	/* 17/32 cstates */
		unsigned disk:4 ;	/* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
		unsigned pdsk:4 ;	/* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
		unsigned susp:1 ;	/* 2/2	 IO suspended no/yes */
		unsigned aftr_isp:1 ;	/* isp .. imposed sync pause */
		unsigned peer_isp:1 ;
		unsigned user_isp:1 ;
		unsigned _pad:11;	/* 0	 unused */
#elif defined(__BIG_ENDIAN_BITFIELD)
		unsigned _pad:11;	/* 0	 unused */
		unsigned user_isp:1 ;
		unsigned peer_isp:1 ;
		unsigned aftr_isp:1 ;	/* isp .. imposed sync pause */
		unsigned susp:1 ;	/* 2/2	 IO suspended no/yes */
		unsigned pdsk:4 ;	/* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
		unsigned disk:4 ;	/* 8/16	 from D_DISKLESS to D_UP_TO_DATE */
		unsigned conn:5 ;	/* 17/32 cstates */
		unsigned peer:2 ;	/* 3/4	 primary/secondary/unknown */
		unsigned role:2 ;	/* 3/4	 primary/secondary/unknown */
#else
# error "this endianness is not supported"
#endif
	};
	unsigned int i;
};

enum drbd_state_ret_codes {
	SS_CW_NO_NEED = 4,
	SS_CW_SUCCESS = 3,
	SS_NOTHING_TO_DO = 2,
	SS_SUCCESS = 1,
	SS_UNKNOWN_ERROR = 0, /* Used to sleep longer in _drbd_request_state */
	SS_TWO_PRIMARIES = -1,
	SS_NO_UP_TO_DATE_DISK = -2,
	SS_NO_LOCAL_DISK = -4,
	SS_NO_REMOTE_DISK = -5,
	SS_CONNECTED_OUTDATES = -6,
	SS_PRIMARY_NOP = -7,
	SS_RESYNC_RUNNING = -8,
	SS_ALREADY_STANDALONE = -9,
	SS_CW_FAILED_BY_PEER = -10,
	SS_IS_DISKLESS = -11,
	SS_DEVICE_IN_USE = -12,
	SS_NO_NET_CONFIG = -13,
	SS_NO_VERIFY_ALG = -14,		/* drbd-8.2 only */
	SS_NEED_CONNECTION = -15,	/* drbd-8.2 only */
	SS_LOWER_THAN_OUTDATED = -16,
	SS_NOT_SUPPORTED = -17,		/* drbd-8.2 only */
	SS_IN_TRANSIENT_STATE = -18,	/* Retry after the next state change */
	SS_CONCURRENT_ST_CHG = -19,	/* Concurrent cluster side state change! */
	SS_AFTER_LAST_ERROR = -20,	/* Keep this at bottom */
};

/* from drbd_strings.c */
extern const char *drbd_conn_str(enum drbd_conns);
extern const char *drbd_role_str(enum drbd_role);
extern const char *drbd_disk_str(enum drbd_disk_state);
extern const char *drbd_set_st_err_str(enum drbd_state_ret_codes);

#define SHARED_SECRET_MAX 64

#define MDF_CONSISTENT		(1 << 0)
#define MDF_PRIMARY_IND		(1 << 1)
#define MDF_CONNECTED_IND	(1 << 2)
#define MDF_FULL_SYNC		(1 << 3)
#define MDF_WAS_UP_TO_DATE	(1 << 4)
#define MDF_PEER_OUT_DATED	(1 << 5)
#define MDF_CRASHED_PRIMARY	(1 << 6)

enum drbd_uuid_index {
	UI_CURRENT,
	UI_BITMAP,
	UI_HISTORY_START,
	UI_HISTORY_END,
	UI_SIZE,	/* nl-packet: number of dirty bits */
	UI_FLAGS,	/* nl-packet: flags */
	UI_EXTENDED_SIZE   /* Everything. */
};

enum drbd_timeout_flag {
	UT_DEFAULT       = 0,
	UT_DEGRADED      = 1,
	UT_PEER_OUTDATED = 2,
};

#define UUID_JUST_CREATED ((__u64)4)

#define DRBD_MAGIC 0x83740267
#define BE_DRBD_MAGIC __constant_cpu_to_be32(DRBD_MAGIC)

/* these are of type "int" */
#define DRBD_MD_INDEX_INTERNAL -1
#define DRBD_MD_INDEX_FLEX_EXT -2
#define DRBD_MD_INDEX_FLEX_INT -3

/* Start of the new netlink/connector stuff */

#define DRBD_NL_CREATE_DEVICE 0x01
#define DRBD_NL_SET_DEFAULTS  0x02

/* The following line should be moved over to linux/connector.h
 * when the time comes */
#ifndef CN_IDX_DRBD
# define CN_IDX_DRBD 0x4
/* Ubuntu "intrepid ibex" release defined CN_IDX_DRBD as 0x6 */
#endif
#define CN_VAL_DRBD 0x1

/* For searching a vacant cn_idx value */
#define CN_IDX_STEP 6977

struct drbd_nl_cfg_req {
	int packet_type;
	unsigned int drbd_minor;
	int flags;
	unsigned short tag_list[];
};

struct drbd_nl_cfg_reply {
	int packet_type;
	unsigned int minor;
	int ret_code; /* enum ret_code or set_st_err_t */
	unsigned short tag_list[]; /* only used with get_* calls */
};

#endif
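union drbd_state packs the whole per-device state (roles, connection state, disk states, pause flags) into one 32-bit word, which is why it can be transmitted as a single cpu_to_be32() and compared or assigned atomically. A minimal user-space sketch of the little-endian layout, assuming GCC's bit-field allocation (first declared member in the lowest bits); the type name `state` is illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified copy of the __LITTLE_ENDIAN_BITFIELD branch of
 * union drbd_state: bitfields and a plain 32-bit view overlaid. */
union state {
	struct {
		unsigned role:2;	/* bits 0-1 */
		unsigned peer:2;	/* bits 2-3 */
		unsigned conn:5;	/* bits 4-8 */
		unsigned disk:4;
		unsigned pdsk:4;
		unsigned susp:1;
		unsigned aftr_isp:1;
		unsigned peer_isp:1;
		unsigned user_isp:1;
		unsigned _pad:11;
	};
	uint32_t i;	/* the whole state as one word */
};
```

Setting `role` and `conn` through the bitfield view and reading back `.i` shows the packing directly; on a big-endian ABI the field offsets come out mirrored, which is exactly why the header carries both #if branches.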
+137
include/linux/drbd_limits.h
···
/*
  drbd_limits.h
  This file is part of DRBD by Philipp Reisner and Lars Ellenberg.
*/

/*
 * Our current limitations.
 * Some of them are hard limits,
 * some of them are arbitrary range limits, that make it easier to provide
 * feedback about nonsense settings for certain configurable values.
 */

#ifndef DRBD_LIMITS_H
#define DRBD_LIMITS_H 1

#define DEBUG_RANGE_CHECK 0

#define DRBD_MINOR_COUNT_MIN 1
#define DRBD_MINOR_COUNT_MAX 255

#define DRBD_DIALOG_REFRESH_MIN 0
#define DRBD_DIALOG_REFRESH_MAX 600

/* valid port number */
#define DRBD_PORT_MIN 1
#define DRBD_PORT_MAX 0xffff

/* startup { */
  /* if you want more than 3.4 days, disable */
#define DRBD_WFC_TIMEOUT_MIN 0
#define DRBD_WFC_TIMEOUT_MAX 300000
#define DRBD_WFC_TIMEOUT_DEF 0

#define DRBD_DEGR_WFC_TIMEOUT_MIN 0
#define DRBD_DEGR_WFC_TIMEOUT_MAX 300000
#define DRBD_DEGR_WFC_TIMEOUT_DEF 0

#define DRBD_OUTDATED_WFC_TIMEOUT_MIN 0
#define DRBD_OUTDATED_WFC_TIMEOUT_MAX 300000
#define DRBD_OUTDATED_WFC_TIMEOUT_DEF 0
/* } */

/* net { */
  /* timeout, unit centiseconds
   * more than one minute timeout is not useful */
#define DRBD_TIMEOUT_MIN 1
#define DRBD_TIMEOUT_MAX 600
#define DRBD_TIMEOUT_DEF 60		/* 6 seconds */

  /* active connection retries when C_WF_CONNECTION */
#define DRBD_CONNECT_INT_MIN 1
#define DRBD_CONNECT_INT_MAX 120
#define DRBD_CONNECT_INT_DEF 10		/* seconds */

  /* keep-alive probes when idle */
#define DRBD_PING_INT_MIN 1
#define DRBD_PING_INT_MAX 120
#define DRBD_PING_INT_DEF 10

  /* timeout for the ping packets. */
#define DRBD_PING_TIMEO_MIN 1
#define DRBD_PING_TIMEO_MAX 100
#define DRBD_PING_TIMEO_DEF 5

  /* max number of write requests between write barriers */
#define DRBD_MAX_EPOCH_SIZE_MIN 1
#define DRBD_MAX_EPOCH_SIZE_MAX 20000
#define DRBD_MAX_EPOCH_SIZE_DEF 2048

  /* I don't think that a tcp send buffer of more than 10M is useful */
#define DRBD_SNDBUF_SIZE_MIN 0
#define DRBD_SNDBUF_SIZE_MAX (10<<20)
#define DRBD_SNDBUF_SIZE_DEF (2*65535)

#define DRBD_RCVBUF_SIZE_MIN 0
#define DRBD_RCVBUF_SIZE_MAX (10<<20)
#define DRBD_RCVBUF_SIZE_DEF (2*65535)

  /* @4k PageSize -> 128kB - 512MB */
#define DRBD_MAX_BUFFERS_MIN 32
#define DRBD_MAX_BUFFERS_MAX 131072
#define DRBD_MAX_BUFFERS_DEF 2048

  /* @4k PageSize -> 4kB - 512MB */
#define DRBD_UNPLUG_WATERMARK_MIN 1
#define DRBD_UNPLUG_WATERMARK_MAX 131072
#define DRBD_UNPLUG_WATERMARK_DEF (DRBD_MAX_BUFFERS_DEF/16)

  /* 0 is disabled.
   * 200 should be more than enough even for very short timeouts */
#define DRBD_KO_COUNT_MIN 0
#define DRBD_KO_COUNT_MAX 200
#define DRBD_KO_COUNT_DEF 0
/* } */

/* syncer { */
  /* FIXME allow rate to be zero? */
#define DRBD_RATE_MIN 1
/* channel bonding 10 GbE, or other hardware */
#define DRBD_RATE_MAX (4 << 20)
#define DRBD_RATE_DEF 250		/* kb/second */

  /* less than 7 would hit performance unnecessarily.
   * 3833 is the largest prime that still fits
   * into 64 sectors of activity log */
#define DRBD_AL_EXTENTS_MIN 7
#define DRBD_AL_EXTENTS_MAX 3833
#define DRBD_AL_EXTENTS_DEF 127

#define DRBD_AFTER_MIN -1
#define DRBD_AFTER_MAX 255
#define DRBD_AFTER_DEF -1

/* } */

/* drbdsetup XY resize -d Z
 * you are free to reduce the device size to nothing, if you want to.
 * the upper limit with 64bit kernel, enough ram and flexible meta data
 * is 16 TB, currently. */
/* DRBD_MAX_SECTORS */
#define DRBD_DISK_SIZE_SECT_MIN 0
#define DRBD_DISK_SIZE_SECT_MAX (16 * (2LLU << 30))
#define DRBD_DISK_SIZE_SECT_DEF 0	/* = disabled = no user size... */

#define DRBD_ON_IO_ERROR_DEF EP_PASS_ON
#define DRBD_FENCING_DEF FP_DONT_CARE
#define DRBD_AFTER_SB_0P_DEF ASB_DISCONNECT
#define DRBD_AFTER_SB_1P_DEF ASB_DISCONNECT
#define DRBD_AFTER_SB_2P_DEF ASB_DISCONNECT
#define DRBD_RR_CONFLICT_DEF ASB_DISCONNECT

#define DRBD_MAX_BIO_BVECS_MIN 0
#define DRBD_MAX_BIO_BVECS_MAX 128
#define DRBD_MAX_BIO_BVECS_DEF 0

#undef RANGE
#endif
···
#ifndef DRBD_TAG_MAGIC_H
#define DRBD_TAG_MAGIC_H

#define TT_END     0
#define TT_REMOVED 0xE000

/* declare packet_type enums */
enum packet_types {
#define NL_PACKET(name, number, fields) P_ ## name = number,
#define NL_INTEGER(pn, pr, member)
#define NL_INT64(pn, pr, member)
#define NL_BIT(pn, pr, member)
#define NL_STRING(pn, pr, member, len)
#include "drbd_nl.h"
	P_nl_after_last_packet,
};

/* These structs are used to deduce the size of the tag lists: */
#define NL_PACKET(name, number, fields)	\
	struct name ## _tag_len_struct { fields };
#define NL_INTEGER(pn, pr, member)	\
	int member; int tag_and_len ## member;
#define NL_INT64(pn, pr, member)	\
	__u64 member; int tag_and_len ## member;
#define NL_BIT(pn, pr, member)		\
	unsigned char member:1; int tag_and_len ## member;
#define NL_STRING(pn, pr, member, len)	\
	unsigned char member[len]; int member ## _len; \
	int tag_and_len ## member;
#include "linux/drbd_nl.h"

/* declare tag-list-sizes */
static const int tag_list_sizes[] = {
#define NL_PACKET(name, number, fields)	2 fields ,
#define NL_INTEGER(pn, pr, member)	+ 4 + 4
#define NL_INT64(pn, pr, member)	+ 4 + 8
#define NL_BIT(pn, pr, member)		+ 4 + 1
#define NL_STRING(pn, pr, member, len)	+ 4 + (len)
#include "drbd_nl.h"
};

/* The two highest bits are used for the tag type */
#define TT_MASK      0xC000
#define TT_INTEGER   0x0000
#define TT_INT64     0x4000
#define TT_BIT       0x8000
#define TT_STRING    0xC000
/* The next bit indicates if processing of the tag is mandatory */
#define T_MANDATORY  0x2000
#define T_MAY_IGNORE 0x0000
#define TN_MASK      0x1fff
/* The remaining 13 bits are used to enumerate the tags */

#define tag_type(T)   ((T) & TT_MASK)
#define tag_number(T) ((T) & TN_MASK)

/* declare tag enums */
#define NL_PACKET(name, number, fields) fields
enum drbd_tags {
#define NL_INTEGER(pn, pr, member)     T_ ## member = pn | TT_INTEGER | pr ,
#define NL_INT64(pn, pr, member)       T_ ## member = pn | TT_INT64   | pr ,
#define NL_BIT(pn, pr, member)         T_ ## member = pn | TT_BIT     | pr ,
#define NL_STRING(pn, pr, member, len) T_ ## member = pn | TT_STRING  | pr ,
#include "drbd_nl.h"
};

struct tag {
	const char *name;
	int type_n_flags;
	int max_len;
};

/* declare tag names */
#define NL_PACKET(name, number, fields) fields
static const struct tag tag_descriptions[] = {
#define NL_INTEGER(pn, pr, member)     [ pn ] = { #member, TT_INTEGER | pr, sizeof(int) },
#define NL_INT64(pn, pr, member)       [ pn ] = { #member, TT_INT64   | pr, sizeof(__u64) },
#define NL_BIT(pn, pr, member)         [ pn ] = { #member, TT_BIT     | pr, sizeof(int) },
#define NL_STRING(pn, pr, member, len) [ pn ] = { #member, TT_STRING  | pr, (len) },
#include "drbd_nl.h"
};

#endif
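The tag encoding above fits a tag's type, "mandatory" flag, and number into one 16-bit value: the two highest bits carry the type, the next bit the mandatory flag, and the low 13 bits the tag number. A small self-contained sketch of composing and decomposing such a value, using the same constants and accessor macros; make_tag() is an illustrative helper, not part of the header:

```c
#include <assert.h>

/* tag layout constants from drbd_tag_magic.h */
#define TT_MASK      0xC000	/* two highest bits: tag type */
#define TT_INTEGER   0x0000
#define TT_INT64     0x4000
#define TT_BIT       0x8000
#define TT_STRING    0xC000
#define T_MANDATORY  0x2000	/* next bit: processing is mandatory */
#define T_MAY_IGNORE 0x0000
#define TN_MASK      0x1fff	/* low 13 bits: tag number */

#define tag_type(T)   ((T) & TT_MASK)
#define tag_number(T) ((T) & TN_MASK)

/* compose a tag value the way the NL_* macros do: number | type | flags */
static unsigned short make_tag(unsigned short number, unsigned short type,
			       unsigned short flags)
{
	return number | type | flags;
}
```

Because the fields occupy disjoint bit ranges, decoding is a plain mask, and a receiver that does not know a tag number can still honor the T_MANDATORY bit and refuse the packet.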
+294
include/linux/lru_cache.h
···
/*
   lru_cache.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

 */

#ifndef LRU_CACHE_H
#define LRU_CACHE_H

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/string.h> /* for memset */
#include <linux/seq_file.h>

/*
This header file (and its .c file; kernel-doc of functions see there)
  define a helper framework to easily keep track of index:label associations,
  and changes to an "active set" of objects, as well as pending transactions,
  to persistently record those changes.

  We use an LRU policy if it is necessary to "cool down" a region currently in
  the active set before we can "heat" a previously unused region.

  Because of this latter property, it is called "lru_cache".
  As it actually Tracks Objects in an Active SeT, we could also call it
  toast (incidentally that is what may happen to the data on the
  backend storage upon next resync, if we don't get it right).

What for?

We replicate IO (more or less synchronously) to local and remote disk.

For crash recovery after replication node failure,
  we need to resync all regions that have been target of in-flight WRITE IO
  (in use, or "hot", regions), as we don't know whether or not those WRITEs
  have made it to stable storage.

  To avoid a "full resync", we need to persistently track these regions.

  This is known as "write intent log", and can be implemented as an on-disk
  (coarse or fine grained) bitmap, or other meta data.

  To avoid the overhead of frequent extra writes to this meta data area,
  usually the condition is softened to regions that _may_ have been target of
  in-flight WRITE IO, e.g. by only lazily clearing the on-disk write-intent
  bitmap, trading frequency of meta data transactions against amount of
  (possibly unnecessary) resync traffic.

  If we set a hard limit on the area that may be "hot" at any given time, we
  limit the amount of resync traffic needed for crash recovery.

For recovery after replication link failure,
  we need to resync all blocks that have been changed on the other replica
  in the meantime, or, if both replicas have been changed independently [*],
  all blocks that have been changed on either replica in the meantime.
  [*] usually as a result of a cluster split-brain and insufficient protection,
  but there are valid use cases to do this on purpose.

  Tracking those blocks can be implemented as a "dirty bitmap".
  Having it fine-grained reduces the amount of resync traffic.
  It should also be persistent, to allow for reboots (or crashes)
  while the replication link is down.

There are various possible implementations for persistently storing
write intent log information, three of which are mentioned here.

"Chunk dirtying"
  The on-disk "dirty bitmap" may be re-used as "write-intent" bitmap as well.
  To reduce the frequency of bitmap updates for write-intent log purposes,
  one could dirty "chunks" (of some size) at a time of the (fine grained)
  on-disk bitmap, while keeping the in-memory "dirty" bitmap as clean as
  possible, flushing it to disk again when a previously "hot" (and on-disk
  dirtied as full chunk) area "cools down" again (no IO in flight anymore,
  and none expected in the near future either).

"Explicit (coarse) write intent bitmap"
  Another implementation could choose a (probably coarse) explicit bitmap,
  for write-intent log purposes, in addition to the fine grained dirty bitmap.

"Activity log"
  Yet another implementation may keep track of the hot regions, by starting
  with an empty set, and writing down a journal of region numbers that have
  become "hot", or have "cooled down" again.

  To be able to use a ring buffer for this journal of changes to the active
  set, we not only record the actual changes to that set, but also record the
  unchanging members of the set in a round robin fashion.  To do so, we use
  a fixed (but configurable) number of slots which we can identify by index,
  and associate region numbers (labels) with these indices.
  For each transaction recording a change to the active set, we record the
  change itself (index: -old_label, +new_label), and which index is associated
  with which label (index: current_label) within a certain sliding window that
  is moved further over the available indices with each such transaction.

  Thus, for crash recovery, if the ringbuffer is sufficiently large, we can
  accurately reconstruct the active set.

  Sufficiently large depends only on the maximum number of active objects, and
  the size of the sliding window recording "index: current_label" associations
  within each transaction.

  This is what we call the "activity log".

  Currently we need one activity log transaction per single label change, which
  does not give much benefit over the "dirty chunks of bitmap" approach, other
  than potentially fewer seeks.

  We plan to change the transaction format to support multiple changes per
  transaction, which then would reduce several (disjoint, "random") updates to
  the bitmap into one transaction to the activity log ring buffer.
*/

/* this defines an element in a tracked set
 * .colision is for hash table lookup.
 * When we process a new IO request, we know its sector, thus can deduce the
 * region number (label) easily.
To do the label -> object lookup without a137137+ * full list walk, we use a simple hash table.138138+ *139139+ * .list is on one of three lists:140140+ * in_use: currently in use (refcnt > 0, lc_number != LC_FREE)141141+ * lru: unused but ready to be reused or recycled142142+ * (ts_refcnt == 0, lc_number != LC_FREE),143143+ * free: unused but ready to be recycled144144+ * (ts_refcnt == 0, lc_number == LC_FREE),145145+ *146146+ * an element is said to be "in the active set",147147+ * if either on "in_use" or "lru", i.e. lc_number != LC_FREE.148148+ *149149+ * DRBD currently (May 2009) only uses 61 elements on the resync lru_cache150150+ * (total memory usage 2 pages), and up to 3833 elements on the act_log151151+ * lru_cache, totalling ~215 kB for 64bit architechture, ~53 pages.152152+ *153153+ * We usually do not actually free these objects again, but only "recycle"154154+ * them, as the change "index: -old_label, +LC_FREE" would need a transaction155155+ * as well. Which also means that using a kmem_cache to allocate the objects156156+ * from wastes some resources.157157+ * But it avoids high order page allocations in kmalloc.158158+ */159159+struct lc_element {160160+ struct hlist_node colision;161161+ struct list_head list; /* LRU list or free list */162162+ unsigned refcnt;163163+ /* back "pointer" into ts_cache->element[index],164164+ * for paranoia, and for "ts_element_to_index" */165165+ unsigned lc_index;166166+ /* if we want to track a larger set of objects,167167+ * it needs to become arch independend u64 */168168+ unsigned lc_number;169169+170170+ /* special label when on free list */171171+#define LC_FREE (~0U)172172+};173173+174174+struct lru_cache {175175+ /* the least recently used item is kept at lru->prev */176176+ struct list_head lru;177177+ struct list_head free;178178+ struct list_head in_use;179179+180180+ /* the pre-created kmem cache to allocate the objects from */181181+ struct kmem_cache *lc_cache;182182+183183+ /* size of tracked 
objects, used to memset(,0,) them in lc_reset */184184+ size_t element_size;185185+ /* offset of struct lc_element member in the tracked object */186186+ size_t element_off;187187+188188+ /* number of elements (indices) */189189+ unsigned int nr_elements;190190+ /* Arbitrary limit on maximum tracked objects. Practical limit is much191191+ * lower due to allocation failures, probably. For typical use cases,192192+ * nr_elements should be a few thousand at most.193193+ * This also limits the maximum value of ts_element.ts_index, allowing the194194+ * 8 high bits of .ts_index to be overloaded with flags in the future. */195195+#define LC_MAX_ACTIVE (1<<24)196196+197197+ /* statistics */198198+ unsigned used; /* number of lelements currently on in_use list */199199+ unsigned long hits, misses, starving, dirty, changed;200200+201201+ /* see below: flag-bits for lru_cache */202202+ unsigned long flags;203203+204204+ /* when changing the label of an index element */205205+ unsigned int new_number;206206+207207+ /* for paranoia when changing the label of an index element */208208+ struct lc_element *changing_element;209209+210210+ void *lc_private;211211+ const char *name;212212+213213+ /* nr_elements there */214214+ struct hlist_head *lc_slot;215215+ struct lc_element **lc_element;216216+};217217+218218+219219+/* flag-bits for lru_cache */220220+enum {221221+ /* debugging aid, to catch concurrent access early.222222+ * user needs to guarantee exclusive access by proper locking! 
*/223223+ __LC_PARANOIA,224224+ /* if we need to change the set, but currently there is a changing225225+ * transaction pending, we are "dirty", and must deferr further226226+ * changing requests */227227+ __LC_DIRTY,228228+ /* if we need to change the set, but currently there is no free nor229229+ * unused element available, we are "starving", and must not give out230230+ * further references, to guarantee that eventually some refcnt will231231+ * drop to zero and we will be able to make progress again, changing232232+ * the set, writing the transaction.233233+ * if the statistics say we are frequently starving,234234+ * nr_elements is too small. */235235+ __LC_STARVING,236236+};237237+#define LC_PARANOIA (1<<__LC_PARANOIA)238238+#define LC_DIRTY (1<<__LC_DIRTY)239239+#define LC_STARVING (1<<__LC_STARVING)240240+241241+extern struct lru_cache *lc_create(const char *name, struct kmem_cache *cache,242242+ unsigned e_count, size_t e_size, size_t e_off);243243+extern void lc_reset(struct lru_cache *lc);244244+extern void lc_destroy(struct lru_cache *lc);245245+extern void lc_set(struct lru_cache *lc, unsigned int enr, int index);246246+extern void lc_del(struct lru_cache *lc, struct lc_element *element);247247+248248+extern struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr);249249+extern struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr);250250+extern struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr);251251+extern unsigned int lc_put(struct lru_cache *lc, struct lc_element *e);252252+extern void lc_changed(struct lru_cache *lc, struct lc_element *e);253253+254254+struct seq_file;255255+extern size_t lc_seq_printf_stats(struct seq_file *seq, struct lru_cache *lc);256256+257257+extern void lc_seq_dump_details(struct seq_file *seq, struct lru_cache *lc, char *utext,258258+ void (*detail) (struct seq_file *, struct lc_element *));259259+260260+/**261261+ * lc_try_lock - can be used to stop lc_get() from changing the 
tracked set262262+ * @lc: the lru cache to operate on263263+ *264264+ * Note that the reference counts and order on the active and lru lists may265265+ * still change. Returns true if we aquired the lock.266266+ */267267+static inline int lc_try_lock(struct lru_cache *lc)268268+{269269+ return !test_and_set_bit(__LC_DIRTY, &lc->flags);270270+}271271+272272+/**273273+ * lc_unlock - unlock @lc, allow lc_get() to change the set again274274+ * @lc: the lru cache to operate on275275+ */276276+static inline void lc_unlock(struct lru_cache *lc)277277+{278278+ clear_bit(__LC_DIRTY, &lc->flags);279279+ smp_mb__after_clear_bit();280280+}281281+282282+static inline int lc_is_used(struct lru_cache *lc, unsigned int enr)283283+{284284+ struct lc_element *e = lc_find(lc, enr);285285+ return e && e->refcnt;286286+}287287+288288+#define lc_entry(ptr, type, member) \289289+ container_of(ptr, type, member)290290+291291+extern struct lc_element *lc_element_by_index(struct lru_cache *lc, unsigned i);292292+extern unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e);293293+294294+#endif
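The lc_get() / lc_changed() / lc_put() protocol declared above is easiest to see end-to-end outside the kernel. The following is an illustrative user-space C model of the same state machine under invented names (struct model, model_get(), and so on); it is not the kernel API, and it deliberately omits the hash table and list plumbing — only the refcnt / LRU / dirty / starving behaviour is modeled.

```c
/* Toy user-space model of the lru_cache label-change protocol.
 * All names are invented for this example; this is not the kernel API. */
#include <assert.h>
#include <string.h>

#define SLOTS       2
#define MODEL_FREE  (~0U)	/* plays the role of LC_FREE */

struct model {
	unsigned label[SLOTS];	/* label per slot, MODEL_FREE if unused */
	unsigned refcnt[SLOTS];
	unsigned stamp[SLOTS];	/* LRU age: higher = more recently used */
	unsigned clock;
	int dirty;		/* a label change is pending (__LC_DIRTY) */
	unsigned old_label;	/* the pending transaction: old label ... */
	unsigned new_label;	/* ... and its replacement */
};

static void model_init(struct model *m)
{
	int i;
	memset(m, 0, sizeof(*m));
	for (i = 0; i < SLOTS; i++)
		m->label[i] = MODEL_FREE;
}

/* like lc_get(): returns a slot index, or -1 ("dirty" or "starving");
 * *needs_commit tells the caller to record the change via model_changed() */
static int model_get(struct model *m, unsigned enr, int *needs_commit)
{
	int i, victim = -1;

	*needs_commit = 0;
	for (i = 0; i < SLOTS; i++) {
		if (m->label[i] == enr) {	/* hit: touch and reference */
			m->refcnt[i]++;
			m->stamp[i] = ++m->clock;
			return i;
		}
	}
	if (m->dirty)				/* a change is still pending */
		return -1;
	/* prefer a free slot, else evict the least recently used idle slot */
	for (i = 0; i < SLOTS; i++)
		if (m->refcnt[i] == 0 && m->label[i] == MODEL_FREE) {
			victim = i;
			break;
		}
	if (victim < 0)
		for (i = 0; i < SLOTS; i++)
			if (m->refcnt[i] == 0 &&
			    (victim < 0 || m->stamp[i] < m->stamp[victim]))
				victim = i;
	if (victim < 0)
		return -1;			/* "starving": all referenced */
	m->dirty = 1;
	m->old_label = m->label[victim];
	m->new_label = enr;
	m->label[victim] = enr;
	m->refcnt[victim] = 1;
	m->stamp[victim] = ++m->clock;
	*needs_commit = 1;
	return victim;
}

/* like lc_changed(): the user has recorded the transaction; clear dirty */
static void model_changed(struct model *m)
{
	assert(m->dirty);
	m->dirty = 0;
}

/* like lc_put(): drop a reference; at refcnt 0 the slot becomes evictable */
static void model_put(struct model *m, int i)
{
	assert(m->refcnt[i]);
	m->refcnt[i]--;
}
```

A miss hands back a recycled slot and marks the model "dirty" until the caller commits the label change, mirroring the requirement in the header comment that lc_changed() must be called before the next change to the set.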
···
/*
   lru_cache.c

   This file is part of DRBD by Philipp Reisner and Lars Ellenberg.

   Copyright (C) 2003-2008, LINBIT Information Technologies GmbH.
   Copyright (C) 2003-2008, Philipp Reisner <philipp.reisner@linbit.com>.
   Copyright (C) 2003-2008, Lars Ellenberg <lars.ellenberg@linbit.com>.

   drbd is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2, or (at your option)
   any later version.

   drbd is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with drbd; see the file COPYING.  If not, write to
   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.

 */

#include <linux/module.h>
#include <linux/bitops.h>
#include <linux/slab.h>
#include <linux/string.h> /* for memset */
#include <linux/seq_file.h> /* for seq_printf */
#include <linux/lru_cache.h>

MODULE_AUTHOR("Philipp Reisner <phil@linbit.com>, "
	      "Lars Ellenberg <lars@linbit.com>");
MODULE_DESCRIPTION("lru_cache - Track sets of hot objects");
MODULE_LICENSE("GPL");

/* this is a developer's aid only.
 * it catches concurrent access (lack of locking on the user's part) */
#define PARANOIA_ENTRY() do {		\
	BUG_ON(!lc);			\
	BUG_ON(!lc->nr_elements);	\
	BUG_ON(test_and_set_bit(__LC_PARANOIA, &lc->flags)); \
} while (0)

#define RETURN(x...)     do { \
	clear_bit(__LC_PARANOIA, &lc->flags); \
	smp_mb__after_clear_bit(); return x ; } while (0)

/* BUG() if e is not one of the elements tracked by lc */
#define PARANOIA_LC_ELEMENT(lc, e) do {	\
	struct lru_cache *lc_ = (lc);	\
	struct lc_element *e_ = (e);	\
	unsigned i = e_->lc_index;	\
	BUG_ON(i >= lc_->nr_elements);	\
	BUG_ON(lc_->lc_element[i] != e_); } while (0)

/**
 * lc_create - prepares to track objects in an active set
 * @name: descriptive name only used in lc_seq_printf_stats and lc_seq_dump_details
 * @cache: pre-created kmem_cache to allocate the tracked objects from
 * @e_count: number of elements allowed to be active simultaneously
 * @e_size: size of the tracked objects
 * @e_off: offset to the &struct lc_element member in a tracked object
 *
 * Returns a pointer to a newly initialized struct lru_cache on success,
 * or NULL on (allocation) failure.
 */
struct lru_cache *lc_create(const char *name, struct kmem_cache *cache,
		unsigned e_count, size_t e_size, size_t e_off)
{
	struct hlist_head *slot = NULL;
	struct lc_element **element = NULL;
	struct lru_cache *lc;
	struct lc_element *e;
	unsigned cache_obj_size = kmem_cache_size(cache);
	unsigned i;

	WARN_ON(cache_obj_size < e_size);
	if (cache_obj_size < e_size)
		return NULL;

	/* e_count too big; would probably fail the allocation below anyway.
	 * for typical use cases, e_count should be a few thousand at most. */
	if (e_count > LC_MAX_ACTIVE)
		return NULL;

	slot = kzalloc(e_count * sizeof(struct hlist_head), GFP_KERNEL);
	if (!slot)
		goto out_fail;
	element = kzalloc(e_count * sizeof(struct lc_element *), GFP_KERNEL);
	if (!element)
		goto out_fail;

	lc = kzalloc(sizeof(*lc), GFP_KERNEL);
	if (!lc)
		goto out_fail;

	INIT_LIST_HEAD(&lc->in_use);
	INIT_LIST_HEAD(&lc->lru);
	INIT_LIST_HEAD(&lc->free);

	lc->name = name;
	lc->element_size = e_size;
	lc->element_off = e_off;
	lc->nr_elements = e_count;
	lc->new_number = LC_FREE;
	lc->lc_cache = cache;
	lc->lc_element = element;
	lc->lc_slot = slot;

	/* preallocate all objects */
	for (i = 0; i < e_count; i++) {
		void *p = kmem_cache_alloc(cache, GFP_KERNEL);
		if (!p)
			break;
		memset(p, 0, lc->element_size);
		e = p + e_off;
		e->lc_index = i;
		e->lc_number = LC_FREE;
		list_add(&e->list, &lc->free);
		element[i] = e;
	}
	if (i == e_count)
		return lc;

	/* else: could not allocate all elements, give up.
	 * i is unsigned, so count down to (and including) element 0 */
	while (i) {
		void *p = element[--i];
		kmem_cache_free(cache, p - e_off);
	}
	kfree(lc);
out_fail:
	kfree(element);
	kfree(slot);
	return NULL;
}

static void lc_free_by_index(struct lru_cache *lc, unsigned i)
{
	void *p = lc->lc_element[i];
	WARN_ON(!p);
	if (p) {
		p -= lc->element_off;
		kmem_cache_free(lc->lc_cache, p);
	}
}

/**
 * lc_destroy - frees memory allocated by lc_create()
 * @lc: the lru cache to destroy
 */
void lc_destroy(struct lru_cache *lc)
{
	unsigned i;
	if (!lc)
		return;
	for (i = 0; i < lc->nr_elements; i++)
		lc_free_by_index(lc, i);
	kfree(lc->lc_element);
	kfree(lc->lc_slot);
	kfree(lc);
}

/**
 * lc_reset - does a full reset for @lc and the hash table slots.
 * @lc: the lru cache to operate on
 *
 * It is roughly the equivalent of re-allocating a fresh lru_cache object,
 * basically a shortcut to lc_destroy(lc); lc = lc_create(...);
 */
void lc_reset(struct lru_cache *lc)
{
	unsigned i;

	INIT_LIST_HEAD(&lc->in_use);
	INIT_LIST_HEAD(&lc->lru);
	INIT_LIST_HEAD(&lc->free);
	lc->used = 0;
	lc->hits = 0;
	lc->misses = 0;
	lc->starving = 0;
	lc->dirty = 0;
	lc->changed = 0;
	lc->flags = 0;
	lc->changing_element = NULL;
	lc->new_number = LC_FREE;
	memset(lc->lc_slot, 0, sizeof(struct hlist_head) * lc->nr_elements);

	for (i = 0; i < lc->nr_elements; i++) {
		struct lc_element *e = lc->lc_element[i];
		void *p = e;
		p -= lc->element_off;
		memset(p, 0, lc->element_size);
		/* re-init it */
		e->lc_index = i;
		e->lc_number = LC_FREE;
		list_add(&e->list, &lc->free);
	}
}

/**
 * lc_seq_printf_stats - print stats about @lc into @seq
 * @seq: the seq_file to print into
 * @lc: the lru cache to print statistics of
 */
size_t lc_seq_printf_stats(struct seq_file *seq, struct lru_cache *lc)
{
	/* NOTE:
	 * total calls to lc_get are
	 * (starving + hits + misses)
	 * misses include "dirty" count (update from another thread in
	 * progress) and "changed", when this in fact led to a successful
	 * update of the cache.
	 */
	return seq_printf(seq, "\t%s: used:%u/%u "
		"hits:%lu misses:%lu starving:%lu dirty:%lu changed:%lu\n",
		lc->name, lc->used, lc->nr_elements,
		lc->hits, lc->misses, lc->starving, lc->dirty, lc->changed);
}

static struct hlist_head *lc_hash_slot(struct lru_cache *lc, unsigned int enr)
{
	return lc->lc_slot + (enr % lc->nr_elements);
}


/**
 * lc_find - find element by label, if present in the hash table
 * @lc: The lru_cache object
 * @enr: element number
 *
 * Returns the pointer to an element, if the element with the requested
 * "label" or element number is present in the hash table,
 * or NULL if not found.  Does not change the refcnt.
 */
struct lc_element *lc_find(struct lru_cache *lc, unsigned int enr)
{
	struct hlist_node *n;
	struct lc_element *e;

	BUG_ON(!lc);
	BUG_ON(!lc->nr_elements);
	hlist_for_each_entry(e, n, lc_hash_slot(lc, enr), colision) {
		if (e->lc_number == enr)
			return e;
	}
	return NULL;
}

/* returned element will be "recycled" immediately */
static struct lc_element *lc_evict(struct lru_cache *lc)
{
	struct list_head *n;
	struct lc_element *e;

	if (list_empty(&lc->lru))
		return NULL;

	n = lc->lru.prev;
	e = list_entry(n, struct lc_element, list);

	PARANOIA_LC_ELEMENT(lc, e);

	list_del(&e->list);
	hlist_del(&e->colision);
	return e;
}

/**
 * lc_del - removes an element from the cache
 * @lc: The lru_cache object
 * @e: The element to remove
 *
 * @e must be unused (refcnt == 0).  Moves @e from "lru" to "free" list,
 * sets @e->lc_number to %LC_FREE.
 */
void lc_del(struct lru_cache *lc, struct lc_element *e)
{
	PARANOIA_ENTRY();
	PARANOIA_LC_ELEMENT(lc, e);
	BUG_ON(e->refcnt);

	e->lc_number = LC_FREE;
	hlist_del_init(&e->colision);
	list_move(&e->list, &lc->free);
	RETURN();
}

static struct lc_element *lc_get_unused_element(struct lru_cache *lc)
{
	struct list_head *n;

	if (list_empty(&lc->free))
		return lc_evict(lc);

	n = lc->free.next;
	list_del(n);
	return list_entry(n, struct lc_element, list);
}

static int lc_unused_element_available(struct lru_cache *lc)
{
	if (!list_empty(&lc->free))
		return 1; /* something on the free list */
	if (!list_empty(&lc->lru))
		return 1; /* something to evict */

	return 0;
}


/**
 * lc_get - get element by label, maybe change the active set
 * @lc: the lru cache to operate on
 * @enr: the label to look up
 *
 * Finds an element in the cache, increases its usage count,
 * "touches" and returns it.
 *
 * In case the requested number is not present, it needs to be added to the
 * cache.  Therefore it is possible that another element is evicted from
 * the cache.  In either case, the user is notified, so he is able to e.g.
 * keep a persistent log of the cache changes, and therefore the objects in
 * use.
 *
 * Return values:
 *  NULL
 *     The cache was marked %LC_STARVING,
 *     or the requested label was not in the active set
 *     and a changing transaction is still pending (@lc was marked %LC_DIRTY).
 *     Or no unused or free element could be recycled (@lc will be marked as
 *     %LC_STARVING, blocking further lc_get() operations).
 *
 *  pointer to the element with the REQUESTED element number.
 *     In this case, it can be used right away.
 *
 *  pointer to an UNUSED element with some different element number,
 *     where that different number may also be %LC_FREE.
 *
 *     In this case, the cache is marked %LC_DIRTY (blocking further changes),
 *     and the returned element pointer is removed from the lru list and
 *     hash collision chains.  The user now should do whatever housekeeping
 *     is necessary.
 *     Then he must call lc_changed(lc, element_pointer), to finish
 *     the change.
 *
 * NOTE: The user needs to check the lc_number on EACH use, so he recognizes
 *       any cache set change.
 */
struct lc_element *lc_get(struct lru_cache *lc, unsigned int enr)
{
	struct lc_element *e;

	PARANOIA_ENTRY();
	if (lc->flags & LC_STARVING) {
		++lc->starving;
		RETURN(NULL);
	}

	e = lc_find(lc, enr);
	if (e) {
		++lc->hits;
		if (e->refcnt++ == 0)
			lc->used++;
		list_move(&e->list, &lc->in_use); /* Not evictable... */
		RETURN(e);
	}

	++lc->misses;

	/* In case there is nothing available and we can not kick out
	 * the LRU element, we have to wait ...
	 */
	if (!lc_unused_element_available(lc)) {
		__set_bit(__LC_STARVING, &lc->flags);
		RETURN(NULL);
	}

	/* it was not present in the active set.
	 * we are going to recycle an unused (or even "free") element.
	 * user may need to commit a transaction to record that change.
	 * we serialize on the __LC_DIRTY flag bit */
	if (test_and_set_bit(__LC_DIRTY, &lc->flags)) {
		++lc->dirty;
		RETURN(NULL);
	}

	e = lc_get_unused_element(lc);
	BUG_ON(!e);

	clear_bit(__LC_STARVING, &lc->flags);
	BUG_ON(++e->refcnt != 1);
	lc->used++;

	lc->changing_element = e;
	lc->new_number = enr;

	RETURN(e);
}

/* similar to lc_get,
 * but only gets a new reference on an existing element.
 * you either get the requested element, or NULL.
 * will be consolidated into one function.
 */
struct lc_element *lc_try_get(struct lru_cache *lc, unsigned int enr)
{
	struct lc_element *e;

	PARANOIA_ENTRY();
	if (lc->flags & LC_STARVING) {
		++lc->starving;
		RETURN(NULL);
	}

	e = lc_find(lc, enr);
	if (e) {
		++lc->hits;
		if (e->refcnt++ == 0)
			lc->used++;
		list_move(&e->list, &lc->in_use); /* Not evictable... */
	}
	RETURN(e);
}

/**
 * lc_changed - tell @lc that the change has been recorded
 * @lc: the lru cache to operate on
 * @e: the element pending label change
 */
void lc_changed(struct lru_cache *lc, struct lc_element *e)
{
	PARANOIA_ENTRY();
	BUG_ON(e != lc->changing_element);
	PARANOIA_LC_ELEMENT(lc, e);
	++lc->changed;
	e->lc_number = lc->new_number;
	list_add(&e->list, &lc->in_use);
	hlist_add_head(&e->colision, lc_hash_slot(lc, lc->new_number));
	lc->changing_element = NULL;
	lc->new_number = LC_FREE;
	clear_bit(__LC_DIRTY, &lc->flags);
	smp_mb__after_clear_bit();
	RETURN();
}


/**
 * lc_put - give up refcnt of @e
 * @lc: the lru cache to operate on
 * @e: the element to put
 *
 * If refcnt reaches zero, the element is moved to the lru list,
 * and %LC_STARVING (if set) is cleared.
 * Returns the new (post-decrement) refcnt.
 */
unsigned int lc_put(struct lru_cache *lc, struct lc_element *e)
{
	PARANOIA_ENTRY();
	PARANOIA_LC_ELEMENT(lc, e);
	BUG_ON(e->refcnt == 0);
	BUG_ON(e == lc->changing_element);
	if (--e->refcnt == 0) {
		/* move it to the front of LRU. */
		list_move(&e->list, &lc->lru);
		lc->used--;
		clear_bit(__LC_STARVING, &lc->flags);
		smp_mb__after_clear_bit();
	}
	RETURN(e->refcnt);
}

/**
 * lc_element_by_index
 * @lc: the lru cache to operate on
 * @i: the index of the element to return
 */
struct lc_element *lc_element_by_index(struct lru_cache *lc, unsigned i)
{
	BUG_ON(i >= lc->nr_elements);
	BUG_ON(lc->lc_element[i] == NULL);
	BUG_ON(lc->lc_element[i]->lc_index != i);
	return lc->lc_element[i];
}

/**
 * lc_index_of
 * @lc: the lru cache to operate on
 * @e: the element to query for its index position in lc->lc_element
 */
unsigned int lc_index_of(struct lru_cache *lc, struct lc_element *e)
{
	PARANOIA_LC_ELEMENT(lc, e);
	return e->lc_index;
}

/**
 * lc_set - associate index with label
 * @lc: the lru cache to operate on
 * @enr: the label to set
 * @index: the element index to associate label with.
 *
 * Used to initialize the active set to some previously recorded state.
 */
void lc_set(struct lru_cache *lc, unsigned int enr, int index)
{
	struct lc_element *e;

	if (index < 0 || index >= lc->nr_elements)
		return;

	e = lc_element_by_index(lc, index);
	e->lc_number = enr;

	hlist_del_init(&e->colision);
	hlist_add_head(&e->colision, lc_hash_slot(lc, enr));
	list_move(&e->list, e->refcnt ? &lc->in_use : &lc->lru);
}

/**
 * lc_seq_dump_details - Dump a complete LRU cache to seq in textual form.
 * @seq: the &struct seq_file pointer to seq_printf into
 * @lc: the lru cache to operate on
 * @utext: user supplied "heading" or other info
 * @detail: function pointer the user may provide to dump further details
 * of the object the lc_element is embedded in.
 */
void lc_seq_dump_details(struct seq_file *seq, struct lru_cache *lc, char *utext,
	     void (*detail) (struct seq_file *, struct lc_element *))
{
	unsigned int nr_elements = lc->nr_elements;
	struct lc_element *e;
	unsigned int i;

	seq_printf(seq, "\tnn: lc_number refcnt %s\n ", utext);
	for (i = 0; i < nr_elements; i++) {
		e = lc_element_by_index(lc, i);
		if (e->lc_number == LC_FREE) {
			seq_printf(seq, "\t%2u: FREE\n", i);
		} else {
			seq_printf(seq, "\t%2u: %4u %4u ", i,
				   e->lc_number, e->refcnt);
			detail(seq, e);
		}
	}
}

EXPORT_SYMBOL(lc_create);
EXPORT_SYMBOL(lc_reset);
EXPORT_SYMBOL(lc_destroy);
EXPORT_SYMBOL(lc_set);
EXPORT_SYMBOL(lc_del);
EXPORT_SYMBOL(lc_try_get);
EXPORT_SYMBOL(lc_find);
EXPORT_SYMBOL(lc_get);
EXPORT_SYMBOL(lc_put);
EXPORT_SYMBOL(lc_changed);
EXPORT_SYMBOL(lc_element_by_index);
EXPORT_SYMBOL(lc_index_of);
EXPORT_SYMBOL(lc_seq_printf_stats);
EXPORT_SYMBOL(lc_seq_dump_details);
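The "activity log" scheme sketched in the lru_cache.h comment — each transaction records one change plus a sliding window of unchanged index:label pairs, so that a sufficiently large tail of the ring buffer reconstructs the whole active set — can be sanity-checked with a small user-space model. All sizes, names, and the transaction layout below are invented for illustration; the actual DRBD on-disk transaction format is not part of this patch.

```c
/* Toy model of the activity-log recovery idea: record() appends one slot
 * change plus WINDOW context pairs (captured round robin, before the change
 * is applied); replay() reconstructs the slots from a tail of the journal.
 * Everything here is invented for illustration. */
#include <assert.h>
#include <string.h>

#define NR_SLOTS 4
#define WINDOW   2	/* context pairs recorded per transaction */
#define MAX_TXN  16

struct txn {
	unsigned idx, label;			/* the change itself */
	unsigned c_idx[WINDOW];			/* unchanged members ... */
	unsigned c_label[WINDOW];		/* ... and their labels */
};

static struct txn txlog[MAX_TXN];
static unsigned nr_txn, window_pos;

/* append one transaction; the context window is captured before the change */
static void record(unsigned *slots, unsigned idx, unsigned label)
{
	struct txn *t = &txlog[nr_txn++];
	int k;

	for (k = 0; k < WINDOW; k++) {
		unsigned i = (window_pos + k) % NR_SLOTS;
		t->c_idx[k] = i;
		t->c_label[k] = slots[i];
	}
	window_pos = (window_pos + WINDOW) % NR_SLOTS;
	t->idx = idx;
	t->label = label;
	slots[idx] = label;
}

/* crash recovery: replay newest-first; the first writer per slot wins,
 * so a transaction's own change shadows any older (stale) context pair */
static void replay(unsigned first, unsigned count, unsigned *slots)
{
	int seen[NR_SLOTS] = { 0 };
	unsigned n;
	int k;

	for (n = count; n-- > 0; ) {
		struct txn *t = &txlog[first + n];
		if (!seen[t->idx]) {
			slots[t->idx] = t->label;
			seen[t->idx] = 1;
		}
		for (k = 0; k < WINDOW; k++) {
			unsigned i = t->c_idx[k];
			if (!seen[i]) {
				slots[i] = t->c_label[k];
				seen[i] = 1;
			}
		}
	}
}
```

Replaying newest-first with "first writer wins" is what makes the round-robin context pairs safe: a pair captured before a later change is stale, but the later change is always processed first and shadows it, which is why the header comment only needs the ring buffer tail to be "sufficiently large" relative to the set size and window size.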