From b8d834488fd7c0c5a79cd2bab112c37a3d3292b9 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 10 Jun 2014 16:31:01 -0500 Subject: [PATCH 01/58] md-cluster: Design Documentation Signed-off-by: Goldwyn Rodrigues --- Documentation/md-cluster.txt | 176 +++++++++++++++++++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 Documentation/md-cluster.txt diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt new file mode 100644 index 000000000000..de1af7db3355 --- /dev/null +++ b/Documentation/md-cluster.txt @@ -0,0 +1,176 @@ +The cluster MD is a shared-device RAID for a cluster. + + +1. On-disk format + +Separate write-intent-bitmap are used for each cluster node. +The bitmaps record all writes that may have been started on that node, +and may not yet have finished. The on-disk layout is: + +0 4k 8k 12k +------------------------------------------------------------------- +| idle | md super | bm super [0] + bits | +| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | +| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | +| bm bits [3, contd] | | | + +During "normal" functioning we assume the filesystem ensures that only one +node writes to any given block at a time, so a write +request will + - set the appropriate bit (if not already set) + - commit the write to all mirrors + - schedule the bit to be cleared after a timeout. + +Reads are just handled normally. It is up to the filesystem to +ensure one node doesn't read from a location where another node (or the same +node) is writing. + + +2. DLM Locks for management + +There are two locks for managing the device: + +2.1 Bitmap lock resource (bm_lockres) + + The bm_lockres protects individual node bitmaps. They are named in the + form bitmap001 for node 1, bitmap002 for node and so on. When a node + joins the cluster, it acquires the lock in PW mode and it stays so + during the lifetime the node is part of the cluster. The lock resource + number is based on the slot number returned by the DLM subsystem. Since + DLM starts node count from one and bitmap slots start from zero, one is + subtracted from the DLM slot number to arrive at the bitmap slot number. + +3. Communication + +Each node has to communicate with other nodes when starting or ending +resync, and metadata superblock updates. + +3.1 Message Types + + There are 3 types, of messages which are passed + + 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been + updated, and the node must re-read the md superblock. This is performed + synchronously. + + 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended + so that each node may suspend or resume the region. + +3.2 Communication mechanism + + The DLM LVB is used to communicate within nodes of the cluster. There + are three resources used for the purpose: + + 3.2.1 Token: The resource which protects the entire communication + system. The node having the token resource is allowed to + communicate. + + 3.2.2 Message: The lock resource which carries the data to + communicate. + + 3.2.3 Ack: The resource, acquiring which means the message has been + acknowledged by all nodes in the cluster. The BAST of the resource + is used to inform the receive node that a node wants to communicate. + +The algorithm is: + + 1. receive status + + sender receiver receiver + ACK:CR ACK:CR ACK:CR + + 2. sender get EX of TOKEN + sender get EX of MESSAGE + sender receiver receiver + TOKEN:EX ACK:CR ACK:CR + MESSAGE:EX + ACK:CR + + Sender checks that it still needs to send a message. Messages received + or other events that happened while waiting for the TOKEN may have made + this message inappropriate or redundant. + + 3. sender write LVB. + sender down-convert MESSAGE from EX to CR + sender try to get EX of ACK + [ wait until all receiver has *processed* the MESSAGE ] + + [ triggered by bast of ACK ] + receiver get CR of MESSAGE + receiver read LVB + receiver processes the message + [ wait finish ] + receiver release ACK + + sender receiver receiver + TOKEN:EX MESSAGE:CR MESSAGE:CR + MESSAGE:CR + ACK:EX + + 4. triggered by grant of EX on ACK (indicating all receivers have processed + message) + sender down-convert ACK from EX to CR + sender release MESSAGE + sender release TOKEN + receiver upconvert to EX of MESSAGE + receiver get CR of ACK + receiver release MESSAGE + + sender receiver receiver + ACK:CR ACK:CR ACK:CR + + +4. Handling Failures + +4.1 Node Failure + When a node fails, the DLM informs the cluster with the slot. The node + starts a cluster recovery thread. The cluster recovery thread: + - acquires the bitmap lock of the failed node + - opens the bitmap + - reads the bitmap of the failed node + - copies the set bitmap to local node + - cleans the bitmap of the failed node + - releases bitmap lock of the failed node + - initiates resync of the bitmap on the current node + + The resync process, is the regular md resync. However, in a clustered + environment when a resync is performed, it needs to tell other nodes + of the areas which are suspended. Before a resync starts, the node + send out RESYNC_START with the (lo,hi) range of the area which needs + to be suspended. Each node maintains a suspend_list, which contains + the list of ranges which are currently suspended. On receiving + RESYNC_START, the node adds the range to the suspend_list. Similarly, + when the node performing resync finishes, it send RESYNC_FINISHED + to other nodes and other nodes remove the corresponding entry from + the suspend_list. + + A helper function, should_suspend() can be used to check if a particular + I/O range should be suspended or not. + +4.2 Device Failure + Device failures are handled and communicated with the metadata update + routine. + +5. Adding a new Device +For adding a new device, it is necessary that all nodes "see" the new device +to be added. For this, the following algorithm is used: + + 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues + ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) + 2. Node 1 sends NEWDISK with uuid and slot number + 3. Other nodes issue kobject_uevent_env with uuid and slot number + (Steps 4,5 could be a udev rule) + 4. In userspace, the node searches for the disk, perhaps + using blkid -t SUB_UUID="" + 5. Other nodes issue either of the following depending on whether the disk + was found: + ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and + disc.number set to slot number) + ioctl(CLUSTERED_DISK_NACK) + 6. Other nodes drop lock on no-new-devs (CR) if device is found + 7. Node 1 attempts EX lock on no-new-devs + 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk + as SpareLocal + 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED + 10. Other nodes get the information whether a disk is added or not + by the following METADATA_UPDATED. From 183bdf5106af069a774146c4f910a80ad1f57485 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 7 Mar 2014 10:30:43 -0600 Subject: [PATCH 02/58] Add number of nodes to bitmap structure for clustering Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index 30210b9c4ef9..f4e53eadb083 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -131,7 +131,8 @@ typedef struct bitmap_super_s { __le32 sectors_reserved; /* 64 number of 512-byte sectors that are * reserved for the bitmap. */ - __u8 pad[256 - 68]; /* set to zero */ + __le32 nodes; /* 68 the maximum number of nodes in cluster. */ + __u8 pad[256 - 72]; /* set to zero */ } bitmap_super_t; /* notes: From 8e854e9cfd1cc3837b4bd96643d5174a72d9f741 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 7 Mar 2014 11:21:15 -0600 Subject: [PATCH 03/58] Create a separate module for clustering support Tagged as EXPERIMENTAL for now. Signed-off-by: Goldwyn Rodrigues --- drivers/md/Kconfig | 16 ++++++++++++++++ drivers/md/Makefile | 1 + drivers/md/md-cluster.c | 28 ++++++++++++++++++++++++++++ 3 files changed, 45 insertions(+) create mode 100644 drivers/md/md-cluster.c diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 63e05e32b462..eed1fec2d97b 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -175,6 +175,22 @@ config MD_FAULTY In unsure, say N. + +config MD_CLUSTER + tristate "Cluster Support for MD (EXPERIMENTAL)" + depends on BLK_DEV_MD + depends on DLM + default n + ---help--- + Clustering support for MD devices. This enables locking and + synchronization across multiple systems on the cluster, so all + nodes in the cluster can access the MD devices simultaneously. + + This brings the redundancy (and uptime) of RAID levels across the + nodes of the cluster. + + If unsure, say N. + source "drivers/md/bcache/Kconfig" config BLK_DEV_DM_BUILTIN diff --git a/drivers/md/Makefile b/drivers/md/Makefile index a2da532b1c2b..7ed86876f3b7 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o obj-$(CONFIG_MD_RAID456) += raid456.o obj-$(CONFIG_MD_MULTIPATH) += multipath.o obj-$(CONFIG_MD_FAULTY) += faulty.o +obj-$(CONFIG_MD_CLUSTER) += md-cluster.o obj-$(CONFIG_BCACHE) += bcache/ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c new file mode 100644 index 000000000000..f377e71949c5 --- /dev/null +++ b/drivers/md/md-cluster.c @@ -0,0 +1,28 @@ +/* + * Copyright (C) 2015, SUSE + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + */ + + +#include + +static int __init cluster_init(void) +{ + pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n"); + pr_info("Registering Cluster MD functions\n"); + return 0; +} + +static void cluster_exit(void) +{ +} + +module_init(cluster_init); +module_exit(cluster_exit); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("Clustering support for MD"); From 47741b7ca7b389d1b45d7cf15edc279c9be32fa8 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 7 Mar 2014 13:49:26 -0600 Subject: [PATCH 04/58] DLM lock and unlock functions A dlm_lock_resource is a structure which contains all information required for locking using DLM. The init function allocates the lock and acquires the lock in NL mode. The unlock function converts the lock resource to NL mode. This is done to preserve LVB and for faster processing of locks. The lock resource is DLM unlocked only in the lockres_free function, which is the end of life of the lock resource. Signed-off-by: Lidong Zhong Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 102 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index f377e71949c5..bc8ea9d76875 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -10,6 +10,108 @@ #include +#include +#include +#include "md.h" + +#define LVB_SIZE 64 + +struct dlm_lock_resource { + dlm_lockspace_t *ls; + struct dlm_lksb lksb; + char *name; /* lock name. */ + uint32_t flags; /* flags to pass to dlm_lock() */ + void (*bast)(void *arg, int mode); /* blocking AST function pointer*/ + struct completion completion; /* completion for synchronized locking */ +}; + +static void sync_ast(void *arg) +{ + struct dlm_lock_resource *res; + + res = (struct dlm_lock_resource *) arg; + complete(&res->completion); +} + +static int dlm_lock_sync(struct dlm_lock_resource *res, int mode) +{ + int ret = 0; + + init_completion(&res->completion); + ret = dlm_lock(res->ls, mode, &res->lksb, + res->flags, res->name, strlen(res->name), + 0, sync_ast, res, res->bast); + if (ret) + return ret; + wait_for_completion(&res->completion); + return res->lksb.sb_status; +} + +static int dlm_unlock_sync(struct dlm_lock_resource *res) +{ + return dlm_lock_sync(res, DLM_LOCK_NL); +} + +static struct dlm_lock_resource *lockres_init(dlm_lockspace_t *lockspace, + char *name, void (*bastfn)(void *arg, int mode), int with_lvb) +{ + struct dlm_lock_resource *res = NULL; + int ret, namelen; + + res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL); + if (!res) + return NULL; + res->ls = lockspace; + namelen = strlen(name); + res->name = kzalloc(namelen + 1, GFP_KERNEL); + if (!res->name) { + pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name); + goto out_err; + } + strlcpy(res->name, name, namelen + 1); + if (with_lvb) { + res->lksb.sb_lvbptr = kzalloc(LVB_SIZE, GFP_KERNEL); + if (!res->lksb.sb_lvbptr) { + pr_err("md-cluster: Unable to allocate LVB for resource %s\n", name); + goto out_err; + } + res->flags = DLM_LKF_VALBLK; + } + + if (bastfn) + res->bast = bastfn; + + res->flags |= DLM_LKF_EXPEDITE; + + ret = dlm_lock_sync(res, DLM_LOCK_NL); + if (ret) { + pr_err("md-cluster: Unable to lock NL on new lock resource %s\n", name); + goto out_err; + } + res->flags &= ~DLM_LKF_EXPEDITE; + res->flags |= DLM_LKF_CONVERT; + + return res; +out_err: + kfree(res->lksb.sb_lvbptr); + kfree(res->name); + kfree(res); + return NULL; +} + +static void lockres_free(struct dlm_lock_resource *res) +{ + if (!res) + return; + + init_completion(&res->completion); + dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res); + wait_for_completion(&res->completion); + + kfree(res->name); + kfree(res->lksb.sb_lvbptr); + kfree(res); +} static int __init cluster_init(void) { From edb39c9deda87da5aad9c090e2e8eaf8470c852c Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 29 Mar 2014 10:01:53 -0500 Subject: [PATCH 05/58] Introduce md_cluster_operations to handle cluster functions This allows dynamic registering of cluster hooks. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 18 ++++++++++++++ drivers/md/md-cluster.h | 15 ++++++++++++ drivers/md/md.c | 52 +++++++++++++++++++++++++++++++++++++++++ drivers/md/md.h | 7 ++++++ 4 files changed, 92 insertions(+) create mode 100644 drivers/md/md-cluster.h diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index bc8ea9d76875..e2235600a72b 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -13,6 +13,7 @@ #include #include #include "md.h" +#include "md-cluster.h" #define LVB_SIZE 64 @@ -113,15 +114,32 @@ static void lockres_free(struct dlm_lock_resource *res) kfree(res); } +static int join(struct mddev *mddev, int nodes) +{ + return 0; +} + +static int leave(struct mddev *mddev) +{ + return 0; +} + +static struct md_cluster_operations cluster_ops = { + .join = join, + .leave = leave, +}; + static int __init cluster_init(void) { pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n"); pr_info("Registering Cluster MD functions\n"); + register_md_cluster_operations(&cluster_ops, THIS_MODULE); return 0; } static void cluster_exit(void) { + unregister_md_cluster_operations(); } module_init(cluster_init); diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h new file mode 100644 index 000000000000..aa9f07bd6b96 --- /dev/null +++ b/drivers/md/md-cluster.h @@ -0,0 +1,15 @@ + + +#ifndef _MD_CLUSTER_H +#define _MD_CLUSTER_H + +#include "md.h" + +struct mddev; + +struct md_cluster_operations { + int (*join)(struct mddev *mddev); + int (*leave)(struct mddev *mddev); +}; + +#endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index c8d2bac4e28b..57ecb51ec5fd 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -53,6 +53,7 @@ #include #include "md.h" #include "bitmap.h" +#include "md-cluster.h" #ifndef MODULE static void autostart_arrays(int part); @@ -66,6 +67,10 @@ static void autostart_arrays(int part); static LIST_HEAD(pers_list); static DEFINE_SPINLOCK(pers_lock); +struct md_cluster_operations *md_cluster_ops; +struct module *md_cluster_mod; +EXPORT_SYMBOL(md_cluster_mod); + static DECLARE_WAIT_QUEUE_HEAD(resync_wait); static struct workqueue_struct *md_wq; static struct workqueue_struct *md_misc_wq; @@ -7231,6 +7236,53 @@ int unregister_md_personality(struct md_personality *p) } EXPORT_SYMBOL(unregister_md_personality); +int register_md_cluster_operations(struct md_cluster_operations *ops, struct module *module) +{ + if (md_cluster_ops != NULL) + return -EALREADY; + spin_lock(&pers_lock); + md_cluster_ops = ops; + md_cluster_mod = module; + spin_unlock(&pers_lock); + return 0; +} +EXPORT_SYMBOL(register_md_cluster_operations); + +int unregister_md_cluster_operations(void) +{ + spin_lock(&pers_lock); + md_cluster_ops = NULL; + spin_unlock(&pers_lock); + return 0; +} +EXPORT_SYMBOL(unregister_md_cluster_operations); + +int md_setup_cluster(struct mddev *mddev, int nodes) +{ + int err; + + err = request_module("md-cluster"); + if (err) { + pr_err("md-cluster module not found.\n"); + return err; + } + + spin_lock(&pers_lock); + if (!md_cluster_ops || !try_module_get(md_cluster_mod)) { + spin_unlock(&pers_lock); + return -ENOENT; + } + spin_unlock(&pers_lock); + + return md_cluster_ops->join(mddev); +} + +void md_cluster_stop(struct mddev *mddev) +{ + md_cluster_ops->leave(mddev); + module_put(md_cluster_mod); +} + static int is_mddev_idle(struct mddev *mddev, int init) { struct md_rdev *rdev; diff --git a/drivers/md/md.h b/drivers/md/md.h index 318ca8fd430f..018593197c4d 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -23,6 +23,7 @@ #include #include #include +#include "md-cluster.h" #define MaxSector (~(sector_t)0) @@ -608,6 +609,11 @@ static inline void safe_put_page(struct page *p) extern int register_md_personality(struct md_personality *p); extern int unregister_md_personality(struct md_personality *p); +extern int register_md_cluster_operations(struct md_cluster_operations *ops, + struct module *module); +extern int unregister_md_cluster_operations(void); +extern int md_setup_cluster(struct mddev *mddev, int nodes); +extern void md_cluster_stop(struct mddev *mddev); extern struct md_thread *md_register_thread( void (*run)(struct md_thread *thread), struct mddev *mddev, @@ -669,4 +675,5 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev) } } +extern struct md_cluster_operations *md_cluster_ops; #endif /* _MD_MD_H */ From c4ce867fdad200dfd8aa8cbe1eabc26c14c51635 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 29 Mar 2014 10:20:02 -0500 Subject: [PATCH 06/58] Introduce md_cluster_info md_cluster_info stores the cluster information in the MD device. The join() is called when mddev detects it is a clustered device. The main responsibilities are: 1. Setup a DLM lockspace 2. Setup all initial locks such as super block locks and bitmap lock (will come later) The leave() clears up the lockspace and all the locks held. Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.c | 9 +++++- drivers/md/md-cluster.c | 65 +++++++++++++++++++++++++++++++++++++++-- drivers/md/md.c | 2 ++ drivers/md/md.h | 8 +++++ 4 files changed, 80 insertions(+), 4 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index 3a5767968ba0..e2aacca46911 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -433,6 +433,7 @@ void bitmap_update_sb(struct bitmap *bitmap) /* This might have been changed by a reshape */ sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors); sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize); + sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes); sb->sectors_reserved = cpu_to_le32(bitmap->mddev-> bitmap_info.space); kunmap_atomic(sb); @@ -544,6 +545,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) bitmap_super_t *sb; unsigned long chunksize, daemon_sleep, write_behind; unsigned long long events; + int nodes = 0; unsigned long sectors_reserved = 0; int err = -EINVAL; struct page *sb_page; @@ -583,6 +585,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ; write_behind = le32_to_cpu(sb->write_behind); sectors_reserved = le32_to_cpu(sb->sectors_reserved); + nodes = le32_to_cpu(sb->nodes); /* verify that the bitmap-specific fields are valid */ if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) @@ -643,6 +646,7 @@ out_no_sb: bitmap->mddev->bitmap_info.chunksize = chunksize; bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep; bitmap->mddev->bitmap_info.max_write_behind = write_behind; + bitmap->mddev->bitmap_info.nodes = nodes; if (bitmap->mddev->bitmap_info.space == 0 || bitmap->mddev->bitmap_info.space > sectors_reserved) bitmap->mddev->bitmap_info.space = sectors_reserved; @@ -2186,6 +2190,8 @@ __ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store); static ssize_t metadata_show(struct mddev *mddev, char *page) { + if (mddev_is_clustered(mddev)) + return sprintf(page, "clustered\n"); return sprintf(page, "%s\n", (mddev->bitmap_info.external ? "external" : "internal")); } @@ -2198,7 +2204,8 @@ static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len) return -EBUSY; if (strncmp(buf, "external", 8) == 0) mddev->bitmap_info.external = 1; - else if (strncmp(buf, "internal", 8) == 0) + else if ((strncmp(buf, "internal", 8) == 0) || + (strncmp(buf, "clustered", 9) == 0)) mddev->bitmap_info.external = 0; else return -EINVAL; diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index e2235600a72b..d141d4812c8c 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -22,8 +22,16 @@ struct dlm_lock_resource { struct dlm_lksb lksb; char *name; /* lock name. */ uint32_t flags; /* flags to pass to dlm_lock() */ - void (*bast)(void *arg, int mode); /* blocking AST function pointer*/ struct completion completion; /* completion for synchronized locking */ + void (*bast)(void *arg, int mode); /* blocking AST function pointer*/ + struct mddev *mddev; /* pointing back to mddev. */ +}; + +struct md_cluster_info { + /* dlm lock space and resources for clustered raid. */ + dlm_lockspace_t *lockspace; + struct dlm_lock_resource *sb_lock; + struct mutex sb_mutex; }; static void sync_ast(void *arg) @@ -53,16 +61,18 @@ static int dlm_unlock_sync(struct dlm_lock_resource *res) return dlm_lock_sync(res, DLM_LOCK_NL); } -static struct dlm_lock_resource *lockres_init(dlm_lockspace_t *lockspace, +static struct dlm_lock_resource *lockres_init(struct mddev *mddev, char *name, void (*bastfn)(void *arg, int mode), int with_lvb) { struct dlm_lock_resource *res = NULL; int ret, namelen; + struct md_cluster_info *cinfo = mddev->cluster_info; res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL); if (!res) return NULL; - res->ls = lockspace; + res->ls = cinfo->lockspace; + res->mddev = mddev; namelen = strlen(name); res->name = kzalloc(namelen + 1, GFP_KERNEL); if (!res->name) { @@ -114,13 +124,62 @@ static void lockres_free(struct dlm_lock_resource *res) kfree(res); } +static char *pretty_uuid(char *dest, char *src) +{ + int i, len = 0; + + for (i = 0; i < 16; i++) { + if (i == 4 || i == 6 || i == 8 || i == 10) + len += sprintf(dest + len, "-"); + len += sprintf(dest + len, "%02x", (__u8)src[i]); + } + return dest; +} + static int join(struct mddev *mddev, int nodes) { + struct md_cluster_info *cinfo; + int ret; + char str[64]; + + if (!try_module_get(THIS_MODULE)) + return -ENOENT; + + cinfo = kzalloc(sizeof(struct md_cluster_info), GFP_KERNEL); + if (!cinfo) + return -ENOMEM; + + memset(str, 0, 64); + pretty_uuid(str, mddev->uuid); + ret = dlm_new_lockspace(str, NULL, DLM_LSFL_FS, LVB_SIZE, + NULL, NULL, NULL, &cinfo->lockspace); + if (ret) + goto err; + cinfo->sb_lock = lockres_init(mddev, "cmd-super", + NULL, 0); + if (!cinfo->sb_lock) { + ret = -ENOMEM; + goto err; + } + mutex_init(&cinfo->sb_mutex); + mddev->cluster_info = cinfo; return 0; +err: + if (cinfo->lockspace) + dlm_release_lockspace(cinfo->lockspace, 2); + kfree(cinfo); + module_put(THIS_MODULE); + return ret; } static int leave(struct mddev *mddev) { + struct md_cluster_info *cinfo = mddev->cluster_info; + + if (!cinfo) + return 0; + lockres_free(cinfo->sb_lock); + dlm_release_lockspace(cinfo->lockspace, 2); return 0; } diff --git a/drivers/md/md.c b/drivers/md/md.c index 57ecb51ec5fd..3387f940140b 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7279,6 +7279,8 @@ int md_setup_cluster(struct mddev *mddev, int nodes) void md_cluster_stop(struct mddev *mddev) { + if (!md_cluster_ops) + return; md_cluster_ops->leave(mddev); module_put(md_cluster_mod); } diff --git a/drivers/md/md.h b/drivers/md/md.h index 018593197c4d..80fc89976915 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -203,6 +203,8 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, int is_new); extern void md_ack_all_badblocks(struct badblocks *bb); +struct md_cluster_info; + struct mddev { void *private; struct md_personality *pers; @@ -431,6 +433,7 @@ struct mddev { unsigned long daemon_sleep; /* how many jiffies between updates? */ unsigned long max_write_behind; /* write-behind mode */ int external; + int nodes; /* Maximum number of nodes in the cluster */ } bitmap_info; atomic_t max_corr_read_errors; /* max read retries */ @@ -449,6 +452,7 @@ struct mddev { struct work_struct flush_work; struct work_struct event_work; /* used by dm to report failure event */ void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); + struct md_cluster_info *cluster_info; }; static inline int __must_check mddev_lock(struct mddev *mddev) @@ -676,4 +680,8 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev) } extern struct md_cluster_operations *md_cluster_ops; +static inline int mddev_is_clustered(struct mddev *mddev) +{ + return mddev->cluster_info && mddev->bitmap_info.nodes > 1; +} #endif /* _MD_MD_H */ From ca8895d9bb41e743271c42a4438a296de891b73b Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Wed, 26 Nov 2014 12:22:03 -0600 Subject: [PATCH 07/58] Return MD_SB_CLUSTERED if mddev is clustered Signed-off-by: Goldwyn Rodrigues --- drivers/md/md.c | 3 +++ include/uapi/linux/raid/md_p.h | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/md/md.c b/drivers/md/md.c index 3387f940140b..5ed57688e5c5 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5634,6 +5634,9 @@ static int get_array_info(struct mddev *mddev, void __user *arg) info.state = (1<bitmap && mddev->bitmap_info.offset) info.state |= (1< Date: Sun, 30 Mar 2014 00:42:49 -0500 Subject: [PATCH 08/58] Add node recovery callbacks DLM offers callbacks when a node fails and the lock remastery is performed: 1. recover_prep: called when DLM discovers a node is down 2. recover_slot: called when DLM identifies the node and recovery can start 3. recover_done: called when all nodes have completed recover_slot recover_slot() and recover_done() are also called when the node joins initially in order to inform the node with its slot number. These slot numbers start from one, so we deduct one to make it start with zero which the cluster-md code uses. Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.c | 1 + drivers/md/bitmap.h | 4 +-- drivers/md/md-cluster.c | 62 +++++++++++++++++++++++++++++++++++++---- drivers/md/md-cluster.h | 3 +- drivers/md/md.c | 2 +- drivers/md/md.h | 1 + 6 files changed, 64 insertions(+), 9 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index e2aacca46911..b43a75a246e7 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -637,6 +637,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN) set_bit(BITMAP_HOSTENDIAN, &bitmap->flags); bitmap->events_cleared = le64_to_cpu(sb->events_cleared); + strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); err = 0; out: kunmap_atomic(sb); diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index f4e53eadb083..ec9032f105b8 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -130,9 +130,9 @@ typedef struct bitmap_super_s { __le32 write_behind; /* 60 number of outstanding write-behind writes */ __le32 sectors_reserved; /* 64 number of 512-byte sectors that are * reserved for the bitmap. */ - __le32 nodes; /* 68 the maximum number of nodes in cluster. */ - __u8 pad[256 - 72]; /* set to zero */ + __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */ + __u8 pad[256 - 136]; /* set to zero */ } bitmap_super_t; /* notes: diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index d141d4812c8c..1f3c8f39ecb2 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -30,6 +30,8 @@ struct dlm_lock_resource { struct md_cluster_info { /* dlm lock space and resources for clustered raid. */ dlm_lockspace_t *lockspace; + int slot_number; + struct completion completion; struct dlm_lock_resource *sb_lock; struct mutex sb_mutex; }; @@ -136,10 +138,42 @@ static char *pretty_uuid(char *dest, char *src) return dest; } +static void recover_prep(void *arg) +{ +} + +static void recover_slot(void *arg, struct dlm_slot *slot) +{ + struct mddev *mddev = arg; + struct md_cluster_info *cinfo = mddev->cluster_info; + + pr_info("md-cluster: %s Node %d/%d down. My slot: %d. Initiating recovery.\n", + mddev->bitmap_info.cluster_name, + slot->nodeid, slot->slot, + cinfo->slot_number); +} + +static void recover_done(void *arg, struct dlm_slot *slots, + int num_slots, int our_slot, + uint32_t generation) +{ + struct mddev *mddev = arg; + struct md_cluster_info *cinfo = mddev->cluster_info; + + cinfo->slot_number = our_slot; + complete(&cinfo->completion); +} + +static const struct dlm_lockspace_ops md_ls_ops = { + .recover_prep = recover_prep, + .recover_slot = recover_slot, + .recover_done = recover_done, +}; + static int join(struct mddev *mddev, int nodes) { struct md_cluster_info *cinfo; - int ret; + int ret, ops_rv; char str[64]; if (!try_module_get(THIS_MODULE)) @@ -149,24 +183,30 @@ static int join(struct mddev *mddev, int nodes) if (!cinfo) return -ENOMEM; + init_completion(&cinfo->completion); + + mutex_init(&cinfo->sb_mutex); + mddev->cluster_info = cinfo; + memset(str, 0, 64); pretty_uuid(str, mddev->uuid); - ret = dlm_new_lockspace(str, NULL, DLM_LSFL_FS, LVB_SIZE, - NULL, NULL, NULL, &cinfo->lockspace); + ret = dlm_new_lockspace(str, mddev->bitmap_info.cluster_name, + DLM_LSFL_FS, LVB_SIZE, + &md_ls_ops, mddev, &ops_rv, &cinfo->lockspace); if (ret) goto err; + wait_for_completion(&cinfo->completion); cinfo->sb_lock = lockres_init(mddev, "cmd-super", NULL, 0); if (!cinfo->sb_lock) { ret = -ENOMEM; goto err; } - mutex_init(&cinfo->sb_mutex); - mddev->cluster_info = cinfo; return 0; err: if (cinfo->lockspace) dlm_release_lockspace(cinfo->lockspace, 2); + mddev->cluster_info = NULL; kfree(cinfo); module_put(THIS_MODULE); return ret; @@ -183,9 +223,21 @@ static int leave(struct mddev *mddev) return 0; } +/* slot_number(): Returns the MD slot number to use + * DLM starts the slot numbers from 1, wheras cluster-md + * wants the number to be from zero, so we deduct one + */ +static int slot_number(struct mddev *mddev) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + + return cinfo->slot_number - 1; +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, + .slot_number = slot_number, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index aa9f07bd6b96..52a21e0d6dbc 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -8,8 +8,9 @@ struct mddev; struct md_cluster_operations { - int (*join)(struct mddev *mddev); + int (*join)(struct mddev *mddev, int nodes); int (*leave)(struct mddev *mddev); + int (*slot_number)(struct mddev *mddev); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 5ed57688e5c5..8f310d98f082 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7277,7 +7277,7 @@ int md_setup_cluster(struct mddev *mddev, int nodes) } spin_unlock(&pers_lock); - return md_cluster_ops->join(mddev); + return md_cluster_ops->join(mddev, nodes); } void md_cluster_stop(struct mddev *mddev) diff --git a/drivers/md/md.h b/drivers/md/md.h index 80fc89976915..81e568090d8f 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -434,6 +434,7 @@ struct mddev { unsigned long max_write_behind; /* write-behind mode */ int external; int nodes; /* Maximum number of nodes in the cluster */ + char cluster_name[64]; /* Name of the cluster */ } bitmap_info; atomic_t max_corr_read_errors; /* max read retries */ From b97e92574c0bf335db1cd2ec491d8ff5cd5d0b49 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 6 Jun 2014 11:50:56 -0500 Subject: [PATCH 09/58] Use separate bitmaps for each nodes in the cluster On-disk format: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | Bitmap super has a field nodes, which defines the maximum number of nodes the device can use. While reading the bitmap super, if the cluster finds out that the number of nodes is > 0: 1. Requests the md-cluster module. 2. Calls md_cluster_ops->join(), which sets up clustering such as joining DLM lockspace. Since the first time, the first bitmap is read. After the call to the cluster_setup, the bitmap offset is adjusted and the superblock is re-read. This also ensures the bitmap is read the bitmap lock (when bitmap lock is introduced in later patches) Questions: 1. cluster name is repeated in all bitmap supers. Is that okay? Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.c | 67 +++++++++++++++++++++++++++++++++++------ drivers/md/bitmap.h | 1 + drivers/md/md-cluster.c | 6 ++++ 3 files changed, 64 insertions(+), 10 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index b43a75a246e7..b1d94eee3346 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -205,6 +205,10 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait) struct block_device *bdev; struct mddev *mddev = bitmap->mddev; struct bitmap_storage *store = &bitmap->storage; + int node_offset = 0; + + if (mddev_is_clustered(bitmap->mddev)) + node_offset = bitmap->cluster_slot * store->file_pages; while ((rdev = next_active_rdev(rdev, mddev)) != NULL) { int size = PAGE_SIZE; @@ -549,6 +553,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) unsigned long sectors_reserved = 0; int err = -EINVAL; struct page *sb_page; + int cluster_setup_done = 0; if (!bitmap->storage.file && !bitmap->mddev->bitmap_info.offset) { chunksize = 128 * 1024 * 1024; @@ -564,6 +569,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) return -ENOMEM; bitmap->storage.sb_page = sb_page; +re_read: if (bitmap->storage.file) { loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host); int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; @@ -579,6 +585,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (err) return err; + err = -EINVAL; sb = kmap_atomic(sb_page); chunksize = le32_to_cpu(sb->chunksize); @@ -586,6 +593,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) write_behind = le32_to_cpu(sb->write_behind); sectors_reserved = le32_to_cpu(sb->sectors_reserved); nodes = le32_to_cpu(sb->nodes); + strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); /* verify that the bitmap-specific fields are valid */ if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) @@ -622,7 +630,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) goto out; } events = le64_to_cpu(sb->events); - if (events < bitmap->mddev->events) { + if (!nodes && (events < bitmap->mddev->events)) { printk(KERN_INFO "%s: bitmap file is out of date (%llu < %llu) " "-- forcing full recovery\n", @@ -639,8 +647,34 @@ static int bitmap_read_sb(struct bitmap *bitmap) bitmap->events_cleared = le64_to_cpu(sb->events_cleared); strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); err = 0; + out: kunmap_atomic(sb); + if (nodes && !cluster_setup_done) { + sector_t bm_blocks; + + bm_blocks = sector_div(bitmap->mddev->resync_max_sectors, (chunksize >> 9)); + bm_blocks = bm_blocks << 3; + /* We have bitmap supers at 4k boundaries, hence this + * is hardcoded */ + bm_blocks = DIV_ROUND_UP(bm_blocks, 4096); + err = md_setup_cluster(bitmap->mddev, nodes); + if (err) { + pr_err("%s: Could not setup cluster service (%d)\n", + bmname(bitmap), err); + goto out_no_sb; + } + bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev); + bitmap->mddev->bitmap_info.offset += + bitmap->cluster_slot * (bm_blocks << 3); + pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, + bitmap->cluster_slot, + (unsigned long long)bitmap->mddev->bitmap_info.offset); + cluster_setup_done = 1; + goto re_read; + } + + out_no_sb: if (test_bit(BITMAP_STALE, &bitmap->flags)) bitmap->events_cleared = bitmap->mddev->events; @@ -651,8 +685,11 @@ out_no_sb: if (bitmap->mddev->bitmap_info.space == 0 || bitmap->mddev->bitmap_info.space > sectors_reserved) bitmap->mddev->bitmap_info.space = sectors_reserved; - if (err) + if (err) { bitmap_print_sb(bitmap); + if (cluster_setup_done) + md_cluster_stop(bitmap->mddev); + } return err; } @@ -697,9 +734,10 @@ static inline struct page *filemap_get_page(struct bitmap_storage *store, } static int bitmap_storage_alloc(struct bitmap_storage *store, - unsigned long chunks, int with_super) + unsigned long chunks, int with_super, + int slot_number) { - int pnum; + int pnum, offset = 0; unsigned long num_pages; unsigned long bytes; @@ -708,6 +746,7 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, bytes += sizeof(bitmap_super_t); num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE); + offset = slot_number * (num_pages - 1); store->filemap = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL); @@ -718,20 +757,22 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO); if (store->sb_page == NULL) return -ENOMEM; - store->sb_page->index = 0; } + pnum = 0; if (store->sb_page) { store->filemap[0] = store->sb_page; pnum = 1; + store->sb_page->index = offset; } + for ( ; pnum < num_pages; pnum++) { store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO); if (!store->filemap[pnum]) { store->file_pages = pnum; return -ENOMEM; } - store->filemap[pnum]->index = pnum; + store->filemap[pnum]->index = pnum + offset; } store->file_pages = pnum; @@ -940,7 +981,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n */ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) { - unsigned long i, chunks, index, oldindex, bit; + unsigned long i, chunks, index, oldindex, bit, node_offset = 0; struct page *page = NULL; unsigned long bit_cnt = 0; struct file *file; @@ -986,6 +1027,9 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) if (!bitmap->mddev->bitmap_info.external) offset = sizeof(bitmap_super_t); + if (mddev_is_clustered(bitmap->mddev)) + node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE)); + for (i = 0; i < chunks; i++) { int b; index = file_page_index(&bitmap->storage, i); @@ -1006,7 +1050,7 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) bitmap->mddev, bitmap->mddev->bitmap_info.offset, page, - index, count); + index + node_offset, count); if (ret) goto err; @@ -1212,7 +1256,6 @@ void bitmap_daemon_work(struct mddev *mddev) j < bitmap->storage.file_pages && !test_bit(BITMAP_STALE, &bitmap->flags); j++) { - if (test_page_attr(bitmap, j, BITMAP_PAGE_DIRTY)) /* bitmap_unplug will handle the rest */ @@ -1596,6 +1639,9 @@ static void bitmap_free(struct bitmap *bitmap) if (!bitmap) /* there was no bitmap */ return; + if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info) + md_cluster_stop(bitmap->mddev); + /* Shouldn't be needed - but just in case.... */ wait_event(bitmap->write_wait, atomic_read(&bitmap->pending_writes) == 0); @@ -1854,7 +1900,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks, memset(&store, 0, sizeof(store)); if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file) ret = bitmap_storage_alloc(&store, chunks, - !bitmap->mddev->bitmap_info.external); + !bitmap->mddev->bitmap_info.external, + bitmap->cluster_slot); if (ret) goto err; diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index ec9032f105b8..4e9acb08bbe0 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -227,6 +227,7 @@ struct bitmap { wait_queue_head_t behind_wait; struct kernfs_node *sysfs_can_clear; + int cluster_slot; /* Slot offset for clustered env */ }; /* the bitmap API */ diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 1f3c8f39ecb2..66700e244a40 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -196,6 +196,12 @@ static int join(struct mddev *mddev, int nodes) if (ret) goto err; wait_for_completion(&cinfo->completion); + if (nodes <= cinfo->slot_number) { + pr_err("md-cluster: Slot allotted(%d) greater than available slots(%d)", cinfo->slot_number - 1, + nodes); + ret = -ERANGE; + goto err; + } cinfo->sb_lock = lockres_init(mddev, "cmd-super", NULL, 0); if (!cinfo->sb_lock) { From 54519c5f4b398bcfe599f652b4ef4004d5fa63ff Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 6 Jun 2014 12:12:32 -0500 Subject: [PATCH 10/58] Lock bitmap while joining the cluster Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 66700e244a40..75c6602f4c75 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -34,6 +34,7 @@ struct md_cluster_info { struct completion completion; struct dlm_lock_resource *sb_lock; struct mutex sb_mutex; + struct dlm_lock_resource *bitmap_lockres; }; static void sync_ast(void *arg) @@ -208,6 +209,18 @@ static int join(struct mddev *mddev, int nodes) ret = -ENOMEM; goto err; } + + pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number); + snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1); + cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1); + if (!cinfo->bitmap_lockres) + goto err; + if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) { + pr_err("Failed to get bitmap lock\n"); + ret = -EINVAL; + goto err; + } + return 0; err: if (cinfo->lockspace) @@ -225,6 +238,7 @@ static int leave(struct mddev *mddev) if (!cinfo) return 0; lockres_free(cinfo->sb_lock); + lockres_free(cinfo->bitmap_lockres); dlm_release_lockspace(cinfo->lockspace, 2); return 0; } From 96ae923ab659e37dd5fc1e05ecbf654e2f94bcbe Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 6 Jun 2014 12:35:34 -0500 Subject: [PATCH 11/58] Gather on-going resync information of other nodes When a node joins, it does not know of other nodes performing resync. So, each node keeps the resync information in it's LVB. When a new node joins, it reads the LVB of each "online" bitmap. [TODO] The new node attempts to get the PW lock on other bitmap, if it is successful, it reads the bitmap and performs the resync (if required) on it's behalf. If the node does not get the PW, it requests CR and reads the LVB for the resync information. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 111 ++++++++++++++++++++++++++++++++++++++++ drivers/md/md-cluster.h | 1 + drivers/md/md.c | 8 +++ 3 files changed, 120 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 75c6602f4c75..b59c3a0ebd08 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -27,6 +27,18 @@ struct dlm_lock_resource { struct mddev *mddev; /* pointing back to mddev. */ }; +struct suspend_info { + int slot; + sector_t lo; + sector_t hi; + struct list_head list; +}; + +struct resync_info { + __le64 lo; + __le64 hi; +}; + struct md_cluster_info { /* dlm lock space and resources for clustered raid. */ dlm_lockspace_t *lockspace; @@ -35,6 +47,8 @@ struct md_cluster_info { struct dlm_lock_resource *sb_lock; struct mutex sb_mutex; struct dlm_lock_resource *bitmap_lockres; + struct list_head suspend_list; + spinlock_t suspend_lock; }; static void sync_ast(void *arg) @@ -139,6 +153,37 @@ static char *pretty_uuid(char *dest, char *src) return dest; } +static void add_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres, + sector_t lo, sector_t hi) +{ + struct resync_info *ri; + + ri = (struct resync_info *)lockres->lksb.sb_lvbptr; + ri->lo = cpu_to_le64(lo); + ri->hi = cpu_to_le64(hi); +} + +static struct suspend_info *read_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres) +{ + struct resync_info ri; + struct suspend_info *s = NULL; + sector_t hi = 0; + + dlm_lock_sync(lockres, DLM_LOCK_CR); + memcpy(&ri, lockres->lksb.sb_lvbptr, sizeof(struct resync_info)); + hi = le64_to_cpu(ri.hi); + if (ri.hi > 0) { + s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL); + if (!s) + goto out; + s->hi = hi; + s->lo = le64_to_cpu(ri.lo); + } + dlm_unlock_sync(lockres); +out: + return s; +} + static void recover_prep(void *arg) { } @@ -171,6 +216,53 @@ static const struct dlm_lockspace_ops md_ls_ops = { .recover_done = recover_done, }; +static int gather_all_resync_info(struct mddev *mddev, int total_slots) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + int i, ret = 0; + struct dlm_lock_resource *bm_lockres; + struct suspend_info *s; + char str[64]; + + + for (i = 0; i < total_slots; i++) { + memset(str, '\0', 64); + snprintf(str, 64, "bitmap%04d", i); + bm_lockres = lockres_init(mddev, str, NULL, 1); + if (!bm_lockres) + return -ENOMEM; + if (i == (cinfo->slot_number - 1)) + continue; + + bm_lockres->flags |= DLM_LKF_NOQUEUE; + ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW); + if (ret == -EAGAIN) { + memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE); + s = read_resync_info(mddev, bm_lockres); + if (s) { + pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n", + __func__, __LINE__, + (unsigned long long) s->lo, + (unsigned long long) s->hi, i); + spin_lock_irq(&cinfo->suspend_lock); + s->slot = i; + list_add(&s->list, &cinfo->suspend_list); + spin_unlock_irq(&cinfo->suspend_lock); + } + ret = 0; + lockres_free(bm_lockres); + continue; + } + if (ret) + goto out; + /* TODO: Read the disk bitmap sb and check if it needs recovery */ + dlm_unlock_sync(bm_lockres); + lockres_free(bm_lockres); + } +out: + return ret; +} + static int join(struct mddev *mddev, int nodes) { struct md_cluster_info *cinfo; @@ -221,8 +313,17 @@ static int join(struct mddev *mddev, int nodes) goto err; } + INIT_LIST_HEAD(&cinfo->suspend_list); + spin_lock_init(&cinfo->suspend_lock); + + ret = gather_all_resync_info(mddev, nodes); + if (ret) + goto err; + return 0; err: + lockres_free(cinfo->bitmap_lockres); + lockres_free(cinfo->sb_lock); if (cinfo->lockspace) dlm_release_lockspace(cinfo->lockspace, 2); mddev->cluster_info = NULL; @@ -254,10 +355,20 @@ static int slot_number(struct mddev *mddev) return cinfo->slot_number - 1; } +static void resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + + add_resync_info(mddev, cinfo->bitmap_lockres, lo, hi); + /* Re-acquire the lock to refresh LVB */ + dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW); +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, .slot_number = slot_number, + .resync_info_update = resync_info_update, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 52a21e0d6dbc..51a24df15b64 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -11,6 +11,7 @@ struct md_cluster_operations { int (*join)(struct mddev *mddev, int nodes); int (*leave)(struct mddev *mddev); int (*slot_number)(struct mddev *mddev); + void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 8f310d98f082..71f655015385 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7626,6 +7626,9 @@ void md_do_sync(struct md_thread *thread) md_new_event(mddev); update_time = jiffies; + if (mddev_is_clustered(mddev)) + md_cluster_ops->resync_info_update(mddev, j, max_sectors); + blk_start_plug(&plug); while (j < max_sectors) { sector_t sectors; @@ -7686,6 +7689,8 @@ void md_do_sync(struct md_thread *thread) j += sectors; if (j > 2) mddev->curr_resync = j; + if (mddev_is_clustered(mddev)) + md_cluster_ops->resync_info_update(mddev, j, max_sectors); mddev->curr_mark_cnt = io_sectors; if (last_check == 0) /* this is the earliest that rebuild will be @@ -7746,6 +7751,9 @@ void md_do_sync(struct md_thread *thread) /* tell personality that we are finished */ mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); + if (mddev_is_clustered(mddev)) + md_cluster_ops->resync_info_update(mddev, 0, 0); + if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && mddev->curr_resync > 2) { if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { From f9209a323547f054c7439a3bf67c45e64a054bdd Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Fri, 6 Jun 2014 12:43:49 -0500 Subject: [PATCH 12/58] bitmap_create returns bitmap pointer This is done to have multiple bitmaps open at the same time. Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.c | 60 +++++++++++++++++++++++++++------------------ drivers/md/bitmap.h | 2 +- drivers/md/md.c | 25 ++++++++++++++----- 3 files changed, 56 insertions(+), 31 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index b1d94eee3346..f02551f50bb5 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -553,7 +553,6 @@ static int bitmap_read_sb(struct bitmap *bitmap) unsigned long sectors_reserved = 0; int err = -EINVAL; struct page *sb_page; - int cluster_setup_done = 0; if (!bitmap->storage.file && !bitmap->mddev->bitmap_info.offset) { chunksize = 128 * 1024 * 1024; @@ -570,6 +569,18 @@ static int bitmap_read_sb(struct bitmap *bitmap) bitmap->storage.sb_page = sb_page; re_read: + /* If cluster_slot is set, the cluster is setup */ + if (bitmap->cluster_slot >= 0) { + long long bm_blocks; + + bm_blocks = bitmap->mddev->resync_max_sectors / (bitmap->mddev->bitmap_info.chunksize >> 9); + bm_blocks = bm_blocks << 3; + bm_blocks = DIV_ROUND_UP(bm_blocks, 4096); + bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); + pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, + bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset); + } + if (bitmap->storage.file) { loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host); int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; @@ -650,14 +661,9 @@ re_read: out: kunmap_atomic(sb); - if (nodes && !cluster_setup_done) { - sector_t bm_blocks; - - bm_blocks = sector_div(bitmap->mddev->resync_max_sectors, (chunksize >> 9)); - bm_blocks = bm_blocks << 3; - /* We have bitmap supers at 4k boundaries, hence this - * is hardcoded */ - bm_blocks = DIV_ROUND_UP(bm_blocks, 4096); + /* Assiging chunksize is required for "re_read" */ + bitmap->mddev->bitmap_info.chunksize = chunksize; + if (nodes && (bitmap->cluster_slot < 0)) { err = md_setup_cluster(bitmap->mddev, nodes); if (err) { pr_err("%s: Could not setup cluster service (%d)\n", @@ -665,12 +671,9 @@ out: goto out_no_sb; } bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev); - bitmap->mddev->bitmap_info.offset += - bitmap->cluster_slot * (bm_blocks << 3); pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset); - cluster_setup_done = 1; goto re_read; } @@ -687,7 +690,7 @@ out_no_sb: bitmap->mddev->bitmap_info.space = sectors_reserved; if (err) { bitmap_print_sb(bitmap); - if (cluster_setup_done) + if (bitmap->cluster_slot < 0) md_cluster_stop(bitmap->mddev); } return err; @@ -1639,7 +1642,8 @@ static void bitmap_free(struct bitmap *bitmap) if (!bitmap) /* there was no bitmap */ return; - if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info) + if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info && + bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev)) md_cluster_stop(bitmap->mddev); /* Shouldn't be needed - but just in case.... */ @@ -1687,7 +1691,7 @@ void bitmap_destroy(struct mddev *mddev) * initialize the bitmap structure * if this returns an error, bitmap_destroy must be called to do clean up */ -int bitmap_create(struct mddev *mddev) +struct bitmap *bitmap_create(struct mddev *mddev, int slot) { struct bitmap *bitmap; sector_t blocks = mddev->resync_max_sectors; @@ -1701,7 +1705,7 @@ int bitmap_create(struct mddev *mddev) bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL); if (!bitmap) - return -ENOMEM; + return ERR_PTR(-ENOMEM); spin_lock_init(&bitmap->counts.lock); atomic_set(&bitmap->pending_writes, 0); @@ -1710,6 +1714,7 @@ int bitmap_create(struct mddev *mddev) init_waitqueue_head(&bitmap->behind_wait); bitmap->mddev = mddev; + bitmap->cluster_slot = slot; if (mddev->kobj.sd) bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap"); @@ -1757,12 +1762,14 @@ int bitmap_create(struct mddev *mddev) printk(KERN_INFO "created bitmap (%lu pages) for device %s\n", bitmap->counts.pages, bmname(bitmap)); - mddev->bitmap = bitmap; - return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; + err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; + if (err) + goto error; + return bitmap; error: bitmap_free(bitmap); - return err; + return ERR_PTR(err); } int bitmap_load(struct mddev *mddev) @@ -2073,13 +2080,18 @@ location_store(struct mddev *mddev, const char *buf, size_t len) return -EINVAL; mddev->bitmap_info.offset = offset; if (mddev->pers) { + struct bitmap *bitmap; mddev->pers->quiesce(mddev, 1); - rv = bitmap_create(mddev); - if (!rv) + bitmap = bitmap_create(mddev, -1); + if (IS_ERR(bitmap)) + rv = PTR_ERR(bitmap); + else { + mddev->bitmap = bitmap; rv = bitmap_load(mddev); - if (rv) { - bitmap_destroy(mddev); - mddev->bitmap_info.offset = 0; + if (rv) { + bitmap_destroy(mddev); + mddev->bitmap_info.offset = 0; + } } mddev->pers->quiesce(mddev, 0); if (rv) diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index 4e9acb08bbe0..67c7f77c67dd 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -233,7 +233,7 @@ struct bitmap { /* the bitmap API */ /* these are used only by md/bitmap */ -int bitmap_create(struct mddev *mddev); +struct bitmap *bitmap_create(struct mddev *mddev, int slot); int bitmap_load(struct mddev *mddev); void bitmap_flush(struct mddev *mddev); void bitmap_destroy(struct mddev *mddev); diff --git a/drivers/md/md.c b/drivers/md/md.c index 71f655015385..630a9142a819 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5076,10 +5076,16 @@ int md_run(struct mddev *mddev) } if (err == 0 && pers->sync_request && (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { - err = bitmap_create(mddev); - if (err) + struct bitmap *bitmap; + + bitmap = bitmap_create(mddev, -1); + if (IS_ERR(bitmap)) { + err = PTR_ERR(bitmap); printk(KERN_ERR "%s: failed to create bitmap (%d)\n", mdname(mddev), err); + } else + mddev->bitmap = bitmap; + } if (err) { mddev_detach(mddev); @@ -6039,9 +6045,13 @@ static int set_bitmap_file(struct mddev *mddev, int fd) if (mddev->pers) { mddev->pers->quiesce(mddev, 1); if (fd >= 0) { - err = bitmap_create(mddev); - if (!err) + struct bitmap *bitmap; + + bitmap = bitmap_create(mddev, -1); + if (!IS_ERR(bitmap)) { + mddev->bitmap = bitmap; err = bitmap_load(mddev); + } } if (fd < 0 || err) { bitmap_destroy(mddev); @@ -6306,6 +6316,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) if (mddev->recovery || mddev->sync_thread) return -EBUSY; if (info->state & (1<bitmap) return -EEXIST; @@ -6316,9 +6327,11 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->bitmap_info.space = mddev->bitmap_info.default_space; mddev->pers->quiesce(mddev, 1); - rv = bitmap_create(mddev); - if (!rv) + bitmap = bitmap_create(mddev, -1); + if (!IS_ERR(bitmap)) { + mddev->bitmap = bitmap; rv = bitmap_load(mddev); + } if (rv) bitmap_destroy(mddev); mddev->pers->quiesce(mddev, 0); From 11dd35daaab86d12270d23a10e8d242846a8830a Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 00:36:26 -0500 Subject: [PATCH 13/58] Copy set bits from another slot bitmap_copy_from_slot reads the bitmap from the slot mentioned. It then copies the set bits to the node local bitmap. This is helper function for the resync operation on node failure. bitmap_set_memory_bits() currently assumes it is only run at startup and that they bitmap is currently empty. So if it finds that a region is already marked as dirty, it won't mark it dirty again. Change bitmap_set_memory_bits() to always set the NEEDED_MASK bit if 'needed' is set. Signed-off-by: Goldwyn Rodrigues --- drivers/md/bitmap.c | 78 ++++++++++++++++++++++++++++++++++++++++++++- drivers/md/bitmap.h | 2 ++ 2 files changed, 79 insertions(+), 1 deletion(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index f02551f50bb5..dd8c78043eab 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -934,6 +934,28 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block) } } +static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block) +{ + unsigned long bit; + struct page *page; + void *paddr; + unsigned long chunk = block >> bitmap->counts.chunkshift; + int set = 0; + + page = filemap_get_page(&bitmap->storage, chunk); + if (!page) + return -EINVAL; + bit = file_page_offset(&bitmap->storage, chunk); + paddr = kmap_atomic(page); + if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) + set = test_bit(bit, paddr); + else + set = test_bit_le(bit, paddr); + kunmap_atomic(paddr); + return set; +} + + /* this gets called when the md device is ready to unplug its underlying * (slave) device queues -- before we let any writes go down, we need to * sync the dirty pages of the bitmap file to disk */ @@ -1581,11 +1603,13 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n return; } if (!*bmc) { - *bmc = 2 | (needed ? NEEDED_MASK : 0); + *bmc = 2; bitmap_count_page(&bitmap->counts, offset, 1); bitmap_set_pending(&bitmap->counts, offset); bitmap->allclean = 0; } + if (needed) + *bmc |= NEEDED_MASK; spin_unlock_irq(&bitmap->counts.lock); } @@ -1823,6 +1847,58 @@ out: } EXPORT_SYMBOL_GPL(bitmap_load); +/* Loads the bitmap associated with slot and copies the resync information + * to our bitmap + */ +int bitmap_copy_from_slot(struct mddev *mddev, int slot, + sector_t *low, sector_t *high) +{ + int rv = 0, i, j; + sector_t block, lo = 0, hi = 0; + struct bitmap_counts *counts; + struct bitmap *bitmap = bitmap_create(mddev, slot); + + if (IS_ERR(bitmap)) + return PTR_ERR(bitmap); + + rv = bitmap_read_sb(bitmap); + if (rv) + goto err; + + rv = bitmap_init_from_disk(bitmap, 0); + if (rv) + goto err; + + counts = &bitmap->counts; + for (j = 0; j < counts->chunks; j++) { + block = (sector_t)j << counts->chunkshift; + if (bitmap_file_test_bit(bitmap, block)) { + if (!lo) + lo = block; + hi = block; + bitmap_file_clear_bit(bitmap, block); + bitmap_set_memory_bits(mddev->bitmap, block, 1); + bitmap_file_set_bit(mddev->bitmap, block); + } + } + + bitmap_update_sb(bitmap); + /* Setting this for the ev_page should be enough. + * And we do not require both write_all and PAGE_DIRT either + */ + for (i = 0; i < bitmap->storage.file_pages; i++) + set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); + bitmap_write_all(bitmap); + bitmap_unplug(bitmap); + *low = lo; + *high = hi; +err: + bitmap_free(bitmap); + return rv; +} +EXPORT_SYMBOL_GPL(bitmap_copy_from_slot); + + void bitmap_status(struct seq_file *seq, struct bitmap *bitmap) { unsigned long chunk_kb; diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index 67c7f77c67dd..4aabc74ef7b9 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -262,6 +262,8 @@ void bitmap_daemon_work(struct mddev *mddev); int bitmap_resize(struct bitmap *bitmap, sector_t blocks, int chunksize, int init); +int bitmap_copy_from_slot(struct mddev *mddev, int slot, + sector_t *lo, sector_t *hi); #endif #endif From e94987db2ed983aea4e45d22db9e17c6bbf2a623 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 00:45:22 -0500 Subject: [PATCH 14/58] Initiate recovery on node failure The DLM informs us in case of node failure with the DLM slot number. cluster_info->recovery_map sets the bit corresponding to the slot number and wakes up the recovery thread. The recovery thread: 1. Derives the slot number from the recovery_map 2. Locks the bitmap corresponding to the slot 3. Copies the set bits to the node-local bitmap Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 58 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index b59c3a0ebd08..1f82d0d731ae 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -13,6 +13,7 @@ #include #include #include "md.h" +#include "bitmap.h" #include "md-cluster.h" #define LVB_SIZE 64 @@ -49,6 +50,8 @@ struct md_cluster_info { struct dlm_lock_resource *bitmap_lockres; struct list_head suspend_list; spinlock_t suspend_lock; + struct md_thread *recovery_thread; + unsigned long recovery_map; }; static void sync_ast(void *arg) @@ -184,6 +187,50 @@ out: return s; } +void recover_bitmaps(struct md_thread *thread) +{ + struct mddev *mddev = thread->mddev; + struct md_cluster_info *cinfo = mddev->cluster_info; + struct dlm_lock_resource *bm_lockres; + char str[64]; + int slot, ret; + struct suspend_info *s, *tmp; + sector_t lo, hi; + + while (cinfo->recovery_map) { + slot = fls64((u64)cinfo->recovery_map) - 1; + + /* Clear suspend_area associated with the bitmap */ + spin_lock_irq(&cinfo->suspend_lock); + list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list) + if (slot == s->slot) { + list_del(&s->list); + kfree(s); + } + spin_unlock_irq(&cinfo->suspend_lock); + + snprintf(str, 64, "bitmap%04d", slot); + bm_lockres = lockres_init(mddev, str, NULL, 1); + if (!bm_lockres) { + pr_err("md-cluster: Cannot initialize bitmaps\n"); + goto clear_bit; + } + + ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW); + if (ret) { + pr_err("md-cluster: Could not DLM lock %s: %d\n", + str, ret); + goto clear_bit; + } + ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi); + if (ret) + pr_err("md-cluster: Could not copy data from bitmap %d\n", slot); + dlm_unlock_sync(bm_lockres); +clear_bit: + clear_bit(slot, &cinfo->recovery_map); + } +} + static void recover_prep(void *arg) { } @@ -197,6 +244,16 @@ static void recover_slot(void *arg, struct dlm_slot *slot) mddev->bitmap_info.cluster_name, slot->nodeid, slot->slot, cinfo->slot_number); + set_bit(slot->slot - 1, &cinfo->recovery_map); + if (!cinfo->recovery_thread) { + cinfo->recovery_thread = md_register_thread(recover_bitmaps, + mddev, "recover"); + if (!cinfo->recovery_thread) { + pr_warn("md-cluster: Could not create recovery thread\n"); + return; + } + } + md_wakeup_thread(cinfo->recovery_thread); } static void recover_done(void *arg, struct dlm_slot *slots, @@ -338,6 +395,7 @@ static int leave(struct mddev *mddev) if (!cinfo) return 0; + md_unregister_thread(&cinfo->recovery_thread); lockres_free(cinfo->sb_lock); lockres_free(cinfo->bitmap_lockres); dlm_release_lockspace(cinfo->lockspace, 2); From 4b26a08af92c0d9c0bce07612b56ff326112321a Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 00:52:29 -0500 Subject: [PATCH 15/58] Perform resync for cluster node failure If bitmap_copy_slot returns hi>0, we need to perform resync. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 1f82d0d731ae..d2987130be34 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -223,8 +223,18 @@ void recover_bitmaps(struct md_thread *thread) goto clear_bit; } ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi); - if (ret) + if (ret) { pr_err("md-cluster: Could not copy data from bitmap %d\n", slot); + goto dlm_unlock; + } + if (hi > 0) { + /* TODO:Wait for current resync to get over */ + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + if (lo < mddev->recovery_cp) + mddev->recovery_cp = lo; + md_check_recovery(mddev); + } +dlm_unlock: dlm_unlock_sync(bm_lockres); clear_bit: clear_bit(slot, &cinfo->recovery_map); From 4664680c389828928efc61ce3d2cf2c65ad35c97 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 01:08:29 -0500 Subject: [PATCH 16/58] Communication Framework: Receiving 1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has *processed* the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR Signed-off-by: Lidong Zhong Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 102 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index d2987130be34..96734cdb9b45 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -52,6 +52,23 @@ struct md_cluster_info { spinlock_t suspend_lock; struct md_thread *recovery_thread; unsigned long recovery_map; + /* communication loc resources */ + struct dlm_lock_resource *ack_lockres; + struct dlm_lock_resource *message_lockres; + struct dlm_lock_resource *token_lockres; + struct md_thread *recv_thread; +}; + +enum msg_type { + METADATA_UPDATED = 0, + RESYNCING, +}; + +struct cluster_msg { + int type; + int slot; + sector_t low; + sector_t high; }; static void sync_ast(void *arg) @@ -283,6 +300,64 @@ static const struct dlm_lockspace_ops md_ls_ops = { .recover_done = recover_done, }; +/* + * The BAST function for the ack lock resource + * This function wakes up the receive thread in + * order to receive and process the message. + */ +static void ack_bast(void *arg, int mode) +{ + struct dlm_lock_resource *res = (struct dlm_lock_resource *)arg; + struct md_cluster_info *cinfo = res->mddev->cluster_info; + + if (mode == DLM_LOCK_EX) + md_wakeup_thread(cinfo->recv_thread); +} + +static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) +{ + switch (msg->type) { + case METADATA_UPDATED: + pr_info("%s: %d Received message: METADATA_UPDATE from %d\n", + __func__, __LINE__, msg->slot); + break; + case RESYNCING: + pr_info("%s: %d Received message: RESYNCING from %d\n", + __func__, __LINE__, msg->slot); + break; + }; +} + +/* + * thread for receiving message + */ +static void recv_daemon(struct md_thread *thread) +{ + struct md_cluster_info *cinfo = thread->mddev->cluster_info; + struct dlm_lock_resource *ack_lockres = cinfo->ack_lockres; + struct dlm_lock_resource *message_lockres = cinfo->message_lockres; + struct cluster_msg msg; + + /*get CR on Message*/ + if (dlm_lock_sync(message_lockres, DLM_LOCK_CR)) { + pr_err("md/raid1:failed to get CR on MESSAGE\n"); + return; + } + + /* read lvb and wake up thread to process this message_lockres */ + memcpy(&msg, message_lockres->lksb.sb_lvbptr, sizeof(struct cluster_msg)); + process_recvd_msg(thread->mddev, &msg); + + /*release CR on ack_lockres*/ + dlm_unlock_sync(ack_lockres); + /*up-convert to EX on message_lockres*/ + dlm_lock_sync(message_lockres, DLM_LOCK_EX); + /*get CR on ack_lockres again*/ + dlm_lock_sync(ack_lockres, DLM_LOCK_CR); + /*release CR on message_lockres*/ + dlm_unlock_sync(message_lockres); +} + static int gather_all_resync_info(struct mddev *mddev, int total_slots) { struct md_cluster_info *cinfo = mddev->cluster_info; @@ -368,6 +443,26 @@ static int join(struct mddev *mddev, int nodes) ret = -ENOMEM; goto err; } + /* Initiate the communication resources */ + ret = -ENOMEM; + cinfo->recv_thread = md_register_thread(recv_daemon, mddev, "cluster_recv"); + if (!cinfo->recv_thread) { + pr_err("md-cluster: cannot allocate memory for recv_thread!\n"); + goto err; + } + cinfo->message_lockres = lockres_init(mddev, "message", NULL, 1); + if (!cinfo->message_lockres) + goto err; + cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0); + if (!cinfo->token_lockres) + goto err; + cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0); + if (!cinfo->ack_lockres) + goto err; + /* get sync CR lock on ACK. */ + if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR)) + pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n", + ret); pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number); snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1); @@ -389,6 +484,9 @@ static int join(struct mddev *mddev, int nodes) return 0; err: + lockres_free(cinfo->message_lockres); + lockres_free(cinfo->token_lockres); + lockres_free(cinfo->ack_lockres); lockres_free(cinfo->bitmap_lockres); lockres_free(cinfo->sb_lock); if (cinfo->lockspace) @@ -406,6 +504,10 @@ static int leave(struct mddev *mddev) if (!cinfo) return 0; md_unregister_thread(&cinfo->recovery_thread); + md_unregister_thread(&cinfo->recv_thread); + lockres_free(cinfo->message_lockres); + lockres_free(cinfo->token_lockres); + lockres_free(cinfo->ack_lockres); lockres_free(cinfo->sb_lock); lockres_free(cinfo->bitmap_lockres); dlm_release_lockspace(cinfo->lockspace, 2); From 601b515c5dcc00fa71148cd9d2405ea1f70bc9cd Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 01:28:53 -0500 Subject: [PATCH 17/58] Communication Framework: Sending functions The sending part is split in two functions to make sure atomicity of the operations, such as the MD superblock update. Signed-off-by: Lidong Zhong Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 87 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 87 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 96734cdb9b45..9a4abe1b4aa4 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -358,6 +358,93 @@ static void recv_daemon(struct md_thread *thread) dlm_unlock_sync(message_lockres); } +/* lock_comm() + * Takes the lock on the TOKEN lock resource so no other + * node can communicate while the operation is underway. + */ +static int lock_comm(struct md_cluster_info *cinfo) +{ + int error; + + error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); + if (error) + pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n", + __func__, __LINE__, error); + return error; +} + +static void unlock_comm(struct md_cluster_info *cinfo) +{ + dlm_unlock_sync(cinfo->token_lockres); +} + +/* __sendmsg() + * This function performs the actual sending of the message. This function is + * usually called after performing the encompassing operation + * The function: + * 1. Grabs the message lockresource in EX mode + * 2. Copies the message to the message LVB + * 3. Downconverts message lockresource to CR + * 4. Upconverts ack lock resource from CR to EX. This forces the BAST on other nodes + * and the other nodes read the message. The thread will wait here until all other + * nodes have released ack lock resource. + * 5. Downconvert ack lockresource to CR + */ +static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) +{ + int error; + int slot = cinfo->slot_number - 1; + + cmsg->slot = cpu_to_le32(slot); + /*get EX on Message*/ + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); + if (error) { + pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); + goto failed_message; + } + + memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg, + sizeof(struct cluster_msg)); + /*down-convert EX to CR on Message*/ + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CR); + if (error) { + pr_err("md-cluster: failed to convert EX to CR on MESSAGE(%d)\n", + error); + goto failed_message; + } + + /*up-convert CR to EX on Ack*/ + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); + if (error) { + pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", + error); + goto failed_ack; + } + + /*down-convert EX to CR on Ack*/ + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); + if (error) { + pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", + error); + goto failed_ack; + } + +failed_ack: + dlm_unlock_sync(cinfo->message_lockres); +failed_message: + return error; +} + +static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) +{ + int ret; + + lock_comm(cinfo); + ret = __sendmsg(cinfo, cmsg); + unlock_comm(cinfo); + return ret; +} + static int gather_all_resync_info(struct mddev *mddev, int total_slots) { struct md_cluster_info *cinfo = mddev->cluster_info; From 293467aa1f161cd50920ccf7fc1efa3946a4d50c Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 01:44:51 -0500 Subject: [PATCH 18/58] metadata_update sends message to other nodes - request to send a message - make changes to superblock - send messages telling everyone that the superblock has changed - other nodes all read the superblock - other nodes all ack the messages - updating node release the "I'm sending a message" resource. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 28 +++++++++++++ drivers/md/md-cluster.h | 3 ++ drivers/md/md.c | 89 ++++++++++++++++++++++++++++++++++------- 3 files changed, 106 insertions(+), 14 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 9a4abe1b4aa4..5db491010835 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -621,11 +621,39 @@ static void resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi) dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW); } +static int metadata_update_start(struct mddev *mddev) +{ + return lock_comm(mddev->cluster_info); +} + +static int metadata_update_finish(struct mddev *mddev) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + struct cluster_msg cmsg; + int ret; + + memset(&cmsg, 0, sizeof(cmsg)); + cmsg.type = cpu_to_le32(METADATA_UPDATED); + ret = __sendmsg(cinfo, &cmsg); + unlock_comm(cinfo); + return ret; +} + +static int metadata_update_cancel(struct mddev *mddev) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + + return dlm_unlock_sync(cinfo->token_lockres); +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, .slot_number = slot_number, .resync_info_update = resync_info_update, + .metadata_update_start = metadata_update_start, + .metadata_update_finish = metadata_update_finish, + .metadata_update_cancel = metadata_update_cancel, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 51a24df15b64..658982afcf9b 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -12,6 +12,9 @@ struct md_cluster_operations { int (*leave)(struct mddev *mddev); int (*slot_number)(struct mddev *mddev); void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi); + int (*metadata_update_start)(struct mddev *mddev); + int (*metadata_update_finish)(struct mddev *mddev); + int (*metadata_update_cancel)(struct mddev *mddev); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 630a9142a819..0052e433d8a6 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2472,10 +2472,14 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) err = -EBUSY; else { struct mddev *mddev = rdev->mddev; + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); kick_rdev_from_array(rdev); if (mddev->pers) md_update_sb(mddev, 1); md_new_event(mddev); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); err = 0; } } else if (cmd_match(buf, "writemostly")) { @@ -4008,8 +4012,12 @@ size_store(struct mddev *mddev, const char *buf, size_t len) if (err) return err; if (mddev->pers) { + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); err = update_size(mddev, sectors); md_update_sb(mddev, 1); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); } else { if (mddev->dev_sectors == 0 || mddev->dev_sectors > sectors) @@ -5236,6 +5244,8 @@ static void md_clean(struct mddev *mddev) static void __md_stop_writes(struct mddev *mddev) { + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); flush_workqueue(md_misc_wq); if (mddev->sync_thread) { @@ -5254,6 +5264,8 @@ static void __md_stop_writes(struct mddev *mddev) mddev->in_sync = 1; md_update_sb(mddev, 1); } + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); } void md_stop_writes(struct mddev *mddev) @@ -5902,6 +5914,9 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev) if (!rdev) return -ENXIO; + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); + clear_bit(Blocked, &rdev->flags); remove_and_add_spares(mddev, rdev); @@ -5912,8 +5927,13 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev) md_update_sb(mddev, 1); md_new_event(mddev); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); + return 0; busy: + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_cancel(mddev); printk(KERN_WARNING "md: cannot remove active disk %s from %s ...\n", bdevname(rdev->bdev,b), mdname(mddev)); return -EBUSY; @@ -5963,12 +5983,15 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev) err = -EINVAL; goto abort_export; } + + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); clear_bit(In_sync, &rdev->flags); rdev->desc_nr = -1; rdev->saved_raid_disk = -1; err = bind_rdev_to_array(rdev, mddev); if (err) - goto abort_export; + goto abort_clustered; /* * The rest should better be atomic, we can have disk failures @@ -5979,6 +6002,8 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev) md_update_sb(mddev, 1); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); /* * Kick recovery, maybe this spare has to be added to the * array immediately. @@ -5988,6 +6013,9 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev) md_new_event(mddev); return 0; +abort_clustered: + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_cancel(mddev); abort_export: export_rdev(rdev); return err; @@ -6304,6 +6332,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) return rv; } } + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); if (info->size >= 0 && mddev->dev_sectors / 2 != info->size) rv = update_size(mddev, (sector_t)info->size * 2); @@ -6311,17 +6341,25 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) rv = update_raid_disks(mddev, info->raid_disks); if ((state ^ info->state) & (1<pers->quiesce == NULL || mddev->thread == NULL) - return -EINVAL; - if (mddev->recovery || mddev->sync_thread) - return -EBUSY; + if (mddev->pers->quiesce == NULL || mddev->thread == NULL) { + rv = -EINVAL; + goto err; + } + if (mddev->recovery || mddev->sync_thread) { + rv = -EBUSY; + goto err; + } if (info->state & (1<bitmap) - return -EEXIST; - if (mddev->bitmap_info.default_offset == 0) - return -EINVAL; + if (mddev->bitmap) { + rv = -EEXIST; + goto err; + } + if (mddev->bitmap_info.default_offset == 0) { + rv = -EINVAL; + goto err; + } mddev->bitmap_info.offset = mddev->bitmap_info.default_offset; mddev->bitmap_info.space = @@ -6337,10 +6375,14 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->pers->quiesce(mddev, 0); } else { /* remove the bitmap */ - if (!mddev->bitmap) - return -ENOENT; - if (mddev->bitmap->storage.file) - return -EINVAL; + if (!mddev->bitmap) { + rv = -ENOENT; + goto err; + } + if (mddev->bitmap->storage.file) { + rv = -EINVAL; + goto err; + } mddev->pers->quiesce(mddev, 1); bitmap_destroy(mddev); mddev->pers->quiesce(mddev, 0); @@ -6348,6 +6390,12 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) } } md_update_sb(mddev, 1); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); + return rv; +err: + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_cancel(mddev); return rv; } @@ -7438,7 +7486,11 @@ int md_allow_write(struct mddev *mddev) mddev->safemode == 0) mddev->safemode = 1; spin_unlock(&mddev->lock); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); md_update_sb(mddev, 0); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); sysfs_notify_dirent_safe(mddev->sysfs_state); } else spin_unlock(&mddev->lock); @@ -7996,8 +8048,13 @@ void md_check_recovery(struct mddev *mddev) sysfs_notify_dirent_safe(mddev->sysfs_state); } - if (mddev->flags & MD_UPDATE_SB_FLAGS) + if (mddev->flags & MD_UPDATE_SB_FLAGS) { + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); md_update_sb(mddev, 0); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); + } if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { @@ -8095,6 +8152,8 @@ void md_reap_sync_thread(struct mddev *mddev) set_bit(MD_CHANGE_DEVS, &mddev->flags); } } + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && mddev->pers->finish_reshape) mddev->pers->finish_reshape(mddev); @@ -8107,6 +8166,8 @@ void md_reap_sync_thread(struct mddev *mddev) rdev->saved_raid_disk = -1; md_update_sb(mddev, 1); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_finish(mddev); clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); From 1d7e3e96117a864fe2ab3d02a14e49855319fdde Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 01:53:00 -0500 Subject: [PATCH 19/58] Reload superblock if METADATA_UPDATED is received Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, this is detected using events recorded in the devices. If it is old as compared to the mddev mark it is faulty. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 1 + drivers/md/md.c | 22 ++++++++++++++++++++++ drivers/md/md.h | 1 + 3 files changed, 24 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 5db491010835..7e419f05b568 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -320,6 +320,7 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) case METADATA_UPDATED: pr_info("%s: %d Received message: METADATA_UPDATE from %d\n", __func__, __LINE__, msg->slot); + md_reload_sb(mddev); break; case RESYNCING: pr_info("%s: %d Received message: RESYNCING from %d\n", diff --git a/drivers/md/md.c b/drivers/md/md.c index 0052e433d8a6..3eb45dc0537f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -8788,6 +8788,28 @@ err_wq: return ret; } +void md_reload_sb(struct mddev *mddev) +{ + struct md_rdev *rdev, *tmp; + + rdev_for_each_safe(rdev, tmp, mddev) { + rdev->sb_loaded = 0; + ClearPageUptodate(rdev->sb_page); + } + mddev->raid_disks = 0; + analyze_sbs(mddev); + rdev_for_each_safe(rdev, tmp, mddev) { + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); + /* since we don't write to faulty devices, we figure out if the + * disk is faulty by comparing events + */ + if (mddev->events > sb->events) + set_bit(Faulty, &rdev->flags); + } + +} +EXPORT_SYMBOL(md_reload_sb); + #ifndef MODULE /* diff --git a/drivers/md/md.h b/drivers/md/md.h index 81e568090d8f..bfebcfdf54e6 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -665,6 +665,7 @@ extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs, struct mddev *mddev); extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); +extern void md_reload_sb(struct mddev *mddev); static inline int mddev_check_plugged(struct mddev *mddev) { return !!blk_check_plugged(md_unplug, mddev, From 965400eb615dfb32d62cb3319a895bd94eb9f3b4 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 02:16:58 -0500 Subject: [PATCH 20/58] Send RESYNCING while performing resync start/stop When a resync is initiated, RESYNCING message is sent to all active nodes with the range (lo,hi). When the resync is over, a RESYNCING message is sent with (0,0). A high sector value of zero indicates that the resync is over. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 32 ++++++++++++++++++++++++++++++++ drivers/md/md-cluster.h | 2 ++ drivers/md/md.c | 4 ++-- 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 7e419f05b568..6428cc3ce38d 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -647,11 +647,43 @@ static int metadata_update_cancel(struct mddev *mddev) return dlm_unlock_sync(cinfo->token_lockres); } +static int resync_send(struct mddev *mddev, enum msg_type type, + sector_t lo, sector_t hi) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + struct cluster_msg cmsg; + int slot = cinfo->slot_number - 1; + + pr_info("%s:%d lo: %llu hi: %llu\n", __func__, __LINE__, + (unsigned long long)lo, + (unsigned long long)hi); + resync_info_update(mddev, lo, hi); + cmsg.type = cpu_to_le32(type); + cmsg.slot = cpu_to_le32(slot); + cmsg.low = cpu_to_le64(lo); + cmsg.high = cpu_to_le64(hi); + return sendmsg(cinfo, &cmsg); +} + +static int resync_start(struct mddev *mddev, sector_t lo, sector_t hi) +{ + pr_info("%s:%d\n", __func__, __LINE__); + return resync_send(mddev, RESYNCING, lo, hi); +} + +static void resync_finish(struct mddev *mddev) +{ + pr_info("%s:%d\n", __func__, __LINE__); + resync_send(mddev, RESYNCING, 0, 0); +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, .slot_number = slot_number, .resync_info_update = resync_info_update, + .resync_start = resync_start, + .resync_finish = resync_finish, .metadata_update_start = metadata_update_start, .metadata_update_finish = metadata_update_finish, .metadata_update_cancel = metadata_update_cancel, diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 658982afcf9b..054f9eafa065 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -12,6 +12,8 @@ struct md_cluster_operations { int (*leave)(struct mddev *mddev); int (*slot_number)(struct mddev *mddev); void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi); + int (*resync_start)(struct mddev *mddev, sector_t lo, sector_t hi); + void (*resync_finish)(struct mddev *mddev); int (*metadata_update_start)(struct mddev *mddev); int (*metadata_update_finish)(struct mddev *mddev); int (*metadata_update_cancel)(struct mddev *mddev); diff --git a/drivers/md/md.c b/drivers/md/md.c index 3eb45dc0537f..a1af24d369fc 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7692,7 +7692,7 @@ void md_do_sync(struct md_thread *thread) update_time = jiffies; if (mddev_is_clustered(mddev)) - md_cluster_ops->resync_info_update(mddev, j, max_sectors); + md_cluster_ops->resync_start(mddev, j, max_sectors); blk_start_plug(&plug); while (j < max_sectors) { @@ -7817,7 +7817,7 @@ void md_do_sync(struct md_thread *thread) mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); if (mddev_is_clustered(mddev)) - md_cluster_ops->resync_info_update(mddev, 0, 0); + md_cluster_ops->resync_finish(mddev); if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && mddev->curr_resync > 2) { From e59721ccdc65dd4fbd8f311a063ecc8f6232dbcc Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 02:30:30 -0500 Subject: [PATCH 21/58] Resync start/Finish actions When a RESYNC_START message arrives, the node removes the entry with the current slot number and adds the range to the suspend_list. Simlarly, when a RESYNC_FINISHED message is received, node clears entry with respect to the bitmap number. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 46 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 6428cc3ce38d..6b0dffebc90f 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -314,6 +314,50 @@ static void ack_bast(void *arg, int mode) md_wakeup_thread(cinfo->recv_thread); } +static void __remove_suspend_info(struct md_cluster_info *cinfo, int slot) +{ + struct suspend_info *s, *tmp; + + list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list) + if (slot == s->slot) { + pr_info("%s:%d Deleting suspend_info: %d\n", + __func__, __LINE__, slot); + list_del(&s->list); + kfree(s); + break; + } +} + +static void remove_suspend_info(struct md_cluster_info *cinfo, int slot) +{ + spin_lock_irq(&cinfo->suspend_lock); + __remove_suspend_info(cinfo, slot); + spin_unlock_irq(&cinfo->suspend_lock); +} + + +static void process_suspend_info(struct md_cluster_info *cinfo, + int slot, sector_t lo, sector_t hi) +{ + struct suspend_info *s; + + if (!hi) { + remove_suspend_info(cinfo, slot); + return; + } + s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL); + if (!s) + return; + s->slot = slot; + s->lo = lo; + s->hi = hi; + spin_lock_irq(&cinfo->suspend_lock); + /* Remove existing entry (if exists) before adding */ + __remove_suspend_info(cinfo, slot); + list_add(&s->list, &cinfo->suspend_list); + spin_unlock_irq(&cinfo->suspend_lock); +} + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) { switch (msg->type) { @@ -325,6 +369,8 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) case RESYNCING: pr_info("%s: %d Received message: RESYNCING from %d\n", __func__, __LINE__, msg->slot); + process_suspend_info(mddev->cluster_info, msg->slot, + msg->low, msg->high); break; }; } From 589a1c491621ab81a1955d17d634636522c1b4c1 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Sat, 7 Jun 2014 02:39:37 -0500 Subject: [PATCH 22/58] Suspend writes in RAID1 if within range If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 20 ++++++++++++++++++++ drivers/md/md-cluster.h | 1 + drivers/md/md.c | 1 + drivers/md/raid1.c | 11 ++++++++--- 4 files changed, 30 insertions(+), 3 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 6b0dffebc90f..d85a6ca4443e 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -723,6 +723,25 @@ static void resync_finish(struct mddev *mddev) resync_send(mddev, RESYNCING, 0, 0); } +static int area_resyncing(struct mddev *mddev, sector_t lo, sector_t hi) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + int ret = 0; + struct suspend_info *s; + + spin_lock_irq(&cinfo->suspend_lock); + if (list_empty(&cinfo->suspend_list)) + goto out; + list_for_each_entry(s, &cinfo->suspend_list, list) + if (hi > s->lo && lo < s->hi) { + ret = 1; + break; + } +out: + spin_unlock_irq(&cinfo->suspend_lock); + return ret; +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, @@ -733,6 +752,7 @@ static struct md_cluster_operations cluster_ops = { .metadata_update_start = metadata_update_start, .metadata_update_finish = metadata_update_finish, .metadata_update_cancel = metadata_update_cancel, + .area_resyncing = area_resyncing, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 054f9eafa065..03785402afaa 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -17,6 +17,7 @@ struct md_cluster_operations { int (*metadata_update_start)(struct mddev *mddev); int (*metadata_update_finish)(struct mddev *mddev); int (*metadata_update_cancel)(struct mddev *mddev); + int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index a1af24d369fc..fe0484648de4 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -68,6 +68,7 @@ static LIST_HEAD(pers_list); static DEFINE_SPINLOCK(pers_lock); struct md_cluster_operations *md_cluster_ops; +EXPORT_SYMBOL(md_cluster_ops); struct module *md_cluster_mod; EXPORT_SYMBOL(md_cluster_mod); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 4153da5d4011..3aa58ab5b1fd 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1101,8 +1101,10 @@ static void make_request(struct mddev *mddev, struct bio * bio) md_write_start(mddev, bio); /* wait on superblock update early */ if (bio_data_dir(bio) == WRITE && - bio_end_sector(bio) > mddev->suspend_lo && - bio->bi_iter.bi_sector < mddev->suspend_hi) { + ((bio_end_sector(bio) > mddev->suspend_lo && + bio->bi_iter.bi_sector < mddev->suspend_hi) || + (mddev_is_clustered(mddev) && + md_cluster_ops->area_resyncing(mddev, bio->bi_iter.bi_sector, bio_end_sector(bio))))) { /* As the suspend_* range is controlled by * userspace, we want an interruptible * wait. @@ -1113,7 +1115,10 @@ static void make_request(struct mddev *mddev, struct bio * bio) prepare_to_wait(&conf->wait_barrier, &w, TASK_INTERRUPTIBLE); if (bio_end_sector(bio) <= mddev->suspend_lo || - bio->bi_iter.bi_sector >= mddev->suspend_hi) + bio->bi_iter.bi_sector >= mddev->suspend_hi || + (mddev_is_clustered(mddev) && + !md_cluster_ops->area_resyncing(mddev, + bio->bi_iter.bi_sector, bio_end_sector(bio)))) break; schedule(); } From 7d49ffcfa3cc08aa2301bf3fdb1e423a3fd33ee7 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 12 Aug 2014 10:13:19 -0500 Subject: [PATCH 23/58] Read from the first device when an area is resyncing set choose_first true for cluster read in read balance when the area is resyncing. Signed-off-by: Lidong Zhong Signed-off-by: Goldwyn Rodrigues --- drivers/md/raid1.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 3aa58ab5b1fd..f70d74189d16 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -539,7 +539,13 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect has_nonrot_disk = 0; choose_next_idle = 0; - choose_first = (conf->mddev->recovery_cp < this_sector + sectors); + if ((conf->mddev->recovery_cp < this_sector + sectors) || + (mddev_is_clustered(conf->mddev) && + md_cluster_ops->area_resyncing(conf->mddev, this_sector, + this_sector + sectors))) + choose_first = 1; + else + choose_first = 0; for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { sector_t dist; From 1aee41f637694d4bbf91c24195f2b63e3f6badd2 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Wed, 29 Oct 2014 18:51:31 -0500 Subject: [PATCH 24/58] Add new disk to clustered array Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong Signed-off-by: Goldwyn Rodrigues --- drivers/md/md-cluster.c | 104 ++++++++++++++++++++++++++++++++- drivers/md/md-cluster.h | 4 ++ drivers/md/md.c | 52 ++++++++++++++++- drivers/md/md.h | 5 ++ drivers/md/raid1.c | 1 + include/uapi/linux/raid/md_p.h | 6 ++ include/uapi/linux/raid/md_u.h | 1 + 7 files changed, 169 insertions(+), 4 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index d85a6ca4443e..03e521a9ca7d 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -12,11 +12,13 @@ #include #include #include +#include #include "md.h" #include "bitmap.h" #include "md-cluster.h" #define LVB_SIZE 64 +#define NEW_DEV_TIMEOUT 5000 struct dlm_lock_resource { dlm_lockspace_t *ls; @@ -56,19 +58,25 @@ struct md_cluster_info { struct dlm_lock_resource *ack_lockres; struct dlm_lock_resource *message_lockres; struct dlm_lock_resource *token_lockres; + struct dlm_lock_resource *no_new_dev_lockres; struct md_thread *recv_thread; + struct completion newdisk_completion; }; enum msg_type { METADATA_UPDATED = 0, RESYNCING, + NEWDISK, }; struct cluster_msg { int type; int slot; + /* TODO: Unionize this for smaller footprint */ sector_t low; sector_t high; + char uuid[16]; + int raid_slot; }; static void sync_ast(void *arg) @@ -358,13 +366,41 @@ static void process_suspend_info(struct md_cluster_info *cinfo, spin_unlock_irq(&cinfo->suspend_lock); } +static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) +{ + char disk_uuid[64]; + struct md_cluster_info *cinfo = mddev->cluster_info; + char event_name[] = "EVENT=ADD_DEVICE"; + char raid_slot[16]; + char *envp[] = {event_name, disk_uuid, raid_slot, NULL}; + int len; + + len = snprintf(disk_uuid, 64, "DEVICE_UUID="); + pretty_uuid(disk_uuid + len, cmsg->uuid); + snprintf(raid_slot, 16, "RAID_DISK=%d", cmsg->raid_slot); + pr_info("%s:%d Sending kobject change with %s and %s\n", __func__, __LINE__, disk_uuid, raid_slot); + init_completion(&cinfo->newdisk_completion); + kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp); + wait_for_completion_timeout(&cinfo->newdisk_completion, + NEW_DEV_TIMEOUT); +} + + +static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + + md_reload_sb(mddev); + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); +} + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) { switch (msg->type) { case METADATA_UPDATED: pr_info("%s: %d Received message: METADATA_UPDATE from %d\n", __func__, __LINE__, msg->slot); - md_reload_sb(mddev); + process_metadata_update(mddev, msg); break; case RESYNCING: pr_info("%s: %d Received message: RESYNCING from %d\n", @@ -372,6 +408,10 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) process_suspend_info(mddev->cluster_info, msg->slot, msg->low, msg->high); break; + case NEWDISK: + pr_info("%s: %d Received message: NEWDISK from %d\n", + __func__, __LINE__, msg->slot); + process_add_new_disk(mddev, msg); }; } @@ -593,10 +633,18 @@ static int join(struct mddev *mddev, int nodes) cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0); if (!cinfo->ack_lockres) goto err; + cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0); + if (!cinfo->no_new_dev_lockres) + goto err; + /* get sync CR lock on ACK. */ if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR)) pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n", ret); + /* get sync CR lock on no-new-dev. */ + if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR)) + pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret); + pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number); snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1); @@ -621,6 +669,7 @@ err: lockres_free(cinfo->message_lockres); lockres_free(cinfo->token_lockres); lockres_free(cinfo->ack_lockres); + lockres_free(cinfo->no_new_dev_lockres); lockres_free(cinfo->bitmap_lockres); lockres_free(cinfo->sb_lock); if (cinfo->lockspace) @@ -642,6 +691,7 @@ static int leave(struct mddev *mddev) lockres_free(cinfo->message_lockres); lockres_free(cinfo->token_lockres); lockres_free(cinfo->ack_lockres); + lockres_free(cinfo->no_new_dev_lockres); lockres_free(cinfo->sb_lock); lockres_free(cinfo->bitmap_lockres); dlm_release_lockspace(cinfo->lockspace, 2); @@ -742,6 +792,55 @@ out: return ret; } +static int add_new_disk_start(struct mddev *mddev, struct md_rdev *rdev) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + struct cluster_msg cmsg; + int ret = 0; + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); + char *uuid = sb->device_uuid; + + memset(&cmsg, 0, sizeof(cmsg)); + cmsg.type = cpu_to_le32(NEWDISK); + memcpy(cmsg.uuid, uuid, 16); + cmsg.raid_slot = rdev->desc_nr; + lock_comm(cinfo); + ret = __sendmsg(cinfo, &cmsg); + if (ret) + return ret; + cinfo->no_new_dev_lockres->flags |= DLM_LKF_NOQUEUE; + ret = dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_EX); + cinfo->no_new_dev_lockres->flags &= ~DLM_LKF_NOQUEUE; + /* Some node does not "see" the device */ + if (ret == -EAGAIN) + ret = -ENOENT; + else + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); + return ret; +} + +static int add_new_disk_finish(struct mddev *mddev) +{ + struct cluster_msg cmsg; + struct md_cluster_info *cinfo = mddev->cluster_info; + int ret; + /* Write sb and inform others */ + md_update_sb(mddev, 1); + cmsg.type = METADATA_UPDATED; + ret = __sendmsg(cinfo, &cmsg); + unlock_comm(cinfo); + return ret; +} + +static void new_disk_ack(struct mddev *mddev, bool ack) +{ + struct md_cluster_info *cinfo = mddev->cluster_info; + + if (ack) + dlm_unlock_sync(cinfo->no_new_dev_lockres); + complete(&cinfo->newdisk_completion); +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, @@ -753,6 +852,9 @@ static struct md_cluster_operations cluster_ops = { .metadata_update_finish = metadata_update_finish, .metadata_update_cancel = metadata_update_cancel, .area_resyncing = area_resyncing, + .add_new_disk_start = add_new_disk_start, + .add_new_disk_finish = add_new_disk_finish, + .new_disk_ack = new_disk_ack, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 03785402afaa..60d7e58964f5 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -6,6 +6,7 @@ #include "md.h" struct mddev; +struct md_rdev; struct md_cluster_operations { int (*join)(struct mddev *mddev, int nodes); @@ -18,6 +19,9 @@ struct md_cluster_operations { int (*metadata_update_finish)(struct mddev *mddev); int (*metadata_update_cancel)(struct mddev *mddev); int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi); + int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev); + int (*add_new_disk_finish)(struct mddev *mddev); + void (*new_disk_ack)(struct mddev *mddev, bool ack); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index fe0484648de4..5703c2e89f3a 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2210,7 +2210,7 @@ static void sync_sbs(struct mddev *mddev, int nospares) } } -static void md_update_sb(struct mddev *mddev, int force_change) +void md_update_sb(struct mddev *mddev, int force_change) { struct md_rdev *rdev; int sync_req; @@ -2371,6 +2371,7 @@ repeat: wake_up(&rdev->blocked_wait); } } +EXPORT_SYMBOL(md_update_sb); /* words written to sysfs files may, or may not, be \n terminated. * We want to accept with case. For this we use cmd_match. @@ -3151,7 +3152,7 @@ static void analyze_sbs(struct mddev *mddev) kick_rdev_from_array(rdev); continue; } - if (rdev != freshest) + if (rdev != freshest) { if (super_types[mddev->major_version]. validate_super(mddev, rdev)) { printk(KERN_WARNING "md: kicking non-fresh %s" @@ -3160,6 +3161,15 @@ static void analyze_sbs(struct mddev *mddev) kick_rdev_from_array(rdev); continue; } + /* No device should have a Candidate flag + * when reading devices + */ + if (test_bit(Candidate, &rdev->flags)) { + pr_info("md: kicking Cluster Candidate %s from array!\n", + bdevname(rdev->bdev, b)); + kick_rdev_from_array(rdev); + } + } if (mddev->level == LEVEL_MULTIPATH) { rdev->desc_nr = i++; rdev->raid_disk = rdev->desc_nr; @@ -5655,7 +5665,6 @@ static int get_array_info(struct mddev *mddev, void __user *arg) info.state |= (1<major,info->minor); + if (mddev_is_clustered(mddev) && + !(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) { + pr_err("%s: Cannot add to clustered mddev. Try --cluster-add\n", + mdname(mddev)); + return -EINVAL; + } + if (info->major != MAJOR(dev) || info->minor != MINOR(dev)) return -EOVERFLOW; @@ -5830,6 +5846,25 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info) else clear_bit(WriteMostly, &rdev->flags); + /* + * check whether the device shows up in other nodes + */ + if (mddev_is_clustered(mddev)) { + if (info->state & (1 << MD_DISK_CANDIDATE)) { + /* Through --cluster-confirm */ + set_bit(Candidate, &rdev->flags); + md_cluster_ops->new_disk_ack(mddev, true); + } else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) { + /* --add initiated by this node */ + err = md_cluster_ops->add_new_disk_start(mddev, rdev); + if (err) { + md_cluster_ops->add_new_disk_finish(mddev); + export_rdev(rdev); + return err; + } + } + } + rdev->raid_disk = -1; err = bind_rdev_to_array(rdev, mddev); if (!err && !mddev->pers->hot_remove_disk) { @@ -5855,6 +5890,9 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info) if (!err) md_new_event(mddev); md_wakeup_thread(mddev->thread); + if (mddev_is_clustered(mddev) && + (info->state & (1 << MD_DISK_CLUSTER_ADD))) + md_cluster_ops->add_new_disk_finish(mddev); return err; } @@ -6456,6 +6494,7 @@ static inline bool md_ioctl_valid(unsigned int cmd) case SET_DISK_FAULTY: case STOP_ARRAY: case STOP_ARRAY_RO: + case CLUSTERED_DISK_NACK: return true; default: return false; @@ -6728,6 +6767,13 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode, goto unlock; } + case CLUSTERED_DISK_NACK: + if (mddev_is_clustered(mddev)) + md_cluster_ops->new_disk_ack(mddev, false); + else + err = -EINVAL; + goto unlock; + case HOT_ADD_DISK: err = hot_add_disk(mddev, new_decode_dev(arg)); goto unlock; diff --git a/drivers/md/md.h b/drivers/md/md.h index bfebcfdf54e6..6dc0ce09f50c 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -171,6 +171,10 @@ enum flag_bits { * a want_replacement device with same * raid_disk number. */ + Candidate, /* For clustered environments only: + * This device is seen locally but not + * by the whole cluster + */ }; #define BB_LEN_MASK (0x00000000000001FFULL) @@ -666,6 +670,7 @@ extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs, extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); extern void md_reload_sb(struct mddev *mddev); +extern void md_update_sb(struct mddev *mddev, int force); static inline int mddev_check_plugged(struct mddev *mddev) { return !!blk_check_plugged(md_unplug, mddev, diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index f70d74189d16..53ed5d48308f 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1571,6 +1571,7 @@ static int raid1_spare_active(struct mddev *mddev) struct md_rdev *rdev = conf->mirrors[i].rdev; struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev; if (repl + && !test_bit(Candidate, &repl->flags) && repl->recovery_offset == MaxSector && !test_bit(Faulty, &repl->flags) && !test_and_set_bit(In_sync, &repl->flags)) { diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h index 643489d33e68..2ae6131e69a5 100644 --- a/include/uapi/linux/raid/md_p.h +++ b/include/uapi/linux/raid/md_p.h @@ -78,6 +78,12 @@ #define MD_DISK_ACTIVE 1 /* disk is running or spare disk */ #define MD_DISK_SYNC 2 /* disk is in sync with the raid set */ #define MD_DISK_REMOVED 3 /* disk is in sync with the raid set */ +#define MD_DISK_CLUSTER_ADD 4 /* Initiate a disk add across the cluster + * For clustered enviroments only. + */ +#define MD_DISK_CANDIDATE 5 /* disk is added as spare (local) until confirmed + * For clustered enviroments only. + */ #define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config. * read requests will only be sent here in diff --git a/include/uapi/linux/raid/md_u.h b/include/uapi/linux/raid/md_u.h index 74e7c60c4716..1cb8aa6850b5 100644 --- a/include/uapi/linux/raid/md_u.h +++ b/include/uapi/linux/raid/md_u.h @@ -62,6 +62,7 @@ #define STOP_ARRAY _IO (MD_MAJOR, 0x32) #define STOP_ARRAY_RO _IO (MD_MAJOR, 0x33) #define RESTART_ARRAY_RW _IO (MD_MAJOR, 0x34) +#define CLUSTERED_DISK_NACK _IO (MD_MAJOR, 0x35) /* 63 partitions with the alternate major number (mdp) */ #define MdpMinorShift 6 From ba599aca520d6005138d1e5edb125fb83a130141 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Wed, 25 Feb 2015 11:44:11 +1100 Subject: [PATCH 25/58] md: fix error paths from bitmap_create. Recent change to bitmap_create mishandles errors. In particular a failure doesn't alway cause 'err' to be set. Signed-off-by: NeilBrown --- drivers/md/md.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 5703c2e89f3a..ae3432e57ccb 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6118,7 +6118,8 @@ static int set_bitmap_file(struct mddev *mddev, int fd) if (!IS_ERR(bitmap)) { mddev->bitmap = bitmap; err = bitmap_load(mddev); - } + } else + err = PTR_ERR(bitmap); } if (fd < 0 || err) { bitmap_destroy(mddev); @@ -6408,7 +6409,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) if (!IS_ERR(bitmap)) { mddev->bitmap = bitmap; rv = bitmap_load(mddev); - } + } else + rv = PTR_ERR(bitmap); if (rv) bitmap_destroy(mddev); mddev->pers->quiesce(mddev, 0); From 935f3d4fc62c1f4d99cd13344762e766cd3bf115 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 2 Mar 2015 17:02:29 +1100 Subject: [PATCH 26/58] md/bitmap: fix incorrect DIV_ROUND_UP usage. DIV_ROUTND_UP doesn't work on "long long", - and it should be sector_t anyway. Signed-off-by: NeilBrown --- drivers/md/bitmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index dd8c78043eab..03e0752af99f 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -571,11 +571,11 @@ static int bitmap_read_sb(struct bitmap *bitmap) re_read: /* If cluster_slot is set, the cluster is setup */ if (bitmap->cluster_slot >= 0) { - long long bm_blocks; + sector_t bm_blocks; bm_blocks = bitmap->mddev->resync_max_sectors / (bitmap->mddev->bitmap_info.chunksize >> 9); bm_blocks = bm_blocks << 3; - bm_blocks = DIV_ROUND_UP(bm_blocks, 4096); + bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset); From 3b0e6aacbfe04fa144c4732f269b09ce91177566 Mon Sep 17 00:00:00 2001 From: Stephen Rothwell Date: Tue, 3 Mar 2015 13:35:31 +1100 Subject: [PATCH 27/58] md/bitmap: use sector_div for sector_t divisions neilb: modified to not corrupt ->resync_max_sectors. sector_div usage fixed by Guoqing Jiang Signed-off-by: Stephen Rothwell Signed-off-by: NeilBrown --- drivers/md/bitmap.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index 03e0752af99f..ac79fef68143 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -571,9 +571,10 @@ static int bitmap_read_sb(struct bitmap *bitmap) re_read: /* If cluster_slot is set, the cluster is setup */ if (bitmap->cluster_slot >= 0) { - sector_t bm_blocks; + sector_t bm_blocks = bitmap->mddev->resync_max_sectors; - bm_blocks = bitmap->mddev->resync_max_sectors / (bitmap->mddev->bitmap_info.chunksize >> 9); + sector_div(bm_blocks, + bitmap->mddev->bitmap_info.chunksize >> 9); bm_blocks = bm_blocks << 3; bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); From fa8259da0e10b189e41ee60907ec2a499bb66019 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Mon, 2 Mar 2015 10:55:49 -0600 Subject: [PATCH 28/58] md: Fix stray --cluster-confirm crash A --cluster-confirm without an --add (by another node) can crash the kernel. Fix it by guarding it using a state. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md-cluster.c | 15 ++++++++++++++- drivers/md/md-cluster.h | 2 +- drivers/md/md.c | 8 ++++++-- 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 03e521a9ca7d..96679b22cfc0 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -42,6 +42,10 @@ struct resync_info { __le64 hi; }; +/* md_cluster_info flags */ +#define MD_CLUSTER_WAITING_FOR_NEWDISK 1 + + struct md_cluster_info { /* dlm lock space and resources for clustered raid. */ dlm_lockspace_t *lockspace; @@ -61,6 +65,7 @@ struct md_cluster_info { struct dlm_lock_resource *no_new_dev_lockres; struct md_thread *recv_thread; struct completion newdisk_completion; + unsigned long state; }; enum msg_type { @@ -380,9 +385,11 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg) snprintf(raid_slot, 16, "RAID_DISK=%d", cmsg->raid_slot); pr_info("%s:%d Sending kobject change with %s and %s\n", __func__, __LINE__, disk_uuid, raid_slot); init_completion(&cinfo->newdisk_completion); + set_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp); wait_for_completion_timeout(&cinfo->newdisk_completion, NEW_DEV_TIMEOUT); + clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); } @@ -832,13 +839,19 @@ static int add_new_disk_finish(struct mddev *mddev) return ret; } -static void new_disk_ack(struct mddev *mddev, bool ack) +static int new_disk_ack(struct mddev *mddev, bool ack) { struct md_cluster_info *cinfo = mddev->cluster_info; + if (!test_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state)) { + pr_warn("md-cluster(%s): Spurious cluster confirmation\n", mdname(mddev)); + return -EINVAL; + } + if (ack) dlm_unlock_sync(cinfo->no_new_dev_lockres); complete(&cinfo->newdisk_completion); + return 0; } static struct md_cluster_operations cluster_ops = { diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 60d7e58964f5..7417133c4295 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -21,7 +21,7 @@ struct md_cluster_operations { int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi); int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev); int (*add_new_disk_finish)(struct mddev *mddev); - void (*new_disk_ack)(struct mddev *mddev, bool ack); + int (*new_disk_ack)(struct mddev *mddev, bool ack); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index ae3432e57ccb..eb6f92e57ab6 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5755,7 +5755,7 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info) if (mddev_is_clustered(mddev) && !(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) { - pr_err("%s: Cannot add to clustered mddev. Try --cluster-add\n", + pr_err("%s: Cannot add to clustered mddev.\n", mdname(mddev)); return -EINVAL; } @@ -5853,7 +5853,11 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info) if (info->state & (1 << MD_DISK_CANDIDATE)) { /* Through --cluster-confirm */ set_bit(Candidate, &rdev->flags); - md_cluster_ops->new_disk_ack(mddev, true); + err = md_cluster_ops->new_disk_ack(mddev, true); + if (err) { + export_rdev(rdev); + return err; + } } else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) { /* --add initiated by this node */ err = md_cluster_ops->add_new_disk_start(mddev, rdev); From 6dc69c9c460b0cf05b5b3f323a8b944a2e52e76d Mon Sep 17 00:00:00 2001 From: kbuild test robot Date: Sat, 28 Feb 2015 07:04:37 +0800 Subject: [PATCH 29/58] md: recover_bitmaps() can be static drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static? Signed-off-by: Fengguang Wu Signed-off-by: NeilBrown --- drivers/md/md-cluster.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 96679b22cfc0..5062bd1929be 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -217,7 +217,7 @@ out: return s; } -void recover_bitmaps(struct md_thread *thread) +static void recover_bitmaps(struct md_thread *thread) { struct mddev *mddev = thread->mddev; struct md_cluster_info *cinfo = mddev->cluster_info; From 09dd1af2e011a13adce65b74425dfe31e1985e64 Mon Sep 17 00:00:00 2001 From: kbuild test robot Date: Sat, 28 Feb 2015 09:16:08 +0800 Subject: [PATCH 30/58] md/cluster: Communication Framework: fix semicolon.cocci warnings drivers/md/md-cluster.c:328:2-3: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu Signed-off-by: NeilBrown --- drivers/md/md-cluster.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 5062bd1929be..ae8bb547f94d 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -419,7 +419,7 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) pr_info("%s: %d Received message: NEWDISK from %d\n", __func__, __LINE__, msg->slot); process_add_new_disk(mddev, msg); - }; + } } /* From 124eb761edfdee13c02e48815b05d9bed7666d4c Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 24 Mar 2015 11:29:05 -0500 Subject: [PATCH 31/58] md: Fix bitmap offset calculations The calculations of bitmap offset is incorrect with respect to bits to bytes conversion. Also, remove an irrelevant duplicate message. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/bitmap.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index ac79fef68143..e98db04eb4f9 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -575,7 +575,9 @@ re_read: sector_div(bm_blocks, bitmap->mddev->bitmap_info.chunksize >> 9); - bm_blocks = bm_blocks << 3; + /* bits to bytes */ + bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t); + /* to 4k blocks */ bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, @@ -672,9 +674,6 @@ out: goto out_no_sb; } bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev); - pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, - bitmap->cluster_slot, - (unsigned long long)bitmap->mddev->bitmap_info.offset); goto re_read; } From 8c58f02e244d5b35fa38aa308007715d4957d4c7 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 21 Apr 2015 11:25:52 -0500 Subject: [PATCH 32/58] md-cluster: correct the num for comparison Since the node num of md-cluster is from zero, and cinfo->slot_number represents the slot num of dlm, no need to check for equality. Signed-off-by: Guoqing Jiang Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md-cluster.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index ae8bb547f94d..10c44a3a9d6a 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -612,9 +612,9 @@ static int join(struct mddev *mddev, int nodes) if (ret) goto err; wait_for_completion(&cinfo->completion); - if (nodes <= cinfo->slot_number) { - pr_err("md-cluster: Slot allotted(%d) greater than available slots(%d)", cinfo->slot_number - 1, - nodes); + if (nodes < cinfo->slot_number) { + pr_err("md-cluster: Slot allotted(%d) is greater than available slots(%d).", + cinfo->slot_number, nodes); ret = -ERANGE; goto err; } From fb56dfef4e31f214cfbfa0eb8a1949591c20b118 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 14 Apr 2015 10:43:24 -0500 Subject: [PATCH 33/58] md: Export and rename kick_rdev_from_array This export is required for clustering module in order to co-ordinate remove/readd a rdev from all nodes. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md.c | 17 +++++++++-------- drivers/md/md.h | 1 + 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index eb6f92e57ab6..bc1e43014292 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2168,11 +2168,12 @@ static void export_rdev(struct md_rdev *rdev) kobject_put(&rdev->kobj); } -static void kick_rdev_from_array(struct md_rdev *rdev) +void md_kick_rdev_from_array(struct md_rdev *rdev) { unbind_rdev_from_array(rdev); export_rdev(rdev); } +EXPORT_SYMBOL_GPL(md_kick_rdev_from_array); static void export_array(struct mddev *mddev) { @@ -2181,7 +2182,7 @@ static void export_array(struct mddev *mddev) while (!list_empty(&mddev->disks)) { rdev = list_first_entry(&mddev->disks, struct md_rdev, same_set); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); } mddev->raid_disks = 0; mddev->major_version = 0; @@ -2476,7 +2477,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) struct mddev *mddev = rdev->mddev; if (mddev_is_clustered(mddev)) md_cluster_ops->metadata_update_start(mddev); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); if (mddev->pers) md_update_sb(mddev, 1); md_new_event(mddev); @@ -3134,7 +3135,7 @@ static void analyze_sbs(struct mddev *mddev) "md: fatal superblock inconsistency in %s" " -- removing from array\n", bdevname(rdev->bdev,b)); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); } super_types[mddev->major_version]. @@ -3149,7 +3150,7 @@ static void analyze_sbs(struct mddev *mddev) "md: %s: %s: only %d devices permitted\n", mdname(mddev), bdevname(rdev->bdev, b), mddev->max_disks); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); continue; } if (rdev != freshest) { @@ -3158,7 +3159,7 @@ static void analyze_sbs(struct mddev *mddev) printk(KERN_WARNING "md: kicking non-fresh %s" " from array!\n", bdevname(rdev->bdev,b)); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); continue; } /* No device should have a Candidate flag @@ -3167,7 +3168,7 @@ static void analyze_sbs(struct mddev *mddev) if (test_bit(Candidate, &rdev->flags)) { pr_info("md: kicking Cluster Candidate %s from array!\n", bdevname(rdev->bdev, b)); - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); } } if (mddev->level == LEVEL_MULTIPATH) { @@ -5966,7 +5967,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev) if (rdev->raid_disk >= 0) goto busy; - kick_rdev_from_array(rdev); + md_kick_rdev_from_array(rdev); md_update_sb(mddev, 1); md_new_event(mddev); diff --git a/drivers/md/md.h b/drivers/md/md.h index 6dc0ce09f50c..d98c0d764d8f 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -671,6 +671,7 @@ extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs, extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); extern void md_reload_sb(struct mddev *mddev); extern void md_update_sb(struct mddev *mddev, int force); +extern void md_kick_rdev_from_array(struct md_rdev * rdev); static inline int mddev_check_plugged(struct mddev *mddev) { return !!blk_check_plugged(md_unplug, mddev, From 57d051dccaef395e0d8c0fff02cfc3a77bacc88c Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 14 Apr 2015 10:43:55 -0500 Subject: [PATCH 34/58] md: Export and rename find_rdev_nr_rcu This is required by the clustering module (patches to follow) to find the device to remove or re-add. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md.c | 9 +++++---- drivers/md/md.h | 1 + 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index bc1e43014292..d406a79f9140 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -642,7 +642,7 @@ void mddev_unlock(struct mddev *mddev) } EXPORT_SYMBOL_GPL(mddev_unlock); -static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr) +struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr) { struct md_rdev *rdev; @@ -652,6 +652,7 @@ static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr) return NULL; } +EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu); static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev) { @@ -2049,11 +2050,11 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev) int choice = 0; if (mddev->pers) choice = mddev->raid_disks; - while (find_rdev_nr_rcu(mddev, choice)) + while (md_find_rdev_nr_rcu(mddev, choice)) choice++; rdev->desc_nr = choice; } else { - if (find_rdev_nr_rcu(mddev, rdev->desc_nr)) { + if (md_find_rdev_nr_rcu(mddev, rdev->desc_nr)) { rcu_read_unlock(); return -EBUSY; } @@ -5721,7 +5722,7 @@ static int get_disk_info(struct mddev *mddev, void __user * arg) return -EFAULT; rcu_read_lock(); - rdev = find_rdev_nr_rcu(mddev, info.number); + rdev = md_find_rdev_nr_rcu(mddev, info.number); if (rdev) { info.major = MAJOR(rdev->bdev->bd_dev); info.minor = MINOR(rdev->bdev->bd_dev); diff --git a/drivers/md/md.h b/drivers/md/md.h index d98c0d764d8f..ecdce36ec6b8 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -672,6 +672,7 @@ extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); extern void md_reload_sb(struct mddev *mddev); extern void md_update_sb(struct mddev *mddev, int force); extern void md_kick_rdev_from_array(struct md_rdev * rdev); +struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); static inline int mddev_check_plugged(struct mddev *mddev) { return !!blk_check_plugged(md_unplug, mddev, From 88bcfef7be513e8bf5448e0025330fdd97c4c708 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 14 Apr 2015 10:44:44 -0500 Subject: [PATCH 35/58] md-cluster: remove capabilities This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md-cluster.c | 30 ++++++++++++++++++++++++++++++ drivers/md/md-cluster.h | 1 + drivers/md/md.c | 7 ++++++- 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 10c44a3a9d6a..30b41b70db17 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -72,6 +72,7 @@ enum msg_type { METADATA_UPDATED = 0, RESYNCING, NEWDISK, + REMOVE, }; struct cluster_msg { @@ -401,6 +402,16 @@ static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); } +static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) +{ + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); + + if (rdev) + md_kick_rdev_from_array(rdev); + else + pr_warn("%s: %d Could not find disk(%d) to REMOVE\n", __func__, __LINE__, msg->raid_slot); +} + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) { switch (msg->type) { @@ -419,6 +430,15 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) pr_info("%s: %d Received message: NEWDISK from %d\n", __func__, __LINE__, msg->slot); process_add_new_disk(mddev, msg); + break; + case REMOVE: + pr_info("%s: %d Received REMOVE from %d\n", + __func__, __LINE__, msg->slot); + process_remove_disk(mddev, msg); + break; + default: + pr_warn("%s:%d Received unknown message from %d\n", + __func__, __LINE__, msg->slot); } } @@ -854,6 +874,15 @@ static int new_disk_ack(struct mddev *mddev, bool ack) return 0; } +static int remove_disk(struct mddev *mddev, struct md_rdev *rdev) +{ + struct cluster_msg cmsg; + struct md_cluster_info *cinfo = mddev->cluster_info; + cmsg.type = REMOVE; + cmsg.raid_slot = rdev->desc_nr; + return __sendmsg(cinfo, &cmsg); +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, @@ -868,6 +897,7 @@ static struct md_cluster_operations cluster_ops = { .add_new_disk_start = add_new_disk_start, .add_new_disk_finish = add_new_disk_finish, .new_disk_ack = new_disk_ack, + .remove_disk = remove_disk, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 7417133c4295..71e51432c1f4 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -22,6 +22,7 @@ struct md_cluster_operations { int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev); int (*add_new_disk_finish)(struct mddev *mddev); int (*new_disk_ack)(struct mddev *mddev, bool ack); + int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index d406a79f9140..ca011d1d1de7 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2477,8 +2477,10 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) else { struct mddev *mddev = rdev->mddev; if (mddev_is_clustered(mddev)) - md_cluster_ops->metadata_update_start(mddev); + md_cluster_ops->remove_disk(mddev, rdev); md_kick_rdev_from_array(rdev); + if (mddev_is_clustered(mddev)) + md_cluster_ops->metadata_update_start(mddev); if (mddev->pers) md_update_sb(mddev, 1); md_new_event(mddev); @@ -5968,6 +5970,9 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev) if (rdev->raid_disk >= 0) goto busy; + if (mddev_is_clustered(mddev)) + md_cluster_ops->remove_disk(mddev, rdev); + md_kick_rdev_from_array(rdev); md_update_sb(mddev, 1); md_new_event(mddev); From a6da4ef85cef0382244fc588c901e133a2ec5109 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 14 Apr 2015 10:45:22 -0500 Subject: [PATCH 36/58] md: re-add a failed disk This adds the capability of re-adding a failed disk by writing "re-add" to /sys/block/mdXX/md/dev-YYY/state. This facilitates adding disks which have encountered a temporary error such as a network disconnection/hiccup in an iSCSI device, or a SAN cable disconnection which has been restored. In such a situation, you do not need to remove and re-add the device. Writing re-add to the failed device's state would add it again to the array and perform the recovery of only the blocks which were written after the device failed. This works for generic md, and is not related to clustering. However, this patch is to ease re-add operations listed above in clustering environments. Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/md.c | 57 ++++++++++++++++++++++++++++++++----------------- 1 file changed, 37 insertions(+), 20 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index ca011d1d1de7..429e95e9a942 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2375,6 +2375,36 @@ repeat: } EXPORT_SYMBOL(md_update_sb); +static int add_bound_rdev(struct md_rdev *rdev) +{ + struct mddev *mddev = rdev->mddev; + int err = 0; + + if (!mddev->pers->hot_remove_disk) { + /* If there is hot_add_disk but no hot_remove_disk + * then added disks for geometry changes, + * and should be added immediately. + */ + super_types[mddev->major_version]. + validate_super(mddev, rdev); + err = mddev->pers->hot_add_disk(mddev, rdev); + if (err) { + unbind_rdev_from_array(rdev); + export_rdev(rdev); + return err; + } + } + sysfs_notify_dirent_safe(rdev->sysfs_state); + + set_bit(MD_CHANGE_DEVS, &mddev->flags); + if (mddev->degraded) + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); + md_new_event(mddev); + md_wakeup_thread(mddev->thread); + return 0; +} + /* words written to sysfs files may, or may not, be \n terminated. * We want to accept with case. For this we use cmd_match. */ @@ -2564,6 +2594,12 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) clear_bit(Replacement, &rdev->flags); err = 0; } + } else if (cmd_match(buf, "re-add")) { + if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) { + clear_bit(Faulty, &rdev->flags); + err = add_bound_rdev(rdev); + } else + err = -EBUSY; } if (!err) sysfs_notify_dirent_safe(rdev->sysfs_state); @@ -5875,29 +5911,10 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info) rdev->raid_disk = -1; err = bind_rdev_to_array(rdev, mddev); - if (!err && !mddev->pers->hot_remove_disk) { - /* If there is hot_add_disk but no hot_remove_disk - * then added disks for geometry changes, - * and should be added immediately. - */ - super_types[mddev->major_version]. - validate_super(mddev, rdev); - err = mddev->pers->hot_add_disk(mddev, rdev); - if (err) - unbind_rdev_from_array(rdev); - } if (err) export_rdev(rdev); else - sysfs_notify_dirent_safe(rdev->sysfs_state); - - set_bit(MD_CHANGE_DEVS, &mddev->flags); - if (mddev->degraded) - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); - if (!err) - md_new_event(mddev); - md_wakeup_thread(mddev->thread); + err = add_bound_rdev(rdev); if (mddev_is_clustered(mddev) && (info->state & (1 << MD_DISK_CLUSTER_ADD))) md_cluster_ops->add_new_disk_finish(mddev); From 97f6cd39da227459cb46ed4088d37d5d8db51c50 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 14 Apr 2015 10:45:42 -0500 Subject: [PATCH 37/58] md-cluster: re-add capabilities When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang Signed-off-by: Goldwyn Rodrigues Signed-off-by: NeilBrown --- drivers/md/bitmap.c | 20 +++++++++-------- drivers/md/bitmap.h | 2 +- drivers/md/md-cluster.c | 48 ++++++++++++++++++++++++++++++++++++++++- drivers/md/md-cluster.h | 1 + drivers/md/md.c | 13 +++++++++-- 5 files changed, 71 insertions(+), 13 deletions(-) diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index e98db04eb4f9..2bc56e2a3526 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -1851,7 +1851,7 @@ EXPORT_SYMBOL_GPL(bitmap_load); * to our bitmap */ int bitmap_copy_from_slot(struct mddev *mddev, int slot, - sector_t *low, sector_t *high) + sector_t *low, sector_t *high, bool clear_bits) { int rv = 0, i, j; sector_t block, lo = 0, hi = 0; @@ -1882,14 +1882,16 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot, } } - bitmap_update_sb(bitmap); - /* Setting this for the ev_page should be enough. - * And we do not require both write_all and PAGE_DIRT either - */ - for (i = 0; i < bitmap->storage.file_pages; i++) - set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); - bitmap_write_all(bitmap); - bitmap_unplug(bitmap); + if (clear_bits) { + bitmap_update_sb(bitmap); + /* Setting this for the ev_page should be enough. + * And we do not require both write_all and PAGE_DIRT either + */ + for (i = 0; i < bitmap->storage.file_pages; i++) + set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); + bitmap_write_all(bitmap); + bitmap_unplug(bitmap); + } *low = lo; *high = hi; err: diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h index 4aabc74ef7b9..f1f4dd01090d 100644 --- a/drivers/md/bitmap.h +++ b/drivers/md/bitmap.h @@ -263,7 +263,7 @@ void bitmap_daemon_work(struct mddev *mddev); int bitmap_resize(struct bitmap *bitmap, sector_t blocks, int chunksize, int init); int bitmap_copy_from_slot(struct mddev *mddev, int slot, - sector_t *lo, sector_t *hi); + sector_t *lo, sector_t *hi, bool clear_bits); #endif #endif diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c index 30b41b70db17..fcfc4b9b2672 100644 --- a/drivers/md/md-cluster.c +++ b/drivers/md/md-cluster.c @@ -73,6 +73,7 @@ enum msg_type { RESYNCING, NEWDISK, REMOVE, + RE_ADD, }; struct cluster_msg { @@ -253,7 +254,7 @@ static void recover_bitmaps(struct md_thread *thread) str, ret); goto clear_bit; } - ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi); + ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi, true); if (ret) { pr_err("md-cluster: Could not copy data from bitmap %d\n", slot); goto dlm_unlock; @@ -412,6 +413,16 @@ static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) pr_warn("%s: %d Could not find disk(%d) to REMOVE\n", __func__, __LINE__, msg->raid_slot); } +static void process_readd_disk(struct mddev *mddev, struct cluster_msg *msg) +{ + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); + + if (rdev && test_bit(Faulty, &rdev->flags)) + clear_bit(Faulty, &rdev->flags); + else + pr_warn("%s: %d Could not find disk(%d) which is faulty", __func__, __LINE__, msg->raid_slot); +} + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) { switch (msg->type) { @@ -436,6 +447,11 @@ static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) __func__, __LINE__, msg->slot); process_remove_disk(mddev, msg); break; + case RE_ADD: + pr_info("%s: %d Received RE_ADD from %d\n", + __func__, __LINE__, msg->slot); + process_readd_disk(mddev, msg); + break; default: pr_warn("%s:%d Received unknown message from %d\n", __func__, __LINE__, msg->slot); @@ -883,6 +899,35 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev) return __sendmsg(cinfo, &cmsg); } +static int gather_bitmaps(struct md_rdev *rdev) +{ + int sn, err; + sector_t lo, hi; + struct cluster_msg cmsg; + struct mddev *mddev = rdev->mddev; + struct md_cluster_info *cinfo = mddev->cluster_info; + + cmsg.type = RE_ADD; + cmsg.raid_slot = rdev->desc_nr; + err = sendmsg(cinfo, &cmsg); + if (err) + goto out; + + for (sn = 0; sn < mddev->bitmap_info.nodes; sn++) { + if (sn == (cinfo->slot_number - 1)) + continue; + err = bitmap_copy_from_slot(mddev, sn, &lo, &hi, false); + if (err) { + pr_warn("md-cluster: Could not gather bitmaps from slot %d", sn); + goto out; + } + if ((hi > 0) && (lo < mddev->recovery_cp)) + mddev->recovery_cp = lo; + } +out: + return err; +} + static struct md_cluster_operations cluster_ops = { .join = join, .leave = leave, @@ -898,6 +943,7 @@ static struct md_cluster_operations cluster_ops = { .add_new_disk_finish = add_new_disk_finish, .new_disk_ack = new_disk_ack, .remove_disk = remove_disk, + .gather_bitmaps = gather_bitmaps, }; static int __init cluster_init(void) diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h index 71e51432c1f4..6817ee00e053 100644 --- a/drivers/md/md-cluster.h +++ b/drivers/md/md-cluster.h @@ -23,6 +23,7 @@ struct md_cluster_operations { int (*add_new_disk_finish)(struct mddev *mddev); int (*new_disk_ack)(struct mddev *mddev, bool ack); int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev); + int (*gather_bitmaps)(struct md_rdev *rdev); }; #endif /* _MD_CLUSTER_H */ diff --git a/drivers/md/md.c b/drivers/md/md.c index 429e95e9a942..d9cac48db2fc 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2596,8 +2596,17 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) } } else if (cmd_match(buf, "re-add")) { if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) { - clear_bit(Faulty, &rdev->flags); - err = add_bound_rdev(rdev); + /* clear_bit is performed _after_ all the devices + * have their local Faulty bit cleared. If any writes + * happen in the meantime in the local node, they + * will land in the local bitmap, which will be synced + * by this node eventually + */ + if (!mddev_is_clustered(rdev->mddev) || + (err = md_cluster_ops->gather_bitmaps(rdev)) == 0) { + clear_bit(Faulty, &rdev->flags); + err = add_bound_rdev(rdev); + } } else err = -EBUSY; } From 50c37b136a3807eda44afe16529b5af701ec49f5 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Mon, 23 Mar 2015 17:36:38 +1100 Subject: [PATCH 38/58] md: don't require sync_min to be a multiple of chunk_size. There is really no need for sync_min to be a multiple of chunk_size, and values read from here often aren't. That means you cannot read a value and expect to be able to write it back later. So remove the chunk_size check, and round down to a multiple of 4K, to be sure everything works with 4K-sector devices. Signed-off-by: NeilBrown --- drivers/md/md.c | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 0d8968535976..3724a29eaf0e 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4427,7 +4427,6 @@ min_sync_store(struct mddev *mddev, const char *buf, size_t len) { unsigned long long min; int err; - int chunk; if (kstrtoull(buf, 10, &min)) return -EINVAL; @@ -4441,16 +4440,8 @@ min_sync_store(struct mddev *mddev, const char *buf, size_t len) if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) goto out_unlock; - /* Must be a multiple of chunk_size */ - chunk = mddev->chunk_sectors; - if (chunk) { - sector_t temp = min; - - err = -EINVAL; - if (sector_div(temp, chunk)) - goto out_unlock; - } - mddev->resync_min = min; + /* Round down to multiple of 4K for safety */ + mddev->resync_min = round_down(min, 8); err = 0; out_unlock: From 09314799e4f0589e52bafcd0ca3556c60468bc0e Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Thu, 19 Feb 2015 16:04:40 +1100 Subject: [PATCH 39/58] md: remove 'go_faster' option from ->sync_request() This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown --- drivers/md/md.c | 5 ++--- drivers/md/md.h | 2 +- drivers/md/raid1.c | 9 +-------- drivers/md/raid10.c | 8 +------- drivers/md/raid5.c | 3 +-- 5 files changed, 6 insertions(+), 21 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 3724a29eaf0e..3b9b032aa006 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7820,8 +7820,7 @@ void md_do_sync(struct md_thread *thread) if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) break; - sectors = mddev->pers->sync_request(mddev, j, &skipped, - currspeed < speed_min(mddev)); + sectors = mddev->pers->sync_request(mddev, j, &skipped); if (sectors == 0) { set_bit(MD_RECOVERY_INTR, &mddev->recovery); break; @@ -7898,7 +7897,7 @@ void md_do_sync(struct md_thread *thread) wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); /* tell personality that we are finished */ - mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); + mddev->pers->sync_request(mddev, max_sectors, &skipped); if (mddev_is_clustered(mddev)) md_cluster_ops->resync_finish(mddev); diff --git a/drivers/md/md.h b/drivers/md/md.h index ecdce36ec6b8..4046a6c6f223 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -506,7 +506,7 @@ struct md_personality int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); int (*spare_active) (struct mddev *mddev); - sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster); + sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped); int (*resize) (struct mddev *mddev, sector_t sectors); sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); int (*check_reshape) (struct mddev *mddev); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 4efa50186a2a..9157a29c8dbf 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -2480,7 +2480,7 @@ static int init_resync(struct r1conf *conf) * that can be installed to exclude normal IO requests. */ -static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) +static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) { struct r1conf *conf = mddev->private; struct r1bio *r1_bio; @@ -2533,13 +2533,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp *skipped = 1; return sync_blocks; } - /* - * If there is non-resync activity waiting for a turn, - * and resync is going fast enough, - * then let it though before starting on this new sync request. - */ - if (!go_faster && conf->nr_waiting) - msleep_interruptible(1000); bitmap_cond_end_sync(mddev->bitmap, sector_nr); r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO); diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index a7196c49d15d..e793ab6b3570 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -2889,7 +2889,7 @@ static int init_resync(struct r10conf *conf) */ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, - int *skipped, int go_faster) + int *skipped) { struct r10conf *conf = mddev->private; struct r10bio *r10_bio; @@ -2994,12 +2994,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, if (conf->geo.near_copies < conf->geo.raid_disks && max_sector > (sector_nr | chunk_mask)) max_sector = (sector_nr | chunk_mask) + 1; - /* - * If there is non-resync activity waiting for us then - * put in a delay to throttle resync. - */ - if (!go_faster && conf->nr_waiting) - msleep_interruptible(1000); /* Again, very different code for resync and recovery. * Both must result in an r10bio with a list of bios that diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index cd2f96b2c572..022a0d99e110 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5050,8 +5050,7 @@ ret: return reshape_sectors; } -/* FIXME go_faster isn't used */ -static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) +static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) { struct r5conf *conf = mddev->private; struct stripe_head *sh; From ac8fa4196d205ac8fff3f8932bddbad4f16e4110 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Thu, 19 Feb 2015 16:55:00 +1100 Subject: [PATCH 40/58] md: allow resync to go faster when there is competing IO. When md notices non-sync IO happening while it is trying to resync (or reshape or recover) it slows down to the set minimum. The default minimum might have made sense many years ago but the drives have become faster. Changing the default to match the times isn't really a long term solution. This patch changes the code so that instead of waiting until the speed has dropped to the target, it just waits until pending requests have completed. This means that the delay inserted is a function of the speed of the devices. Testing shows that: - for some loads, the resync speed is unchanged. For those loads increasing the minimum doesn't change the speed either. So this is a good result. To increase resync speed under such loads we would probably need to increase the resync window size. - for other loads, resync speed does increase to a reasonable fraction (e.g. 20%) of maximum possible, and throughput of the load only drops a little bit (e.g. 10%) - for other loads, throughput of the non-sync load drops quite a bit more. These seem to be latency-sensitive loads. So it isn't a perfect solution, but it is mostly an improvement. Signed-off-by: NeilBrown --- drivers/md/md.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 3b9b032aa006..d4f31e195e26 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7880,11 +7880,18 @@ void md_do_sync(struct md_thread *thread) /((jiffies-mddev->resync_mark)/HZ +1) +1; if (currspeed > speed_min(mddev)) { - if ((currspeed > speed_max(mddev)) || - !is_mddev_idle(mddev, 0)) { + if (currspeed > speed_max(mddev)) { msleep(500); goto repeat; } + if (!is_mddev_idle(mddev, 0)) { + /* + * Give other IO more of a chance. + * The faster the devices, the less we wait. + */ + wait_event(mddev->recovery_wait, + !atomic_read(&mddev->recovery_active)); + } } } printk(KERN_INFO "md: %s: %s %s.\n",mdname(mddev), desc, From 753f2856cda2a130d38ebc3db97bff66c1ef3ca7 Mon Sep 17 00:00:00 2001 From: Heinz Mauelshagen Date: Fri, 13 Feb 2015 19:48:01 +0100 Subject: [PATCH 41/58] md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid The patch makes 3 references to mddev->queue in the raid0 personality conditional in order to allow for it to be accessed from dm-raid. Mandatory, because md instances underneath dm-raid don't manage a request queue of their own which'd lead to oopses without the patch. Signed-off-by: Heinz Mauelshagen Tested-by: Heinz Mauelshagen Signed-off-by: NeilBrown --- drivers/md/raid0.c | 46 ++++++++++++++++++++++++++-------------------- 1 file changed, 26 insertions(+), 20 deletions(-) diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 3b5d7f704aa3..2cb59a641cd2 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -271,14 +271,16 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf) goto abort; } - blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); - blk_queue_io_opt(mddev->queue, - (mddev->chunk_sectors << 9) * mddev->raid_disks); + if (mddev->queue) { + blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); + blk_queue_io_opt(mddev->queue, + (mddev->chunk_sectors << 9) * mddev->raid_disks); - if (!discard_supported) - queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); - else - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); + if (!discard_supported) + queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); + else + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); + } pr_debug("md/raid0:%s: done.\n", mdname(mddev)); *private_conf = conf; @@ -429,9 +431,12 @@ static int raid0_run(struct mddev *mddev) } if (md_check_no_bitmap(mddev)) return -EINVAL; - blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); - blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); - blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); + + if (mddev->queue) { + blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); + blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); + blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); + } /* if private is not null, we are here after takeover */ if (mddev->private == NULL) { @@ -448,16 +453,17 @@ static int raid0_run(struct mddev *mddev) printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n", mdname(mddev), (unsigned long long)mddev->array_sectors); - /* calculate the max read-ahead size. - * For read-ahead of large files to be effective, we need to - * readahead at least twice a whole stripe. i.e. number of devices - * multiplied by chunk size times 2. - * If an individual device has an ra_pages greater than the - * chunk size, then we will not drive that device as hard as it - * wants. We consider this a configuration error: a larger - * chunksize should be used in that case. - */ - { + + if (mddev->queue) { + /* calculate the max read-ahead size. + * For read-ahead of large files to be effective, we need to + * readahead at least twice a whole stripe. i.e. number of devices + * multiplied by chunk size times 2. + * If an individual device has an ra_pages greater than the + * chunk size, then we will not drive that device as hard as it + * wants. We consider this a configuration error: a larger + * chunksize should be used in that case. + */ int stripe = mddev->raid_disks * (mddev->chunk_sectors << 9) / PAGE_SIZE; if (mddev->queue->backing_dev_info.ra_pages < 2* stripe) From 46d5b785621ad10a373e292f9101ccfc626466e0 Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:02 +1100 Subject: [PATCH 42/58] raid5: use flex_array for scribble data Use flex_array for scribble data. Next patch will batch several stripes together, so scribble data should be able to cover several stripes, so this patch also allocates scribble data for stripes across a chunk. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 89 +++++++++++++++++++++++++++++----------------- drivers/md/raid5.h | 6 +--- 2 files changed, 57 insertions(+), 38 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 022a0d99e110..7fb510e54548 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include "md.h" @@ -1109,16 +1110,28 @@ static void ops_complete_compute(void *stripe_head_ref) /* return a pointer to the address conversion region of the scribble buffer */ static addr_conv_t *to_addr_conv(struct stripe_head *sh, - struct raid5_percpu *percpu) + struct raid5_percpu *percpu, int i) { - return percpu->scribble + sizeof(struct page *) * (sh->disks + 2); + void *addr; + + addr = flex_array_get(percpu->scribble, i); + return addr + sizeof(struct page *) * (sh->disks + 2); +} + +/* return a pointer to the address conversion region of the scribble buffer */ +static struct page **to_addr_page(struct raid5_percpu *percpu, int i) +{ + void *addr; + + addr = flex_array_get(percpu->scribble, i); + return addr; } static struct dma_async_tx_descriptor * ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu) { int disks = sh->disks; - struct page **xor_srcs = percpu->scribble; + struct page **xor_srcs = to_addr_page(percpu, 0); int target = sh->ops.target; struct r5dev *tgt = &sh->dev[target]; struct page *xor_dest = tgt->page; @@ -1138,7 +1151,7 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu) atomic_inc(&sh->count); init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL, - ops_complete_compute, sh, to_addr_conv(sh, percpu)); + ops_complete_compute, sh, to_addr_conv(sh, percpu, 0)); if (unlikely(count == 1)) tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit); else @@ -1183,7 +1196,7 @@ static struct dma_async_tx_descriptor * ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) { int disks = sh->disks; - struct page **blocks = percpu->scribble; + struct page **blocks = to_addr_page(percpu, 0); int target; int qd_idx = sh->qd_idx; struct dma_async_tx_descriptor *tx; @@ -1216,7 +1229,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) BUG_ON(blocks[count+1] != dest); /* q should already be set */ init_async_submit(&submit, ASYNC_TX_FENCE, NULL, ops_complete_compute, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); } else { /* Compute any data- or p-drive using XOR */ @@ -1229,7 +1242,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL, ops_complete_compute, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit); } @@ -1248,7 +1261,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu) struct r5dev *tgt = &sh->dev[target]; struct r5dev *tgt2 = &sh->dev[target2]; struct dma_async_tx_descriptor *tx; - struct page **blocks = percpu->scribble; + struct page **blocks = to_addr_page(percpu, 0); struct async_submit_ctl submit; pr_debug("%s: stripe %llu block1: %d block2: %d\n", @@ -1290,7 +1303,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu) /* Missing P+Q, just recompute */ init_async_submit(&submit, ASYNC_TX_FENCE, NULL, ops_complete_compute, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); return async_gen_syndrome(blocks, 0, syndrome_disks+2, STRIPE_SIZE, &submit); } else { @@ -1314,21 +1327,21 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu) init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL, NULL, NULL, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit); count = set_syndrome_sources(blocks, sh); init_async_submit(&submit, ASYNC_TX_FENCE, tx, ops_complete_compute, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); return async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); } } else { init_async_submit(&submit, ASYNC_TX_FENCE, NULL, ops_complete_compute, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); if (failb == syndrome_disks) { /* We're missing D+P. */ return async_raid6_datap_recov(syndrome_disks+2, @@ -1356,7 +1369,7 @@ ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, struct dma_async_tx_descriptor *tx) { int disks = sh->disks; - struct page **xor_srcs = percpu->scribble; + struct page **xor_srcs = to_addr_page(percpu, 0); int count = 0, pd_idx = sh->pd_idx, i; struct async_submit_ctl submit; @@ -1374,7 +1387,7 @@ ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, } init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx, - ops_complete_prexor, sh, to_addr_conv(sh, percpu)); + ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0)); tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit); return tx; @@ -1478,7 +1491,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu, struct dma_async_tx_descriptor *tx) { int disks = sh->disks; - struct page **xor_srcs = percpu->scribble; + struct page **xor_srcs = to_addr_page(percpu, 0); struct async_submit_ctl submit; int count = 0, pd_idx = sh->pd_idx, i; struct page *xor_dest; @@ -1531,7 +1544,7 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu, atomic_inc(&sh->count); init_async_submit(&submit, flags, tx, ops_complete_reconstruct, sh, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); if (unlikely(count == 1)) tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit); else @@ -1543,7 +1556,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, struct dma_async_tx_descriptor *tx) { struct async_submit_ctl submit; - struct page **blocks = percpu->scribble; + struct page **blocks = to_addr_page(percpu, 0); int count, i; pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1567,7 +1580,7 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, atomic_inc(&sh->count); init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct, - sh, to_addr_conv(sh, percpu)); + sh, to_addr_conv(sh, percpu, 0)); async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); } @@ -1589,7 +1602,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu) int pd_idx = sh->pd_idx; int qd_idx = sh->qd_idx; struct page *xor_dest; - struct page **xor_srcs = percpu->scribble; + struct page **xor_srcs = to_addr_page(percpu, 0); struct dma_async_tx_descriptor *tx; struct async_submit_ctl submit; int count; @@ -1608,7 +1621,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu) } init_async_submit(&submit, 0, NULL, NULL, NULL, - to_addr_conv(sh, percpu)); + to_addr_conv(sh, percpu, 0)); tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &sh->ops.zero_sum_result, &submit); @@ -1619,7 +1632,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu) static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp) { - struct page **srcs = percpu->scribble; + struct page **srcs = to_addr_page(percpu, 0); struct async_submit_ctl submit; int count; @@ -1632,7 +1645,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu atomic_inc(&sh->count); init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check, - sh, to_addr_conv(sh, percpu)); + sh, to_addr_conv(sh, percpu, 0)); async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE, &sh->ops.zero_sum_result, percpu->spare_page, &submit); } @@ -1772,13 +1785,21 @@ static int grow_stripes(struct r5conf *conf, int num) * calculate over all devices (not just the data blocks), using zeros in place * of the P and Q blocks. */ -static size_t scribble_len(int num) +static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags) { + struct flex_array *ret; size_t len; len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2); - - return len; + ret = flex_array_alloc(len, cnt, flags); + if (!ret) + return NULL; + /* always prealloc all elements, so no locking is required */ + if (flex_array_prealloc(ret, 0, cnt, flags)) { + flex_array_free(ret); + return NULL; + } + return ret; } static int resize_stripes(struct r5conf *conf, int newsize) @@ -1896,16 +1917,16 @@ static int resize_stripes(struct r5conf *conf, int newsize) err = -ENOMEM; get_online_cpus(); - conf->scribble_len = scribble_len(newsize); for_each_present_cpu(cpu) { struct raid5_percpu *percpu; - void *scribble; + struct flex_array *scribble; percpu = per_cpu_ptr(conf->percpu, cpu); - scribble = kmalloc(conf->scribble_len, GFP_NOIO); + scribble = scribble_alloc(newsize, conf->chunk_sectors / + STRIPE_SECTORS, GFP_NOIO); if (scribble) { - kfree(percpu->scribble); + flex_array_free(percpu->scribble); percpu->scribble = scribble; } else { err = -ENOMEM; @@ -5698,7 +5719,8 @@ raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks) static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu) { safe_put_page(percpu->spare_page); - kfree(percpu->scribble); + if (percpu->scribble) + flex_array_free(percpu->scribble); percpu->spare_page = NULL; percpu->scribble = NULL; } @@ -5708,7 +5730,9 @@ static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu if (conf->level == 6 && !percpu->spare_page) percpu->spare_page = alloc_page(GFP_KERNEL); if (!percpu->scribble) - percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL); + percpu->scribble = scribble_alloc(max(conf->raid_disks, + conf->previous_raid_disks), conf->chunk_sectors / + STRIPE_SECTORS, GFP_KERNEL); if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) { free_scratch_buffer(conf, percpu); @@ -5878,7 +5902,6 @@ static struct r5conf *setup_conf(struct mddev *mddev) else conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks; max_disks = max(conf->raid_disks, conf->previous_raid_disks); - conf->scribble_len = scribble_len(max_disks); conf->disks = kzalloc(max_disks * sizeof(struct disk_info), GFP_KERNEL); @@ -5906,6 +5929,7 @@ static struct r5conf *setup_conf(struct mddev *mddev) INIT_LIST_HEAD(conf->temp_inactive_list + i); conf->level = mddev->new_level; + conf->chunk_sectors = mddev->new_chunk_sectors; if (raid5_alloc_percpu(conf) != 0) goto abort; @@ -5938,7 +5962,6 @@ static struct r5conf *setup_conf(struct mddev *mddev) conf->fullsync = 1; } - conf->chunk_sectors = mddev->new_chunk_sectors; conf->level = mddev->new_level; if (conf->level == 6) conf->max_degraded = 2; diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 983e18a83db1..1d0f241d7d3b 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -458,15 +458,11 @@ struct r5conf { /* per cpu variables */ struct raid5_percpu { struct page *spare_page; /* Used when checking P/Q in raid6 */ - void *scribble; /* space for constructing buffer + struct flex_array *scribble; /* space for constructing buffer * lists and performing address * conversions */ } __percpu *percpu; - size_t scribble_len; /* size of scribble region must be - * associated with conf to handle - * cpu hotplug while reshaping - */ #ifdef CONFIG_HOTPLUG_CPU struct notifier_block cpu_notify; #endif From da41ba65972532a04f73927c903029a7ec3bc2ed Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:03 +1100 Subject: [PATCH 43/58] raid5: add a new flag to track if a stripe can be batched A freshly new stripe with write request can be batched. Any time the stripe is handled or new read is queued, the flag will be cleared. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 12 +++++++++--- drivers/md/raid5.h | 1 + 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 7fb510e54548..49b0f23dbad2 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -555,6 +555,7 @@ retry: goto retry; insert_hash(conf, sh); sh->cpu = smp_processor_id(); + set_bit(STRIPE_BATCH_READY, &sh->state); } static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector, @@ -2645,7 +2646,8 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s, * toread/towrite point to the first in a chain. * The bi_next chain must be in order. */ -static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite) +static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, + int forwrite, int previous) { struct bio **bip; struct r5conf *conf = sh->raid_conf; @@ -2678,6 +2680,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi)) goto overlap; + if (!forwrite || previous) + clear_bit(STRIPE_BATCH_READY, &sh->state); + BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next); if (*bip) bi->bi_next = *bip; @@ -3824,6 +3829,7 @@ static void handle_stripe(struct stripe_head *sh) return; } + clear_bit(STRIPE_BATCH_READY, &sh->state); if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) { spin_lock(&sh->stripe_lock); /* Cannot process 'sync' concurrently with 'discard' */ @@ -4793,7 +4799,7 @@ static void make_request(struct mddev *mddev, struct bio * bi) } if (test_bit(STRIPE_EXPANDING, &sh->state) || - !add_stripe_bio(sh, bi, dd_idx, rw)) { + !add_stripe_bio(sh, bi, dd_idx, rw, previous)) { /* Stripe is busy expanding or * add failed due to overlap. Flush everything * and wait a while @@ -5206,7 +5212,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio) return handled; } - if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { + if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) { release_stripe(sh); raid5_set_bi_processed_stripes(raid_bio, scnt); conf->retry_read_aligned = raid_bio; diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 1d0f241d7d3b..37644e3d5293 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -327,6 +327,7 @@ enum { STRIPE_ON_UNPLUG_LIST, STRIPE_DISCARD, STRIPE_ON_RELEASE_LIST, + STRIPE_BATCH_READY, }; /* From 7a87f43405e91ca12b8770eb689dd9886f217091 Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:03 +1100 Subject: [PATCH 44/58] raid5: track overwrite disk count Track overwrite disk count, so we can know if a stripe is a full stripe write. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 14 +++++++++++++- drivers/md/raid5.h | 4 ++++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 49b0f23dbad2..e801c6669c6d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -553,6 +553,7 @@ retry: } if (read_seqcount_retry(&conf->gen_lock, seq)) goto retry; + sh->overwrite_disks = 0; insert_hash(conf, sh); sh->cpu = smp_processor_id(); set_bit(STRIPE_BATCH_READY, &sh->state); @@ -710,6 +711,12 @@ get_active_stripe(struct r5conf *conf, sector_t sector, return sh; } +static bool is_full_stripe_write(struct stripe_head *sh) +{ + BUG_ON(sh->overwrite_disks > (sh->disks - sh->raid_conf->max_degraded)); + return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded); +} + /* Determine if 'data_offset' or 'new_data_offset' should be used * in this stripe_head. */ @@ -1413,6 +1420,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) spin_lock_irq(&sh->stripe_lock); chosen = dev->towrite; dev->towrite = NULL; + sh->overwrite_disks = 0; BUG_ON(dev->written); wbi = dev->written = chosen; spin_unlock_irq(&sh->stripe_lock); @@ -2700,7 +2708,8 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, sector = bio_end_sector(bi); } if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS) - set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags); + if (!test_and_set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags)) + sh->overwrite_disks++; } pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n", @@ -2772,6 +2781,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh, /* fail all writes first */ bi = sh->dev[i].towrite; sh->dev[i].towrite = NULL; + sh->overwrite_disks = 0; spin_unlock_irq(&sh->stripe_lock); if (bi) bitmap_end = 1; @@ -4630,12 +4640,14 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi) } set_bit(STRIPE_DISCARD, &sh->state); finish_wait(&conf->wait_for_overlap, &w); + sh->overwrite_disks = 0; for (d = 0; d < conf->raid_disks; d++) { if (d == sh->pd_idx || d == sh->qd_idx) continue; sh->dev[d].towrite = bi; set_bit(R5_OVERWRITE, &sh->dev[d].flags); raid5_inc_bi_active_stripes(bi); + sh->overwrite_disks++; } spin_unlock_irq(&sh->stripe_lock); if (conf->mddev->bitmap) { diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 37644e3d5293..4cc1a48127c7 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -210,6 +210,10 @@ struct stripe_head { atomic_t count; /* nr of active thread/requests */ int bm_seq; /* sequence number for bitmap flushes */ int disks; /* disks in stripe */ + int overwrite_disks; /* total overwrite disks in stripe, + * this is only checked when stripe + * has STRIPE_BATCH_READY + */ enum check_states check_state; enum reconstruct_states reconstruct_state; spinlock_t stripe_lock; From 59fc630b8b5f9f21c8ce3ba153341c107dce1b0c Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:03 +1100 Subject: [PATCH 45/58] RAID5: batch adjacent full stripe write stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k unit. Idealy we should use big size for adjacent full stripe writes. Bigger stripe cache size means less stripes runing in the state machine so can reduce cpu overhead. And also bigger size can cause bigger IO size dispatched to under layer disks. With below patch, we will automatically batch adjacent full stripe write together. Such stripes will be added to the batch list. Only the first stripe of the list will be put to handle_list and so run handle_stripe(). Some steps of handle_stripe() are extended to cover all stripes of the list, including ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes running in handle_stripe() and we send IO of whole stripe list together to increase IO size. Stripes added to a batch list have some limitations. A batch list can only include full stripe write and can't cross chunk boundary to make sure stripes have the same parity disks. Stripes in a batch list must be in the same state (no written, toread and so on). If a stripe is in a batch list, all new read/write to add_stripe_bio will be blocked to overlap conflict till the batch list is handled. The limitations will make sure stripes in a batch list be in exactly the same state in the life circly. I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6 PCIe SSD. This patch improves around 30% performance and IO size to under layer disk is exactly 32k. I also run a 4k randwrite test in the same array to make sure the performance isn't changed with the patch. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 353 ++++++++++++++++++++++++++++++++++++++++++--- drivers/md/raid5.h | 4 + 2 files changed, 334 insertions(+), 23 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e801c6669c6d..717189e74243 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -526,6 +526,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous) BUG_ON(atomic_read(&sh->count) != 0); BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); BUG_ON(stripe_operations_active(sh)); + BUG_ON(sh->batch_head); pr_debug("init_stripe called, stripe %llu\n", (unsigned long long)sector); @@ -717,6 +718,124 @@ static bool is_full_stripe_write(struct stripe_head *sh) return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded); } +static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) +{ + local_irq_disable(); + if (sh1 > sh2) { + spin_lock(&sh2->stripe_lock); + spin_lock_nested(&sh1->stripe_lock, 1); + } else { + spin_lock(&sh1->stripe_lock); + spin_lock_nested(&sh2->stripe_lock, 1); + } +} + +static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) +{ + spin_unlock(&sh1->stripe_lock); + spin_unlock(&sh2->stripe_lock); + local_irq_enable(); +} + +/* Only freshly new full stripe normal write stripe can be added to a batch list */ +static bool stripe_can_batch(struct stripe_head *sh) +{ + return test_bit(STRIPE_BATCH_READY, &sh->state) && + is_full_stripe_write(sh); +} + +/* we only do back search */ +static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh) +{ + struct stripe_head *head; + sector_t head_sector, tmp_sec; + int hash; + int dd_idx; + + if (!stripe_can_batch(sh)) + return; + /* Don't cross chunks, so stripe pd_idx/qd_idx is the same */ + tmp_sec = sh->sector; + if (!sector_div(tmp_sec, conf->chunk_sectors)) + return; + head_sector = sh->sector - STRIPE_SECTORS; + + hash = stripe_hash_locks_hash(head_sector); + spin_lock_irq(conf->hash_locks + hash); + head = __find_stripe(conf, head_sector, conf->generation); + if (head && !atomic_inc_not_zero(&head->count)) { + spin_lock(&conf->device_lock); + if (!atomic_read(&head->count)) { + if (!test_bit(STRIPE_HANDLE, &head->state)) + atomic_inc(&conf->active_stripes); + BUG_ON(list_empty(&head->lru) && + !test_bit(STRIPE_EXPANDING, &head->state)); + list_del_init(&head->lru); + if (head->group) { + head->group->stripes_cnt--; + head->group = NULL; + } + } + atomic_inc(&head->count); + spin_unlock(&conf->device_lock); + } + spin_unlock_irq(conf->hash_locks + hash); + + if (!head) + return; + if (!stripe_can_batch(head)) + goto out; + + lock_two_stripes(head, sh); + /* clear_batch_ready clear the flag */ + if (!stripe_can_batch(head) || !stripe_can_batch(sh)) + goto unlock_out; + + if (sh->batch_head) + goto unlock_out; + + dd_idx = 0; + while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx) + dd_idx++; + if (head->dev[dd_idx].towrite->bi_rw != sh->dev[dd_idx].towrite->bi_rw) + goto unlock_out; + + if (head->batch_head) { + spin_lock(&head->batch_head->batch_lock); + /* This batch list is already running */ + if (!stripe_can_batch(head)) { + spin_unlock(&head->batch_head->batch_lock); + goto unlock_out; + } + + /* + * at this point, head's BATCH_READY could be cleared, but we + * can still add the stripe to batch list + */ + list_add(&sh->batch_list, &head->batch_list); + spin_unlock(&head->batch_head->batch_lock); + + sh->batch_head = head->batch_head; + } else { + head->batch_head = head; + sh->batch_head = head->batch_head; + spin_lock(&head->batch_lock); + list_add_tail(&sh->batch_list, &head->batch_list); + spin_unlock(&head->batch_lock); + } + + if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) + if (atomic_dec_return(&conf->preread_active_stripes) + < IO_THRESHOLD) + md_wakeup_thread(conf->mddev->thread); + + atomic_inc(&sh->count); +unlock_out: + unlock_two_stripes(head, sh); +out: + release_stripe(head); +} + /* Determine if 'data_offset' or 'new_data_offset' should be used * in this stripe_head. */ @@ -747,6 +866,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) { struct r5conf *conf = sh->raid_conf; int i, disks = sh->disks; + struct stripe_head *head_sh = sh; might_sleep(); @@ -755,6 +875,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) int replace_only = 0; struct bio *bi, *rbi; struct md_rdev *rdev, *rrdev = NULL; + + sh = head_sh; if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) { if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags)) rw = WRITE_FUA; @@ -773,6 +895,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags)) rw |= REQ_SYNC; +again: bi = &sh->dev[i].req; rbi = &sh->dev[i].rreq; /* For writing to replacement */ @@ -791,7 +914,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) /* We raced and saw duplicates */ rrdev = NULL; } else { - if (test_bit(R5_ReadRepl, &sh->dev[i].flags) && rrdev) + if (test_bit(R5_ReadRepl, &head_sh->dev[i].flags) && rrdev) rdev = rrdev; rrdev = NULL; } @@ -862,13 +985,15 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) __func__, (unsigned long long)sh->sector, bi->bi_rw, i); atomic_inc(&sh->count); + if (sh != head_sh) + atomic_inc(&head_sh->count); if (use_new_offset(conf, sh)) bi->bi_iter.bi_sector = (sh->sector + rdev->new_data_offset); else bi->bi_iter.bi_sector = (sh->sector + rdev->data_offset); - if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) + if (test_bit(R5_ReadNoMerge, &head_sh->dev[i].flags)) bi->bi_rw |= REQ_NOMERGE; if (test_bit(R5_SkipCopy, &sh->dev[i].flags)) @@ -912,6 +1037,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) __func__, (unsigned long long)sh->sector, rbi->bi_rw, i); atomic_inc(&sh->count); + if (sh != head_sh) + atomic_inc(&head_sh->count); if (use_new_offset(conf, sh)) rbi->bi_iter.bi_sector = (sh->sector + rrdev->new_data_offset); @@ -945,6 +1072,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) clear_bit(R5_LOCKED, &sh->dev[i].flags); set_bit(STRIPE_HANDLE, &sh->state); } + + if (!head_sh->batch_head) + continue; + sh = list_first_entry(&sh->batch_list, struct stripe_head, + batch_list); + if (sh != head_sh) + goto again; } } @@ -1060,6 +1194,7 @@ static void ops_run_biofill(struct stripe_head *sh) struct async_submit_ctl submit; int i; + BUG_ON(sh->batch_head); pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1148,6 +1283,8 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu) struct async_submit_ctl submit; int i; + BUG_ON(sh->batch_head); + pr_debug("%s: stripe %llu block: %d\n", __func__, (unsigned long long)sh->sector, target); BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags)); @@ -1214,6 +1351,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) int i; int count; + BUG_ON(sh->batch_head); if (sh->ops.target < 0) target = sh->ops.target2; else if (sh->ops.target2 < 0) @@ -1272,6 +1410,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu) struct page **blocks = to_addr_page(percpu, 0); struct async_submit_ctl submit; + BUG_ON(sh->batch_head); pr_debug("%s: stripe %llu block1: %d block2: %d\n", __func__, (unsigned long long)sh->sector, target, target2); BUG_ON(target < 0 || target2 < 0); @@ -1384,6 +1523,7 @@ ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, /* existing parity data subtracted */ struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; + BUG_ON(sh->batch_head); pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1406,17 +1546,21 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) { int disks = sh->disks; int i; + struct stripe_head *head_sh = sh; pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); for (i = disks; i--; ) { - struct r5dev *dev = &sh->dev[i]; + struct r5dev *dev; struct bio *chosen; - if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) { + sh = head_sh; + if (test_and_clear_bit(R5_Wantdrain, &head_sh->dev[i].flags)) { struct bio *wbi; +again: + dev = &sh->dev[i]; spin_lock_irq(&sh->stripe_lock); chosen = dev->towrite; dev->towrite = NULL; @@ -1445,6 +1589,15 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) } wbi = r5_next_bio(wbi, dev->sector); } + + if (head_sh->batch_head) { + sh = list_first_entry(&sh->batch_list, + struct stripe_head, + batch_list); + if (sh == head_sh) + continue; + goto again; + } } } @@ -1500,12 +1653,15 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu, struct dma_async_tx_descriptor *tx) { int disks = sh->disks; - struct page **xor_srcs = to_addr_page(percpu, 0); + struct page **xor_srcs; struct async_submit_ctl submit; - int count = 0, pd_idx = sh->pd_idx, i; + int count, pd_idx = sh->pd_idx, i; struct page *xor_dest; int prexor = 0; unsigned long flags; + int j = 0; + struct stripe_head *head_sh = sh; + int last_stripe; pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1522,15 +1678,18 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu, ops_complete_reconstruct(sh); return; } +again: + count = 0; + xor_srcs = to_addr_page(percpu, j); /* check if prexor is active which means only process blocks * that are part of a read-modify-write (written) */ - if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) { + if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) { prexor = 1; xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if (dev->written) + if (head_sh->dev[i].written) xor_srcs[count++] = dev->page; } } else { @@ -1547,17 +1706,32 @@ ops_run_reconstruct5(struct stripe_head *sh, struct raid5_percpu *percpu, * set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST * for the synchronous xor case */ - flags = ASYNC_TX_ACK | - (prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST); + last_stripe = !head_sh->batch_head || + list_first_entry(&sh->batch_list, + struct stripe_head, batch_list) == head_sh; + if (last_stripe) { + flags = ASYNC_TX_ACK | + (prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST); - atomic_inc(&sh->count); + atomic_inc(&head_sh->count); + init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh, + to_addr_conv(sh, percpu, j)); + } else { + flags = prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST; + init_async_submit(&submit, flags, tx, NULL, NULL, + to_addr_conv(sh, percpu, j)); + } - init_async_submit(&submit, flags, tx, ops_complete_reconstruct, sh, - to_addr_conv(sh, percpu, 0)); if (unlikely(count == 1)) tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit); else tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit); + if (!last_stripe) { + j++; + sh = list_first_entry(&sh->batch_list, struct stripe_head, + batch_list); + goto again; + } } static void @@ -1565,8 +1739,10 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, struct dma_async_tx_descriptor *tx) { struct async_submit_ctl submit; - struct page **blocks = to_addr_page(percpu, 0); - int count, i; + struct page **blocks; + int count, i, j = 0; + struct stripe_head *head_sh = sh; + int last_stripe; pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1584,13 +1760,27 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, return; } +again: + blocks = to_addr_page(percpu, j); count = set_syndrome_sources(blocks, sh); + last_stripe = !head_sh->batch_head || + list_first_entry(&sh->batch_list, + struct stripe_head, batch_list) == head_sh; - atomic_inc(&sh->count); - - init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct, - sh, to_addr_conv(sh, percpu, 0)); + if (last_stripe) { + atomic_inc(&head_sh->count); + init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct, + head_sh, to_addr_conv(sh, percpu, j)); + } else + init_async_submit(&submit, 0, tx, NULL, NULL, + to_addr_conv(sh, percpu, j)); async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); + if (!last_stripe) { + j++; + sh = list_first_entry(&sh->batch_list, struct stripe_head, + batch_list); + goto again; + } } static void ops_complete_check(void *stripe_head_ref) @@ -1620,6 +1810,7 @@ static void ops_run_check_p(struct stripe_head *sh, struct raid5_percpu *percpu) pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); + BUG_ON(sh->batch_head); count = 0; xor_dest = sh->dev[pd_idx].page; xor_srcs[count++] = xor_dest; @@ -1648,6 +1839,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu pr_debug("%s: stripe %llu checkp: %d\n", __func__, (unsigned long long)sh->sector, checkp); + BUG_ON(sh->batch_head); count = set_syndrome_sources(srcs, sh); if (!checkp) srcs[count] = NULL; @@ -1715,7 +1907,7 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) BUG(); } - if (overlap_clear) + if (overlap_clear && !sh->batch_head) for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; if (test_and_clear_bit(R5_Overlap, &dev->flags)) @@ -1745,6 +1937,10 @@ static int grow_one_stripe(struct r5conf *conf, int hash) atomic_set(&sh->count, 1); atomic_inc(&conf->active_stripes); INIT_LIST_HEAD(&sh->lru); + + spin_lock_init(&sh->batch_lock); + INIT_LIST_HEAD(&sh->batch_list); + sh->batch_head = NULL; release_stripe(sh); return 1; } @@ -2188,6 +2384,9 @@ static void raid5_end_write_request(struct bio *bi, int error) clear_bit(R5_LOCKED, &sh->dev[i].flags); set_bit(STRIPE_HANDLE, &sh->state); release_stripe(sh); + + if (sh->batch_head && sh != sh->batch_head) + release_stripe(sh->batch_head); } static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous); @@ -2674,6 +2873,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, * protect it. */ spin_lock_irq(&sh->stripe_lock); + /* Don't allow new IO added to stripes in batch list */ + if (sh->batch_head) + goto overlap; if (forwrite) { bip = &sh->dev[dd_idx].towrite; if (*bip == NULL) @@ -2723,6 +2925,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, sh->bm_seq = conf->seq_flush+1; set_bit(STRIPE_BIT_DELAY, &sh->state); } + + if (stripe_can_batch(sh)) + stripe_add_to_batch_list(conf, sh); return 1; overlap: @@ -2755,6 +2960,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh, struct bio **return_bi) { int i; + BUG_ON(sh->batch_head); for (i = disks; i--; ) { struct bio *bi; int bitmap_end = 0; @@ -2870,6 +3076,7 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh, int abort = 0; int i; + BUG_ON(sh->batch_head); clear_bit(STRIPE_SYNCING, &sh->state); if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags)) wake_up(&conf->wait_for_overlap); @@ -3100,6 +3307,7 @@ static void handle_stripe_fill(struct stripe_head *sh, { int i; + BUG_ON(sh->batch_head); /* look for blocks to read/compute, skip this if a compute * is already in flight, or if the stripe contents are in the * midst of changing due to a write @@ -3123,6 +3331,9 @@ static void handle_stripe_clean_event(struct r5conf *conf, int i; struct r5dev *dev; int discard_pending = 0; + struct stripe_head *head_sh = sh; + bool do_endio = false; + int wakeup_nr = 0; for (i = disks; i--; ) if (sh->dev[i].written) { @@ -3138,8 +3349,11 @@ static void handle_stripe_clean_event(struct r5conf *conf, clear_bit(R5_UPTODATE, &dev->flags); if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) { WARN_ON(test_bit(R5_UPTODATE, &dev->flags)); - dev->page = dev->orig_page; } + do_endio = true; + +returnbi: + dev->page = dev->orig_page; wbi = dev->written; dev->written = NULL; while (wbi && wbi->bi_iter.bi_sector < @@ -3156,6 +3370,17 @@ static void handle_stripe_clean_event(struct r5conf *conf, STRIPE_SECTORS, !test_bit(STRIPE_DEGRADED, &sh->state), 0); + if (head_sh->batch_head) { + sh = list_first_entry(&sh->batch_list, + struct stripe_head, + batch_list); + if (sh != head_sh) { + dev = &sh->dev[i]; + goto returnbi; + } + } + sh = head_sh; + dev = &sh->dev[i]; } else if (test_bit(R5_Discard, &dev->flags)) discard_pending = 1; WARN_ON(test_bit(R5_SkipCopy, &dev->flags)); @@ -3177,8 +3402,17 @@ static void handle_stripe_clean_event(struct r5conf *conf, * will be reinitialized */ spin_lock_irq(&conf->device_lock); +unhash: remove_hash(sh); + if (head_sh->batch_head) { + sh = list_first_entry(&sh->batch_list, + struct stripe_head, batch_list); + if (sh != head_sh) + goto unhash; + } spin_unlock_irq(&conf->device_lock); + sh = head_sh; + if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) set_bit(STRIPE_HANDLE, &sh->state); @@ -3187,6 +3421,39 @@ static void handle_stripe_clean_event(struct r5conf *conf, if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state)) if (atomic_dec_and_test(&conf->pending_full_writes)) md_wakeup_thread(conf->mddev->thread); + + if (!head_sh->batch_head || !do_endio) + return; + for (i = 0; i < head_sh->disks; i++) { + if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags)) + wakeup_nr++; + } + while (!list_empty(&head_sh->batch_list)) { + int i; + sh = list_first_entry(&head_sh->batch_list, + struct stripe_head, batch_list); + list_del_init(&sh->batch_list); + + sh->state = head_sh->state & (~((1 << STRIPE_ACTIVE) | + (1 << STRIPE_PREREAD_ACTIVE))); + sh->check_state = head_sh->check_state; + sh->reconstruct_state = head_sh->reconstruct_state; + for (i = 0; i < sh->disks; i++) { + if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) + wakeup_nr++; + sh->dev[i].flags = head_sh->dev[i].flags; + } + + spin_lock_irq(&sh->stripe_lock); + sh->batch_head = NULL; + spin_unlock_irq(&sh->stripe_lock); + release_stripe(sh); + } + + spin_lock_irq(&head_sh->stripe_lock); + head_sh->batch_head = NULL; + spin_unlock_irq(&head_sh->stripe_lock); + wake_up_nr(&conf->wait_for_overlap, wakeup_nr); } static void handle_stripe_dirtying(struct r5conf *conf, @@ -3326,6 +3593,7 @@ static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh, { struct r5dev *dev = NULL; + BUG_ON(sh->batch_head); set_bit(STRIPE_HANDLE, &sh->state); switch (sh->check_state) { @@ -3416,6 +3684,7 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, int qd_idx = sh->qd_idx; struct r5dev *dev; + BUG_ON(sh->batch_head); set_bit(STRIPE_HANDLE, &sh->state); BUG_ON(s->failed > 2); @@ -3579,6 +3848,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh) * copy some of them into a target stripe for expand. */ struct dma_async_tx_descriptor *tx = NULL; + BUG_ON(sh->batch_head); clear_bit(STRIPE_EXPAND_SOURCE, &sh->state); for (i = 0; i < sh->disks; i++) if (i != sh->pd_idx && i != sh->qd_idx) { @@ -3822,6 +4092,38 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) rcu_read_unlock(); } +static int clear_batch_ready(struct stripe_head *sh) +{ + struct stripe_head *tmp; + if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state)) + return 0; + spin_lock(&sh->stripe_lock); + if (!sh->batch_head) { + spin_unlock(&sh->stripe_lock); + return 0; + } + + /* + * this stripe could be added to a batch list before we check + * BATCH_READY, skips it + */ + if (sh->batch_head != sh) { + spin_unlock(&sh->stripe_lock); + return 1; + } + spin_lock(&sh->batch_lock); + list_for_each_entry(tmp, &sh->batch_list, batch_list) + clear_bit(STRIPE_BATCH_READY, &tmp->state); + spin_unlock(&sh->batch_lock); + spin_unlock(&sh->stripe_lock); + + /* + * BATCH_READY is cleared, no new stripes can be added. + * batch_list can be accessed without lock + */ + return 0; +} + static void handle_stripe(struct stripe_head *sh) { struct stripe_head_state s; @@ -3839,7 +4141,11 @@ static void handle_stripe(struct stripe_head *sh) return; } - clear_bit(STRIPE_BATCH_READY, &sh->state); + if (clear_batch_ready(sh) ) { + clear_bit_unlock(STRIPE_ACTIVE, &sh->state); + return; + } + if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) { spin_lock(&sh->stripe_lock); /* Cannot process 'sync' concurrently with 'discard' */ @@ -4824,7 +5130,8 @@ static void make_request(struct mddev *mddev, struct bio * bi) } set_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); - if ((bi->bi_rw & REQ_SYNC) && + if ((!sh->batch_head || sh == sh->batch_head) && + (bi->bi_rw & REQ_SYNC) && !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) atomic_inc(&conf->preread_active_stripes); release_stripe_plug(mddev, sh); diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 4cc1a48127c7..c8d0004dca8f 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -219,6 +219,10 @@ struct stripe_head { spinlock_t stripe_lock; int cpu; struct r5worker_group *group; + + struct stripe_head *batch_head; /* protected by stripe lock */ + spinlock_t batch_lock; /* only header's lock is useful */ + struct list_head batch_list; /* protected by head's batch lock*/ /** * struct stripe_operations * @target - STRIPE_OP_COMPUTE_BLK target From 72ac733015bbdc0356ba3e92c52137a265910a91 Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:03 +1100 Subject: [PATCH 46/58] raid5: handle io error of batch list If io error happens in any stripe of a batch list, the batch list will be split, then normal process will run for the stripes in the list. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++ drivers/md/raid5.h | 1 + 2 files changed, 49 insertions(+) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 717189e74243..54f3cb312b42 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1070,6 +1070,9 @@ again: pr_debug("skip op %ld on disc %d for sector %llu\n", bi->bi_rw, i, (unsigned long long)sh->sector); clear_bit(R5_LOCKED, &sh->dev[i].flags); + if (sh->batch_head) + set_bit(STRIPE_BATCH_ERR, + &sh->batch_head->state); set_bit(STRIPE_HANDLE, &sh->state); } @@ -2380,6 +2383,9 @@ static void raid5_end_write_request(struct bio *bi, int error) } rdev_dec_pending(rdev, conf->mddev); + if (sh->batch_head && !uptodate) + set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state); + if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags)) clear_bit(R5_LOCKED, &sh->dev[i].flags); set_bit(STRIPE_HANDLE, &sh->state); @@ -4124,6 +4130,46 @@ static int clear_batch_ready(struct stripe_head *sh) return 0; } +static void check_break_stripe_batch_list(struct stripe_head *sh) +{ + struct stripe_head *head_sh, *next; + int i; + + if (!test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state)) + return; + + head_sh = sh; + do { + sh = list_first_entry(&sh->batch_list, + struct stripe_head, batch_list); + BUG_ON(sh == head_sh); + } while (!test_bit(STRIPE_DEGRADED, &sh->state)); + + while (sh != head_sh) { + next = list_first_entry(&sh->batch_list, + struct stripe_head, batch_list); + list_del_init(&sh->batch_list); + + sh->state = head_sh->state & ~((1 << STRIPE_ACTIVE) | + (1 << STRIPE_PREREAD_ACTIVE) | + (1 << STRIPE_DEGRADED)); + sh->check_state = head_sh->check_state; + sh->reconstruct_state = head_sh->reconstruct_state; + for (i = 0; i < sh->disks; i++) + sh->dev[i].flags = head_sh->dev[i].flags & + (~((1 << R5_WriteError) | (1 << R5_Overlap))); + + spin_lock_irq(&sh->stripe_lock); + sh->batch_head = NULL; + spin_unlock_irq(&sh->stripe_lock); + + set_bit(STRIPE_HANDLE, &sh->state); + release_stripe(sh); + + sh = next; + } +} + static void handle_stripe(struct stripe_head *sh) { struct stripe_head_state s; @@ -4146,6 +4192,8 @@ static void handle_stripe(struct stripe_head *sh) return; } + check_break_stripe_batch_list(sh); + if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) { spin_lock(&sh->stripe_lock); /* Cannot process 'sync' concurrently with 'discard' */ diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index c8d0004dca8f..cf3562e99440 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -336,6 +336,7 @@ enum { STRIPE_DISCARD, STRIPE_ON_RELEASE_LIST, STRIPE_BATCH_READY, + STRIPE_BATCH_ERR, }; /* From dabc4ec6ba72418ebca6bf1884f344bba40c8709 Mon Sep 17 00:00:00 2001 From: "shli@kernel.org" Date: Mon, 15 Dec 2014 12:57:04 +1100 Subject: [PATCH 47/58] raid5: handle expansion/resync case with stripe batching expansion/resync can grab a stripe when the stripe is in batch list. Since all stripes in batch list must be in the same state, we can't allow some stripes run into expansion/resync. So we delay expansion/resync for stripe in batch list. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown --- drivers/md/raid5.c | 24 ++++++++++++++++-------- drivers/md/raid5.h | 5 +++++ 2 files changed, 21 insertions(+), 8 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 54f3cb312b42..3ae097d50b51 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -3440,8 +3440,10 @@ unhash: struct stripe_head, batch_list); list_del_init(&sh->batch_list); - sh->state = head_sh->state & (~((1 << STRIPE_ACTIVE) | - (1 << STRIPE_PREREAD_ACTIVE))); + set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG, + head_sh->state & ~((1 << STRIPE_ACTIVE) | + (1 << STRIPE_PREREAD_ACTIVE) | + STRIPE_EXPAND_SYNC_FLAG)); sh->check_state = head_sh->check_state; sh->reconstruct_state = head_sh->reconstruct_state; for (i = 0; i < sh->disks; i++) { @@ -3453,6 +3455,8 @@ unhash: spin_lock_irq(&sh->stripe_lock); sh->batch_head = NULL; spin_unlock_irq(&sh->stripe_lock); + if (sh->state & STRIPE_EXPAND_SYNC_FLAG) + set_bit(STRIPE_HANDLE, &sh->state); release_stripe(sh); } @@ -3460,6 +3464,8 @@ unhash: head_sh->batch_head = NULL; spin_unlock_irq(&head_sh->stripe_lock); wake_up_nr(&conf->wait_for_overlap, wakeup_nr); + if (head_sh->state & STRIPE_EXPAND_SYNC_FLAG) + set_bit(STRIPE_HANDLE, &head_sh->state); } static void handle_stripe_dirtying(struct r5conf *conf, @@ -3927,8 +3933,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s) memset(s, 0, sizeof(*s)); - s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); - s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); + s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state) && !sh->batch_head; + s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head; s->failed_num[0] = -1; s->failed_num[1] = -1; @@ -4150,9 +4156,11 @@ static void check_break_stripe_batch_list(struct stripe_head *sh) struct stripe_head, batch_list); list_del_init(&sh->batch_list); - sh->state = head_sh->state & ~((1 << STRIPE_ACTIVE) | - (1 << STRIPE_PREREAD_ACTIVE) | - (1 << STRIPE_DEGRADED)); + set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG, + head_sh->state & ~((1 << STRIPE_ACTIVE) | + (1 << STRIPE_PREREAD_ACTIVE) | + (1 << STRIPE_DEGRADED) | + STRIPE_EXPAND_SYNC_FLAG)); sh->check_state = head_sh->check_state; sh->reconstruct_state = head_sh->reconstruct_state; for (i = 0; i < sh->disks; i++) @@ -4194,7 +4202,7 @@ static void handle_stripe(struct stripe_head *sh) check_break_stripe_batch_list(sh); - if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) { + if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) { spin_lock(&sh->stripe_lock); /* Cannot process 'sync' concurrently with 'discard' */ if (!test_bit(STRIPE_DISCARD, &sh->state) && diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index cf3562e99440..ee65ed844d3f 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -339,6 +339,11 @@ enum { STRIPE_BATCH_ERR, }; +#define STRIPE_EXPAND_SYNC_FLAG \ + ((1 << STRIPE_EXPAND_SOURCE) |\ + (1 << STRIPE_EXPAND_READY) |\ + (1 << STRIPE_EXPANDING) |\ + (1 << STRIPE_SYNC_REQUESTED)) /* * Operation request flags */ From fe5cbc6e06c7d8b3a86f6f5491d74766bb5c2827 Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:04 +1100 Subject: [PATCH 48/58] md/raid6 algorithms: delta syndrome functions v3: s-o-b comment, explanation of performance and descision for the start/stop implementation Implementing rmw functionality for RAID6 requires optimized syndrome calculation. Up to now we can only generate a complete syndrome. The target P/Q pages are always overwritten. With this patch we provide a framework for inplace P/Q modification. In the first place simply fill those functions with NULL values. xor_syndrome() has two additional parameters: start & stop. These will indicate the first and last page that are changing during a rmw run. That makes it possible to avoid several unneccessary loops and speed up calculation. The caller needs to implement the following logic to make the functions work. 1) xor_syndrome(disks, start, stop, ...): "Remove" all data of source blocks inside P/Q between (and including) start and end. 2) modify any block with start <= block <= stop 3) xor_syndrome(disks, start, stop, ...): "Reinsert" all data of source blocks into P/Q between (and including) start and end. Pages between start and stop that won't be changed should be filled with a pointer to the kernel zero page. The reasons for not taking NULL pages are: 1) Algorithms cross the whole source data line by line. Thus avoid additional branches. 2) Having a NULL page avoids calculating the XOR P parity but still need calulation steps for the Q parity. Depending on the algorithm unrolling that might be only a difference of 2 instructions per loop. The benchmark numbers of the gen_syndrome() functions are displayed in the kernel log. Do the same for the xor_syndrome() functions. This will help to analyze performance problems and give an rough estimate how well the algorithm works. The choice of the fastest algorithm will still depend on the gen_syndrome() performance. With the start/stop page implementation the speed can vary a lot in real life. E.g. a change of page 0 & page 15 on a stripe will be harder to compute than the case where page 0 & page 1 are XOR candidates. To be not to enthusiatic about the expected speeds we will run a worse case test that simulates a change on the upper half of the stripe. So we do: 1) calculation of P/Q for the upper pages 2) continuation of Q for the lower (empty) pages Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- include/linux/raid/pq.h | 1 + lib/raid6/algos.c | 41 ++++++++++++++++++++++++++++++++++------- lib/raid6/altivec.uc | 1 + lib/raid6/avx2.c | 3 +++ lib/raid6/int.uc | 3 ++- lib/raid6/mmx.c | 2 ++ lib/raid6/neon.c | 1 + lib/raid6/sse1.c | 2 ++ lib/raid6/sse2.c | 3 +++ lib/raid6/tilegx.uc | 1 + 10 files changed, 50 insertions(+), 8 deletions(-) diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h index 73069cb6c54a..a7a06d1dcf9c 100644 --- a/include/linux/raid/pq.h +++ b/include/linux/raid/pq.h @@ -72,6 +72,7 @@ extern const char raid6_empty_zero_page[PAGE_SIZE]; /* Routine choices */ struct raid6_calls { void (*gen_syndrome)(int, size_t, void **); + void (*xor_syndrome)(int, int, int, size_t, void **); int (*valid)(void); /* Returns 1 if this routine set is usable */ const char *name; /* Name of this routine set */ int prefer; /* Has special performance attribute */ diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c index dbef2314901e..975c6e0434bd 100644 --- a/lib/raid6/algos.c +++ b/lib/raid6/algos.c @@ -131,11 +131,12 @@ static inline const struct raid6_recov_calls *raid6_choose_recov(void) static inline const struct raid6_calls *raid6_choose_gen( void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks) { - unsigned long perf, bestperf, j0, j1; + unsigned long perf, bestgenperf, bestxorperf, j0, j1; + int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */ const struct raid6_calls *const *algo; const struct raid6_calls *best; - for (bestperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { + for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { if (!best || (*algo)->prefer >= best->prefer) { if ((*algo)->valid && !(*algo)->valid()) continue; @@ -153,19 +154,45 @@ static inline const struct raid6_calls *raid6_choose_gen( } preempt_enable(); - if (perf > bestperf) { - bestperf = perf; + if (perf > bestgenperf) { + bestgenperf = perf; best = *algo; } - pr_info("raid6: %-8s %5ld MB/s\n", (*algo)->name, + pr_info("raid6: %-8s gen() %5ld MB/s\n", (*algo)->name, (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); + + if (!(*algo)->xor_syndrome) + continue; + + perf = 0; + + preempt_disable(); + j0 = jiffies; + while ((j1 = jiffies) == j0) + cpu_relax(); + while (time_before(jiffies, + j1 + (1<xor_syndrome(disks, start, stop, + PAGE_SIZE, *dptrs); + perf++; + } + preempt_enable(); + + if (best == *algo) + bestxorperf = perf; + + pr_info("raid6: %-8s xor() %5ld MB/s\n", (*algo)->name, + (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); } } if (best) { - pr_info("raid6: using algorithm %s (%ld MB/s)\n", + pr_info("raid6: using algorithm %s gen() %ld MB/s\n", best->name, - (bestperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); + (bestgenperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); + if (best->xor_syndrome) + pr_info("raid6: .... xor() %ld MB/s, rmw enabled\n", + (bestxorperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); raid6_call = *best; } else pr_err("raid6: Yikes! No algorithm found!\n"); diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc index 7cc12b532e95..bec27fce7501 100644 --- a/lib/raid6/altivec.uc +++ b/lib/raid6/altivec.uc @@ -119,6 +119,7 @@ int raid6_have_altivec(void) const struct raid6_calls raid6_altivec$# = { raid6_altivec$#_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_altivec, "altivecx$#", 0 diff --git a/lib/raid6/avx2.c b/lib/raid6/avx2.c index bc3b1dd436eb..76734004358d 100644 --- a/lib/raid6/avx2.c +++ b/lib/raid6/avx2.c @@ -89,6 +89,7 @@ static void raid6_avx21_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_avx2x1 = { raid6_avx21_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_avx2, "avx2x1", 1 /* Has cache hints */ @@ -150,6 +151,7 @@ static void raid6_avx22_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_avx2x2 = { raid6_avx22_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_avx2, "avx2x2", 1 /* Has cache hints */ @@ -242,6 +244,7 @@ static void raid6_avx24_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_avx2x4 = { raid6_avx24_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_avx2, "avx2x4", 1 /* Has cache hints */ diff --git a/lib/raid6/int.uc b/lib/raid6/int.uc index 5b50f8dfc5d2..5ca60bee1388 100644 --- a/lib/raid6/int.uc +++ b/lib/raid6/int.uc @@ -109,7 +109,8 @@ static void raid6_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_intx$# = { raid6_int$#_gen_syndrome, - NULL, /* always valid */ + NULL, /* XOR not yet implemented */ + NULL, /* always valid */ "int" NSTRING "x$#", 0 }; diff --git a/lib/raid6/mmx.c b/lib/raid6/mmx.c index 590c71c9e200..b3b0e1fcd3af 100644 --- a/lib/raid6/mmx.c +++ b/lib/raid6/mmx.c @@ -76,6 +76,7 @@ static void raid6_mmx1_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_mmxx1 = { raid6_mmx1_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_mmx, "mmxx1", 0 @@ -134,6 +135,7 @@ static void raid6_mmx2_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_mmxx2 = { raid6_mmx2_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_mmx, "mmxx2", 0 diff --git a/lib/raid6/neon.c b/lib/raid6/neon.c index 36ad4705df1a..d9ad6ee284f4 100644 --- a/lib/raid6/neon.c +++ b/lib/raid6/neon.c @@ -42,6 +42,7 @@ } \ struct raid6_calls const raid6_neonx ## _n = { \ raid6_neon ## _n ## _gen_syndrome, \ + NULL, /* XOR not yet implemented */ \ raid6_have_neon, \ "neonx" #_n, \ 0 \ diff --git a/lib/raid6/sse1.c b/lib/raid6/sse1.c index f76297139445..9025b8ca9aa3 100644 --- a/lib/raid6/sse1.c +++ b/lib/raid6/sse1.c @@ -92,6 +92,7 @@ static void raid6_sse11_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_sse1x1 = { raid6_sse11_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_sse1_or_mmxext, "sse1x1", 1 /* Has cache hints */ @@ -154,6 +155,7 @@ static void raid6_sse12_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_sse1x2 = { raid6_sse12_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_sse1_or_mmxext, "sse1x2", 1 /* Has cache hints */ diff --git a/lib/raid6/sse2.c b/lib/raid6/sse2.c index 85b82c85f28e..31acd59a0ef7 100644 --- a/lib/raid6/sse2.c +++ b/lib/raid6/sse2.c @@ -90,6 +90,7 @@ static void raid6_sse21_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_sse2x1 = { raid6_sse21_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_sse2, "sse2x1", 1 /* Has cache hints */ @@ -152,6 +153,7 @@ static void raid6_sse22_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_sse2x2 = { raid6_sse22_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_sse2, "sse2x2", 1 /* Has cache hints */ @@ -250,6 +252,7 @@ static void raid6_sse24_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_sse2x4 = { raid6_sse24_gen_syndrome, + NULL, /* XOR not yet implemented */ raid6_have_sse2, "sse2x4", 1 /* Has cache hints */ diff --git a/lib/raid6/tilegx.uc b/lib/raid6/tilegx.uc index e7c29459cbcd..2dd291a11264 100644 --- a/lib/raid6/tilegx.uc +++ b/lib/raid6/tilegx.uc @@ -80,6 +80,7 @@ void raid6_tilegx$#_gen_syndrome(int disks, size_t bytes, void **ptrs) const struct raid6_calls raid6_tilegx$# = { raid6_tilegx$#_gen_syndrome, + NULL, /* XOR not yet implemented */ NULL, "tilegx$#", 0 From 7e92e1d7629b00578cef22b1f4c6ada726663701 Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:04 +1100 Subject: [PATCH 49/58] md/raid6 algorithms: improve test program It is always helpful to have a test tool in place if we implement new data critical algorithms. So add some test routines to the raid6 checker that can prove if the new xor_syndrome() works as expected. Run through all permutations of start/stop pages per algorithm and simulate a xor_syndrome() assisted rmw run. After each rmw check if the recovery algorithm still confirms that the stripe is fine. Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- lib/raid6/test/test.c | 51 ++++++++++++++++++++++++++++++------------- 1 file changed, 36 insertions(+), 15 deletions(-) diff --git a/lib/raid6/test/test.c b/lib/raid6/test/test.c index 5a485b7a7d3c..3bebbabdb510 100644 --- a/lib/raid6/test/test.c +++ b/lib/raid6/test/test.c @@ -28,11 +28,11 @@ char *dataptrs[NDISKS]; char data[NDISKS][PAGE_SIZE]; char recovi[PAGE_SIZE], recovj[PAGE_SIZE]; -static void makedata(void) +static void makedata(int start, int stop) { int i, j; - for (i = 0; i < NDISKS; i++) { + for (i = start; i <= stop; i++) { for (j = 0; j < PAGE_SIZE; j++) data[i][j] = rand(); @@ -91,34 +91,55 @@ int main(int argc, char *argv[]) { const struct raid6_calls *const *algo; const struct raid6_recov_calls *const *ra; - int i, j; + int i, j, p1, p2; int err = 0; - makedata(); + makedata(0, NDISKS-1); for (ra = raid6_recov_algos; *ra; ra++) { if ((*ra)->valid && !(*ra)->valid()) continue; + raid6_2data_recov = (*ra)->data2; raid6_datap_recov = (*ra)->datap; printf("using recovery %s\n", (*ra)->name); for (algo = raid6_algos; *algo; algo++) { - if (!(*algo)->valid || (*algo)->valid()) { - raid6_call = **algo; + if ((*algo)->valid && !(*algo)->valid()) + continue; - /* Nuke syndromes */ - memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE); + raid6_call = **algo; - /* Generate assumed good syndrome */ - raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, - (void **)&dataptrs); + /* Nuke syndromes */ + memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE); + + /* Generate assumed good syndrome */ + raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, + (void **)&dataptrs); + + for (i = 0; i < NDISKS-1; i++) + for (j = i+1; j < NDISKS; j++) + err += test_disks(i, j); + + if (!raid6_call.xor_syndrome) + continue; + + for (p1 = 0; p1 < NDISKS-2; p1++) + for (p2 = p1; p2 < NDISKS-2; p2++) { + + /* Simulate rmw run */ + raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE, + (void **)&dataptrs); + makedata(p1, p2); + raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE, + (void **)&dataptrs); + + for (i = 0; i < NDISKS-1; i++) + for (j = i+1; j < NDISKS; j++) + err += test_disks(i, j); + } - for (i = 0; i < NDISKS-1; i++) - for (j = i+1; j < NDISKS; j++) - err += test_disks(i, j); - } } printf("\n"); } From 9a5ce91d053961b7cc8fa56bd083819a9fc92734 Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:04 +1100 Subject: [PATCH 50/58] md/raid6 algorithms: xor_syndrome() for generic int Start the algorithms with the very basic one. It is left and right optimized. That means we can avoid all calculations for unneeded pages above the right stop offset. For pages below the left start offset we still need the syndrome multiplication but without reading data pages. Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- lib/raid6/int.uc | 40 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 39 insertions(+), 1 deletion(-) diff --git a/lib/raid6/int.uc b/lib/raid6/int.uc index 5ca60bee1388..558aeac9342a 100644 --- a/lib/raid6/int.uc +++ b/lib/raid6/int.uc @@ -107,9 +107,47 @@ static void raid6_int$#_gen_syndrome(int disks, size_t bytes, void **ptrs) } } +static void raid6_int$#_xor_syndrome(int disks, int start, int stop, + size_t bytes, void **ptrs) +{ + u8 **dptr = (u8 **)ptrs; + u8 *p, *q; + int d, z, z0; + + unative_t wd$$, wq$$, wp$$, w1$$, w2$$; + + z0 = stop; /* P/Q right side optimization */ + p = dptr[disks-2]; /* XOR parity */ + q = dptr[disks-1]; /* RS syndrome */ + + for ( d = 0 ; d < bytes ; d += NSIZE*$# ) { + /* P/Q data pages */ + wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; + for ( z = z0-1 ; z >= start ; z-- ) { + wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; + wp$$ ^= wd$$; + w2$$ = MASK(wq$$); + w1$$ = SHLBYTE(wq$$); + w2$$ &= NBYTES(0x1d); + w1$$ ^= w2$$; + wq$$ = w1$$ ^ wd$$; + } + /* P/Q left side optimization */ + for ( z = start-1 ; z >= 0 ; z-- ) { + w2$$ = MASK(wq$$); + w1$$ = SHLBYTE(wq$$); + w2$$ &= NBYTES(0x1d); + wq$$ = w1$$ ^ w2$$; + } + *(unative_t *)&p[d+NSIZE*$$] ^= wp$$; + *(unative_t *)&q[d+NSIZE*$$] ^= wq$$; + } + +} + const struct raid6_calls raid6_intx$# = { raid6_int$#_gen_syndrome, - NULL, /* XOR not yet implemented */ + raid6_int$#_xor_syndrome, NULL, /* always valid */ "int" NSTRING "x$#", 0 From a582564b24bec0443b5c5ff43ee6d1258f8bd658 Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:05 +1100 Subject: [PATCH 51/58] md/raid6 algorithms: xor_syndrome() for SSE2 The second and (last) optimized XOR syndrome calculation. This version supports right and left side optimization. All CPUs with architecture older than Haswell will benefit from it. It should be noted that SSE2 movntdq kills performance for memory areas that are read and written simultaneously in chunks smaller than cache line size. So use movdqa instead for P/Q writes in sse21 and sse22 XOR functions. Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- lib/raid6/sse2.c | 230 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 227 insertions(+), 3 deletions(-) diff --git a/lib/raid6/sse2.c b/lib/raid6/sse2.c index 31acd59a0ef7..1d2276b007ee 100644 --- a/lib/raid6/sse2.c +++ b/lib/raid6/sse2.c @@ -88,9 +88,58 @@ static void raid6_sse21_gen_syndrome(int disks, size_t bytes, void **ptrs) kernel_fpu_end(); } + +static void raid6_sse21_xor_syndrome(int disks, int start, int stop, + size_t bytes, void **ptrs) + { + u8 **dptr = (u8 **)ptrs; + u8 *p, *q; + int d, z, z0; + + z0 = stop; /* P/Q right side optimization */ + p = dptr[disks-2]; /* XOR parity */ + q = dptr[disks-1]; /* RS syndrome */ + + kernel_fpu_begin(); + + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); + + for ( d = 0 ; d < bytes ; d += 16 ) { + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); + asm volatile("pxor %xmm4,%xmm2"); + /* P/Q data pages */ + for ( z = z0-1 ; z >= start ; z-- ) { + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); + asm volatile("pxor %xmm5,%xmm2"); + asm volatile("pxor %xmm5,%xmm4"); + } + /* P/Q left side optimization */ + for ( z = start-1 ; z >= 0 ; z-- ) { + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pxor %xmm5,%xmm4"); + } + asm volatile("pxor %0,%%xmm4" : : "m" (q[d])); + /* Don't use movntdq for r/w memory area < cache line */ + asm volatile("movdqa %%xmm4,%0" : "=m" (q[d])); + asm volatile("movdqa %%xmm2,%0" : "=m" (p[d])); + } + + asm volatile("sfence" : : : "memory"); + kernel_fpu_end(); +} + const struct raid6_calls raid6_sse2x1 = { raid6_sse21_gen_syndrome, - NULL, /* XOR not yet implemented */ + raid6_sse21_xor_syndrome, raid6_have_sse2, "sse2x1", 1 /* Has cache hints */ @@ -151,9 +200,76 @@ static void raid6_sse22_gen_syndrome(int disks, size_t bytes, void **ptrs) kernel_fpu_end(); } + static void raid6_sse22_xor_syndrome(int disks, int start, int stop, + size_t bytes, void **ptrs) + { + u8 **dptr = (u8 **)ptrs; + u8 *p, *q; + int d, z, z0; + + z0 = stop; /* P/Q right side optimization */ + p = dptr[disks-2]; /* XOR parity */ + q = dptr[disks-1]; /* RS syndrome */ + + kernel_fpu_begin(); + + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); + + for ( d = 0 ; d < bytes ; d += 32 ) { + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); + asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16])); + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); + asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16])); + asm volatile("pxor %xmm4,%xmm2"); + asm volatile("pxor %xmm6,%xmm3"); + /* P/Q data pages */ + for ( z = z0-1 ; z >= start ; z-- ) { + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pxor %xmm7,%xmm7"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("pcmpgtb %xmm6,%xmm7"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("paddb %xmm6,%xmm6"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pand %xmm0,%xmm7"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); + asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16])); + asm volatile("pxor %xmm5,%xmm2"); + asm volatile("pxor %xmm7,%xmm3"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + } + /* P/Q left side optimization */ + for ( z = start-1 ; z >= 0 ; z-- ) { + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pxor %xmm7,%xmm7"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("pcmpgtb %xmm6,%xmm7"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("paddb %xmm6,%xmm6"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pand %xmm0,%xmm7"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + } + asm volatile("pxor %0,%%xmm4" : : "m" (q[d])); + asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16])); + /* Don't use movntdq for r/w memory area < cache line */ + asm volatile("movdqa %%xmm4,%0" : "=m" (q[d])); + asm volatile("movdqa %%xmm6,%0" : "=m" (q[d+16])); + asm volatile("movdqa %%xmm2,%0" : "=m" (p[d])); + asm volatile("movdqa %%xmm3,%0" : "=m" (p[d+16])); + } + + asm volatile("sfence" : : : "memory"); + kernel_fpu_end(); + } + const struct raid6_calls raid6_sse2x2 = { raid6_sse22_gen_syndrome, - NULL, /* XOR not yet implemented */ + raid6_sse22_xor_syndrome, raid6_have_sse2, "sse2x2", 1 /* Has cache hints */ @@ -250,9 +366,117 @@ static void raid6_sse24_gen_syndrome(int disks, size_t bytes, void **ptrs) kernel_fpu_end(); } + static void raid6_sse24_xor_syndrome(int disks, int start, int stop, + size_t bytes, void **ptrs) + { + u8 **dptr = (u8 **)ptrs; + u8 *p, *q; + int d, z, z0; + + z0 = stop; /* P/Q right side optimization */ + p = dptr[disks-2]; /* XOR parity */ + q = dptr[disks-1]; /* RS syndrome */ + + kernel_fpu_begin(); + + asm volatile("movdqa %0,%%xmm0" :: "m" (raid6_sse_constants.x1d[0])); + + for ( d = 0 ; d < bytes ; d += 64 ) { + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); + asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16])); + asm volatile("movdqa %0,%%xmm12" :: "m" (dptr[z0][d+32])); + asm volatile("movdqa %0,%%xmm14" :: "m" (dptr[z0][d+48])); + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); + asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16])); + asm volatile("movdqa %0,%%xmm10" : : "m" (p[d+32])); + asm volatile("movdqa %0,%%xmm11" : : "m" (p[d+48])); + asm volatile("pxor %xmm4,%xmm2"); + asm volatile("pxor %xmm6,%xmm3"); + asm volatile("pxor %xmm12,%xmm10"); + asm volatile("pxor %xmm14,%xmm11"); + /* P/Q data pages */ + for ( z = z0-1 ; z >= start ; z-- ) { + asm volatile("prefetchnta %0" :: "m" (dptr[z][d])); + asm volatile("prefetchnta %0" :: "m" (dptr[z][d+32])); + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pxor %xmm7,%xmm7"); + asm volatile("pxor %xmm13,%xmm13"); + asm volatile("pxor %xmm15,%xmm15"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("pcmpgtb %xmm6,%xmm7"); + asm volatile("pcmpgtb %xmm12,%xmm13"); + asm volatile("pcmpgtb %xmm14,%xmm15"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("paddb %xmm6,%xmm6"); + asm volatile("paddb %xmm12,%xmm12"); + asm volatile("paddb %xmm14,%xmm14"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pand %xmm0,%xmm7"); + asm volatile("pand %xmm0,%xmm13"); + asm volatile("pand %xmm0,%xmm15"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + asm volatile("pxor %xmm13,%xmm12"); + asm volatile("pxor %xmm15,%xmm14"); + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); + asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16])); + asm volatile("movdqa %0,%%xmm13" :: "m" (dptr[z][d+32])); + asm volatile("movdqa %0,%%xmm15" :: "m" (dptr[z][d+48])); + asm volatile("pxor %xmm5,%xmm2"); + asm volatile("pxor %xmm7,%xmm3"); + asm volatile("pxor %xmm13,%xmm10"); + asm volatile("pxor %xmm15,%xmm11"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + asm volatile("pxor %xmm13,%xmm12"); + asm volatile("pxor %xmm15,%xmm14"); + } + asm volatile("prefetchnta %0" :: "m" (q[d])); + asm volatile("prefetchnta %0" :: "m" (q[d+32])); + /* P/Q left side optimization */ + for ( z = start-1 ; z >= 0 ; z-- ) { + asm volatile("pxor %xmm5,%xmm5"); + asm volatile("pxor %xmm7,%xmm7"); + asm volatile("pxor %xmm13,%xmm13"); + asm volatile("pxor %xmm15,%xmm15"); + asm volatile("pcmpgtb %xmm4,%xmm5"); + asm volatile("pcmpgtb %xmm6,%xmm7"); + asm volatile("pcmpgtb %xmm12,%xmm13"); + asm volatile("pcmpgtb %xmm14,%xmm15"); + asm volatile("paddb %xmm4,%xmm4"); + asm volatile("paddb %xmm6,%xmm6"); + asm volatile("paddb %xmm12,%xmm12"); + asm volatile("paddb %xmm14,%xmm14"); + asm volatile("pand %xmm0,%xmm5"); + asm volatile("pand %xmm0,%xmm7"); + asm volatile("pand %xmm0,%xmm13"); + asm volatile("pand %xmm0,%xmm15"); + asm volatile("pxor %xmm5,%xmm4"); + asm volatile("pxor %xmm7,%xmm6"); + asm volatile("pxor %xmm13,%xmm12"); + asm volatile("pxor %xmm15,%xmm14"); + } + asm volatile("movntdq %%xmm2,%0" : "=m" (p[d])); + asm volatile("movntdq %%xmm3,%0" : "=m" (p[d+16])); + asm volatile("movntdq %%xmm10,%0" : "=m" (p[d+32])); + asm volatile("movntdq %%xmm11,%0" : "=m" (p[d+48])); + asm volatile("pxor %0,%%xmm4" : : "m" (q[d])); + asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16])); + asm volatile("pxor %0,%%xmm12" : : "m" (q[d+32])); + asm volatile("pxor %0,%%xmm14" : : "m" (q[d+48])); + asm volatile("movntdq %%xmm4,%0" : "=m" (q[d])); + asm volatile("movntdq %%xmm6,%0" : "=m" (q[d+16])); + asm volatile("movntdq %%xmm12,%0" : "=m" (q[d+32])); + asm volatile("movntdq %%xmm14,%0" : "=m" (q[d+48])); + } + asm volatile("sfence" : : : "memory"); + kernel_fpu_end(); + } + + const struct raid6_calls raid6_sse2x4 = { raid6_sse24_gen_syndrome, - NULL, /* XOR not yet implemented */ + raid6_sse24_xor_syndrome, raid6_have_sse2, "sse2x4", 1 /* Has cache hints */ From 584acdd49cd2472ca0f5a06adbe979db82d0b4af Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:05 +1100 Subject: [PATCH 52/58] md/raid5: activate raid6 rmw feature Glue it altogehter. The raid6 rmw path should work the same as the already existing raid5 logic. So emulate the prexor handling/flags and split functions as needed. 1) Enable xor_syndrome() in the async layer. 2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome at the start of a rmw run as we did it before for the single parity. 3) Take care of rmw run in ops_run_reconstruct6(). Again process only the changed pages to get syndrome back into sync. 4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw run. The lower layers will calculate start & end pages from that and call the xor_syndrome() correspondingly. 5) Adapt the several places where we ignored Q handling up to now. Performance numbers for a single E5630 system with a mix of 10 7200k desktop/server disks. 300 seconds random write with 8 threads onto a 3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4) bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0 skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0 4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s 8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s 16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s 32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s 64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s 128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s 256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s 512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- crypto/async_tx/async_pq.c | 19 +++++-- drivers/md/raid5.c | 104 +++++++++++++++++++++++++++---------- drivers/md/raid5.h | 19 ++++++- include/linux/async_tx.h | 3 ++ 4 files changed, 115 insertions(+), 30 deletions(-) diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c index d05327caf69d..5d355e0c2633 100644 --- a/crypto/async_tx/async_pq.c +++ b/crypto/async_tx/async_pq.c @@ -124,6 +124,7 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks, { void **srcs; int i; + int start = -1, stop = disks - 3; if (submit->scribble) srcs = submit->scribble; @@ -134,10 +135,21 @@ do_sync_gen_syndrome(struct page **blocks, unsigned int offset, int disks, if (blocks[i] == NULL) { BUG_ON(i > disks - 3); /* P or Q can't be zero */ srcs[i] = (void*)raid6_empty_zero_page; - } else + } else { srcs[i] = page_address(blocks[i]) + offset; + if (i < disks - 2) { + stop = i; + if (start == -1) + start = i; + } + } } - raid6_call.gen_syndrome(disks, len, srcs); + if (submit->flags & ASYNC_TX_PQ_XOR_DST) { + BUG_ON(!raid6_call.xor_syndrome); + if (start >= 0) + raid6_call.xor_syndrome(disks, start, stop, len, srcs); + } else + raid6_call.gen_syndrome(disks, len, srcs); async_tx_sync_epilog(submit); } @@ -178,7 +190,8 @@ async_gen_syndrome(struct page **blocks, unsigned int offset, int disks, if (device) unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO); - if (unmap && + /* XORing P/Q is only implemented in software */ + if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) && (src_cnt <= dma_maxpq(device, 0) || dma_maxpq(device, DMA_PREP_CONTINUE) > 0) && is_dma_pq_aligned(device, offset, 0, len)) { diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3ae097d50b51..c82ce1fd8723 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1317,7 +1317,9 @@ ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu) * destination buffer is recorded in srcs[count] and the Q destination * is recorded in srcs[count+1]]. */ -static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh) +static int set_syndrome_sources(struct page **srcs, + struct stripe_head *sh, + int srctype) { int disks = sh->disks; int syndrome_disks = sh->ddf_layout ? disks : (disks - 2); @@ -1332,8 +1334,15 @@ static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh) i = d0_idx; do { int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks); + struct r5dev *dev = &sh->dev[i]; - srcs[slot] = sh->dev[i].page; + if (i == sh->qd_idx || i == sh->pd_idx || + (srctype == SYNDROME_SRC_ALL) || + (srctype == SYNDROME_SRC_WANT_DRAIN && + test_bit(R5_Wantdrain, &dev->flags)) || + (srctype == SYNDROME_SRC_WRITTEN && + dev->written)) + srcs[slot] = sh->dev[i].page; i = raid6_next_disk(i, disks); } while (i != d0_idx); @@ -1373,7 +1382,7 @@ ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu) atomic_inc(&sh->count); if (target == qd_idx) { - count = set_syndrome_sources(blocks, sh); + count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL); blocks[count] = NULL; /* regenerating p is not necessary */ BUG_ON(blocks[count+1] != dest); /* q should already be set */ init_async_submit(&submit, ASYNC_TX_FENCE, NULL, @@ -1481,7 +1490,7 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu) tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit); - count = set_syndrome_sources(blocks, sh); + count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL); init_async_submit(&submit, ASYNC_TX_FENCE, tx, ops_complete_compute, sh, to_addr_conv(sh, percpu, 0)); @@ -1515,8 +1524,8 @@ static void ops_complete_prexor(void *stripe_head_ref) } static struct dma_async_tx_descriptor * -ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, - struct dma_async_tx_descriptor *tx) +ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu, + struct dma_async_tx_descriptor *tx) { int disks = sh->disks; struct page **xor_srcs = to_addr_page(percpu, 0); @@ -1544,6 +1553,26 @@ ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu, return tx; } +static struct dma_async_tx_descriptor * +ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu, + struct dma_async_tx_descriptor *tx) +{ + struct page **blocks = to_addr_page(percpu, 0); + int count; + struct async_submit_ctl submit; + + pr_debug("%s: stripe %llu\n", __func__, + (unsigned long long)sh->sector); + + count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN); + + init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx, + ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0)); + tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit); + + return tx; +} + static struct dma_async_tx_descriptor * ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) { @@ -1746,6 +1775,8 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, int count, i, j = 0; struct stripe_head *head_sh = sh; int last_stripe; + int synflags; + unsigned long txflags; pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector); @@ -1765,14 +1796,23 @@ ops_run_reconstruct6(struct stripe_head *sh, struct raid5_percpu *percpu, again: blocks = to_addr_page(percpu, j); - count = set_syndrome_sources(blocks, sh); + + if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) { + synflags = SYNDROME_SRC_WRITTEN; + txflags = ASYNC_TX_ACK | ASYNC_TX_PQ_XOR_DST; + } else { + synflags = SYNDROME_SRC_ALL; + txflags = ASYNC_TX_ACK; + } + + count = set_syndrome_sources(blocks, sh, synflags); last_stripe = !head_sh->batch_head || list_first_entry(&sh->batch_list, struct stripe_head, batch_list) == head_sh; if (last_stripe) { atomic_inc(&head_sh->count); - init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct, + init_async_submit(&submit, txflags, tx, ops_complete_reconstruct, head_sh, to_addr_conv(sh, percpu, j)); } else init_async_submit(&submit, 0, tx, NULL, NULL, @@ -1843,7 +1883,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu (unsigned long long)sh->sector, checkp); BUG_ON(sh->batch_head); - count = set_syndrome_sources(srcs, sh); + count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL); if (!checkp) srcs[count] = NULL; @@ -1884,8 +1924,12 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) async_tx_ack(tx); } - if (test_bit(STRIPE_OP_PREXOR, &ops_request)) - tx = ops_run_prexor(sh, percpu, tx); + if (test_bit(STRIPE_OP_PREXOR, &ops_request)) { + if (level < 6) + tx = ops_run_prexor5(sh, percpu, tx); + else + tx = ops_run_prexor6(sh, percpu, tx); + } if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) { tx = ops_run_biodrain(sh, tx); @@ -2770,7 +2814,7 @@ static void schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s, int rcw, int expand) { - int i, pd_idx = sh->pd_idx, disks = sh->disks; + int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks; struct r5conf *conf = sh->raid_conf; int level = conf->level; @@ -2806,13 +2850,15 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s, if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state)) atomic_inc(&conf->pending_full_writes); } else { - BUG_ON(level == 6); BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) || test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags))); + BUG_ON(level == 6 && + (!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) || + test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags)))); for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if (i == pd_idx) + if (i == pd_idx || i == qd_idx) continue; if (dev->towrite && @@ -3476,28 +3522,27 @@ static void handle_stripe_dirtying(struct r5conf *conf, int rmw = 0, rcw = 0, i; sector_t recovery_cp = conf->mddev->recovery_cp; - /* RAID6 requires 'rcw' in current implementation. - * Otherwise, check whether resync is now happening or should start. + /* Check whether resync is now happening or should start. * If yes, then the array is dirty (after unclean shutdown or * initial creation), so parity in some stripes might be inconsistent. * In this case, we need to always do reconstruct-write, to ensure * that in case of drive failure or read-error correction, we * generate correct data from the parity. */ - if (conf->max_degraded == 2 || + if (conf->rmw_level == PARITY_DISABLE_RMW || (recovery_cp < MaxSector && sh->sector >= recovery_cp && s->failed == 0)) { /* Calculate the real rcw later - for now make it * look like rcw is cheaper */ rcw = 1; rmw = 2; - pr_debug("force RCW max_degraded=%u, recovery_cp=%llu sh->sector=%llu\n", - conf->max_degraded, (unsigned long long)recovery_cp, + pr_debug("force RCW rmw_level=%u, recovery_cp=%llu sh->sector=%llu\n", + conf->rmw_level, (unsigned long long)recovery_cp, (unsigned long long)sh->sector); } else for (i = disks; i--; ) { /* would I have to read this buffer for read_modify_write */ struct r5dev *dev = &sh->dev[i]; - if ((dev->towrite || i == sh->pd_idx) && + if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) { @@ -3507,7 +3552,8 @@ static void handle_stripe_dirtying(struct r5conf *conf, rmw += 2*disks; /* cannot read it */ } /* Would I have to read this buffer for reconstruct_write */ - if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx && + if (!test_bit(R5_OVERWRITE, &dev->flags) && + i != sh->pd_idx && i != sh->qd_idx && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) { @@ -3520,7 +3566,7 @@ static void handle_stripe_dirtying(struct r5conf *conf, pr_debug("for sector %llu, rmw=%d rcw=%d\n", (unsigned long long)sh->sector, rmw, rcw); set_bit(STRIPE_HANDLE, &sh->state); - if (rmw < rcw && rmw > 0) { + if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_ENABLE_RMW)) && rmw > 0) { /* prefer read-modify-write, but need to get some data */ if (conf->mddev->queue) blk_add_trace_msg(conf->mddev->queue, @@ -3528,7 +3574,7 @@ static void handle_stripe_dirtying(struct r5conf *conf, (unsigned long long)sh->sector, rmw); for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if ((dev->towrite || i == sh->pd_idx) && + if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) && @@ -3547,7 +3593,7 @@ static void handle_stripe_dirtying(struct r5conf *conf, } } } - if (rcw <= rmw && rcw > 0) { + if ((rcw < rmw || (rcw == rmw && conf->rmw_level != PARITY_ENABLE_RMW)) && rcw > 0) { /* want reconstruct write, but need to get some data */ int qread =0; rcw = 0; @@ -6344,10 +6390,16 @@ static struct r5conf *setup_conf(struct mddev *mddev) } conf->level = mddev->new_level; - if (conf->level == 6) + if (conf->level == 6) { conf->max_degraded = 2; - else + if (raid6_call.xor_syndrome) + conf->rmw_level = PARITY_ENABLE_RMW; + else + conf->rmw_level = PARITY_DISABLE_RMW; + } else { conf->max_degraded = 1; + conf->rmw_level = PARITY_ENABLE_RMW; + } conf->algorithm = mddev->new_layout; conf->reshape_progress = mddev->reshape_position; if (conf->reshape_progress != MaxSector) { diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index ee65ed844d3f..57fef9ba36fa 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -355,6 +355,23 @@ enum { STRIPE_OP_RECONSTRUCT, STRIPE_OP_CHECK, }; + +/* + * RAID parity calculation preferences + */ +enum { + PARITY_DISABLE_RMW = 0, + PARITY_ENABLE_RMW, +}; + +/* + * Pages requested from set_syndrome_sources() + */ +enum { + SYNDROME_SRC_ALL, + SYNDROME_SRC_WANT_DRAIN, + SYNDROME_SRC_WRITTEN, +}; /* * Plugging: * @@ -411,7 +428,7 @@ struct r5conf { spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS]; struct mddev *mddev; int chunk_sectors; - int level, algorithm; + int level, algorithm, rmw_level; int max_degraded; int raid_disks; int max_nr_stripes; diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h index 179b38ffd351..388574ea38ed 100644 --- a/include/linux/async_tx.h +++ b/include/linux/async_tx.h @@ -60,12 +60,15 @@ struct dma_chan_ref { * dependency chain * @ASYNC_TX_FENCE: specify that the next operation in the dependency * chain uses this operation's result as an input + * @ASYNC_TX_PQ_XOR_DST: do not overwrite the syndrome but XOR it with the + * input data. Required for rmw case. */ enum async_tx_flags { ASYNC_TX_XOR_ZERO_DST = (1 << 0), ASYNC_TX_XOR_DROP_DST = (1 << 1), ASYNC_TX_ACK = (1 << 2), ASYNC_TX_FENCE = (1 << 3), + ASYNC_TX_PQ_XOR_DST = (1 << 4), }; /** From d06f191f8ecaef4d524e765fdb455f96392fbd42 Mon Sep 17 00:00:00 2001 From: Markus Stockhausen Date: Mon, 15 Dec 2014 12:57:05 +1100 Subject: [PATCH 53/58] md/raid5: introduce configuration option rmw_level Depending on the available coding we allow optimized rmw logic for write operations. To support easier testing this patch allows manual control of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level. The configuration can handle three levels of control. rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q calculation has no implementation path yet to factor in/out chunks of a syndrome. Enforcing this level can be benefical for slow CPUs with hardware syndrome support and fast SSDs. rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will save IOs. This equals the "old" unpatched behaviour and will be the default. rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are equal. We might have higher CPU consumption because of calculating the parity twice but it can be benefical otherwise. E.g. RAID4 with fast dedicated parity disk/SSD. The option is implemented just to be forward-looking and will ONLY work with this patch! Signed-off-by: Markus Stockhausen Signed-off-by: NeilBrown --- drivers/md/raid5.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ drivers/md/raid5.h | 1 + 2 files changed, 45 insertions(+) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index c82ce1fd8723..f78b1964543b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5879,6 +5879,49 @@ raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR, raid5_show_stripe_cache_size, raid5_store_stripe_cache_size); +static ssize_t +raid5_show_rmw_level(struct mddev *mddev, char *page) +{ + struct r5conf *conf = mddev->private; + if (conf) + return sprintf(page, "%d\n", conf->rmw_level); + else + return 0; +} + +static ssize_t +raid5_store_rmw_level(struct mddev *mddev, const char *page, size_t len) +{ + struct r5conf *conf = mddev->private; + unsigned long new; + + if (!conf) + return -ENODEV; + + if (len >= PAGE_SIZE) + return -EINVAL; + + if (kstrtoul(page, 10, &new)) + return -EINVAL; + + if (new != PARITY_DISABLE_RMW && !raid6_call.xor_syndrome) + return -EINVAL; + + if (new != PARITY_DISABLE_RMW && + new != PARITY_ENABLE_RMW && + new != PARITY_PREFER_RMW) + return -EINVAL; + + conf->rmw_level = new; + return len; +} + +static struct md_sysfs_entry +raid5_rmw_level = __ATTR(rmw_level, S_IRUGO | S_IWUSR, + raid5_show_rmw_level, + raid5_store_rmw_level); + + static ssize_t raid5_show_preread_threshold(struct mddev *mddev, char *page) { @@ -6065,6 +6108,7 @@ static struct attribute *raid5_attrs[] = { &raid5_preread_bypass_threshold.attr, &raid5_group_thread_cnt.attr, &raid5_skip_copy.attr, + &raid5_rmw_level.attr, NULL, }; static struct attribute_group raid5_attrs_group = { diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 57fef9ba36fa..6614ac5ffc0e 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -362,6 +362,7 @@ enum { enum { PARITY_DISABLE_RMW = 0, PARITY_ENABLE_RMW, + PARITY_PREFER_RMW, }; /* From a9683a795bcca6d0e7fe4c4c00e071218f3f4428 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Wed, 25 Feb 2015 12:02:51 +1100 Subject: [PATCH 54/58] md/raid5: pass gfp_t arg to grow_one_stripe() This is needed for future improvement to stripe cache management. Signed-off-by: NeilBrown --- drivers/md/raid5.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index f78b1964543b..ed8e34153c3d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -497,7 +497,7 @@ static void shrink_buffers(struct stripe_head *sh) } } -static int grow_buffers(struct stripe_head *sh) +static int grow_buffers(struct stripe_head *sh, gfp_t gfp) { int i; int num = sh->raid_conf->pool_size; @@ -505,7 +505,7 @@ static int grow_buffers(struct stripe_head *sh) for (i = 0; i < num; i++) { struct page *page; - if (!(page = alloc_page(GFP_KERNEL))) { + if (!(page = alloc_page(gfp))) { return 1; } sh->dev[i].page = page; @@ -1963,10 +1963,10 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) put_cpu(); } -static int grow_one_stripe(struct r5conf *conf, int hash) +static int grow_one_stripe(struct r5conf *conf, int hash, gfp_t gfp) { struct stripe_head *sh; - sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL); + sh = kmem_cache_zalloc(conf->slab_cache, gfp); if (!sh) return 0; @@ -1974,7 +1974,7 @@ static int grow_one_stripe(struct r5conf *conf, int hash) spin_lock_init(&sh->stripe_lock); - if (grow_buffers(sh)) { + if (grow_buffers(sh, gfp)) { shrink_buffers(sh); kmem_cache_free(conf->slab_cache, sh); return 0; @@ -2016,7 +2016,7 @@ static int grow_stripes(struct r5conf *conf, int num) conf->pool_size = devs; hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; while (num--) { - if (!grow_one_stripe(conf, hash)) + if (!grow_one_stripe(conf, hash, GFP_KERNEL)) return 1; conf->max_nr_stripes++; hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; @@ -5841,7 +5841,7 @@ raid5_set_cache_size(struct mddev *mddev, int size) return err; hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; while (size > conf->max_nr_stripes) { - if (grow_one_stripe(conf, hash)) + if (grow_one_stripe(conf, hash, GFP_KERNEL)) conf->max_nr_stripes++; else break; hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; From 486f0644c3482cbf64fe804499836e1f05abec14 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Wed, 25 Feb 2015 12:10:35 +1100 Subject: [PATCH 55/58] md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe() succeeds, do it inside the functions. Also choose the correct hash to handle next inside the functions. This removes duplication and will help with future new uses of {grow,drop}_one_stripe. This also fixes a minor bug where the "md/raid:%md: allocate XXkB" message always said "0kB". Signed-off-by: NeilBrown --- drivers/md/raid5.c | 57 +++++++++++++++++++--------------------------- 1 file changed, 24 insertions(+), 33 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index ed8e34153c3d..78ac7dc853c7 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1963,7 +1963,7 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request) put_cpu(); } -static int grow_one_stripe(struct r5conf *conf, int hash, gfp_t gfp) +static int grow_one_stripe(struct r5conf *conf, gfp_t gfp) { struct stripe_head *sh; sh = kmem_cache_zalloc(conf->slab_cache, gfp); @@ -1979,7 +1979,8 @@ static int grow_one_stripe(struct r5conf *conf, int hash, gfp_t gfp) kmem_cache_free(conf->slab_cache, sh); return 0; } - sh->hash_lock_index = hash; + sh->hash_lock_index = + conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; /* we just created an active stripe so... */ atomic_set(&sh->count, 1); atomic_inc(&conf->active_stripes); @@ -1989,6 +1990,7 @@ static int grow_one_stripe(struct r5conf *conf, int hash, gfp_t gfp) INIT_LIST_HEAD(&sh->batch_list); sh->batch_head = NULL; release_stripe(sh); + conf->max_nr_stripes++; return 1; } @@ -1996,7 +1998,6 @@ static int grow_stripes(struct r5conf *conf, int num) { struct kmem_cache *sc; int devs = max(conf->raid_disks, conf->previous_raid_disks); - int hash; if (conf->mddev->gendisk) sprintf(conf->cache_name[0], @@ -2014,13 +2015,10 @@ static int grow_stripes(struct r5conf *conf, int num) return 1; conf->slab_cache = sc; conf->pool_size = devs; - hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; - while (num--) { - if (!grow_one_stripe(conf, hash, GFP_KERNEL)) + while (num--) + if (!grow_one_stripe(conf, GFP_KERNEL)) return 1; - conf->max_nr_stripes++; - hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; - } + return 0; } @@ -2210,9 +2208,10 @@ static int resize_stripes(struct r5conf *conf, int newsize) return err; } -static int drop_one_stripe(struct r5conf *conf, int hash) +static int drop_one_stripe(struct r5conf *conf) { struct stripe_head *sh; + int hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS; spin_lock_irq(conf->hash_locks + hash); sh = get_free_stripe(conf, hash); @@ -2223,15 +2222,15 @@ static int drop_one_stripe(struct r5conf *conf, int hash) shrink_buffers(sh); kmem_cache_free(conf->slab_cache, sh); atomic_dec(&conf->active_stripes); + conf->max_nr_stripes--; return 1; } static void shrink_stripes(struct r5conf *conf) { - int hash; - for (hash = 0; hash < NR_STRIPE_HASH_LOCKS; hash++) - while (drop_one_stripe(conf, hash)) - ; + while (conf->max_nr_stripes && + drop_one_stripe(conf)) + ; if (conf->slab_cache) kmem_cache_destroy(conf->slab_cache); @@ -5822,30 +5821,22 @@ raid5_set_cache_size(struct mddev *mddev, int size) { struct r5conf *conf = mddev->private; int err; - int hash; if (size <= 16 || size > 32768) return -EINVAL; - hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS; - while (size < conf->max_nr_stripes) { - if (drop_one_stripe(conf, hash)) - conf->max_nr_stripes--; - else - break; - hash--; - if (hash < 0) - hash = NR_STRIPE_HASH_LOCKS - 1; - } + + while (size < conf->max_nr_stripes && + drop_one_stripe(conf)) + ; + err = md_allow_write(mddev); if (err) return err; - hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; - while (size > conf->max_nr_stripes) { - if (grow_one_stripe(conf, hash, GFP_KERNEL)) - conf->max_nr_stripes++; - else break; - hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; - } + + while (size > conf->max_nr_stripes) + if (!grow_one_stripe(conf, GFP_KERNEL)) + break; + return 0; } EXPORT_SYMBOL(raid5_set_cache_size); @@ -6451,7 +6442,7 @@ static struct r5conf *setup_conf(struct mddev *mddev) conf->prev_algo = mddev->layout; } - memory = conf->max_nr_stripes * (sizeof(struct stripe_head) + + memory = NR_STRIPES * (sizeof(struct stripe_head) + max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS); if (grow_stripes(conf, NR_STRIPES)) { From 5423399a84ee1d92d29d763029ed40e4905cf50f Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Thu, 26 Feb 2015 12:21:04 +1100 Subject: [PATCH 56/58] md/raid5: change ->inactive_blocked to a bit-flag. This allows us to easily add more (atomic) flags. Signed-off-by: NeilBrown --- drivers/md/raid5.c | 13 ++++++++----- drivers/md/raid5.h | 9 ++++++--- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 78ac7dc853c7..b7cd32e7f29e 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -672,20 +672,23 @@ get_active_stripe(struct r5conf *conf, sector_t sector, *(conf->hash_locks + hash)); sh = __find_stripe(conf, sector, conf->generation - previous); if (!sh) { - if (!conf->inactive_blocked) + if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) sh = get_free_stripe(conf, hash); if (noblock && sh == NULL) break; if (!sh) { - conf->inactive_blocked = 1; + set_bit(R5_INACTIVE_BLOCKED, + &conf->cache_state); wait_event_lock_irq( conf->wait_for_stripe, !list_empty(conf->inactive_list + hash) && (atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4) - || !conf->inactive_blocked), + || !test_bit(R5_INACTIVE_BLOCKED, + &conf->cache_state)), *(conf->hash_locks + hash)); - conf->inactive_blocked = 0; + clear_bit(R5_INACTIVE_BLOCKED, + &conf->cache_state); } else { init_stripe(sh, sector, previous); atomic_inc(&sh->count); @@ -4602,7 +4605,7 @@ static int raid5_congested(struct mddev *mddev, int bits) * how busy the stripe_cache is */ - if (conf->inactive_blocked) + if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) return 1; if (conf->quiesce) return 1; diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index 6614ac5ffc0e..ebe4e24bc14d 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -509,9 +509,11 @@ struct r5conf { struct llist_head released_stripes; wait_queue_head_t wait_for_stripe; wait_queue_head_t wait_for_overlap; - int inactive_blocked; /* release of inactive stripes blocked, - * waiting for 25% to be free - */ + unsigned long cache_state; +#define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked, + * waiting for 25% to be free + */ + int pool_size; /* number of disks in stripeheads in pool */ spinlock_t device_lock; struct disk_info *disks; @@ -526,6 +528,7 @@ struct r5conf { int worker_cnt_per_group; }; + /* * Our supported algorithms */ From edbe83ab4c27ea6669eb57adb5ed7eaec1118ceb Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Thu, 26 Feb 2015 12:47:56 +1100 Subject: [PATCH 57/58] md/raid5: allow the stripe_cache to grow and shrink. The default setting of 256 stripe_heads is probably much too small for many configurations. So it is best to make it auto-configure. Shrinking the cache under memory pressure is easy. The only interesting part here is that we put a fairly high cost ('seeks') on shrinking the cache as the cost is greater than just having to read more data, it reduces parallelism. Growing the cache on demand needs to be done carefully. If we allow fast growth, that can upset memory balance as lots of dirty memory can quickly turn into lots of memory queued in the stripe_cache. It is important for the raid5 block device to appear congested to allow write-throttling to work. So we only add stripes slowly. We set a flag when an allocation fails because all stripes are in use, allocate at a convenient time when that flag is set, and don't allow it to be set again until at least one stripe_head has been released for re-use. This means that a spurt of requests will only cause one stripe_head to be allocated, but a steady stream of requests will slowly increase the cache size - until memory pressure puts it back again. It could take hours to reach a steady state. The value written to, and displayed in, stripe_cache_size is used as a minimum. The cache can grow above this and shrink back down to it. The actual size is not directly visible, though it can be deduced to some extent by watching stripe_cache_active. Signed-off-by: NeilBrown --- drivers/md/raid5.c | 68 +++++++++++++++++++++++++++++++++++++++++----- drivers/md/raid5.h | 11 +++++++- 2 files changed, 71 insertions(+), 8 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b7cd32e7f29e..9716319cc477 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -672,8 +672,13 @@ get_active_stripe(struct r5conf *conf, sector_t sector, *(conf->hash_locks + hash)); sh = __find_stripe(conf, sector, conf->generation - previous); if (!sh) { - if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) + if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) { sh = get_free_stripe(conf, hash); + if (!sh && llist_empty(&conf->released_stripes) && + !test_bit(R5_DID_ALLOC, &conf->cache_state)) + set_bit(R5_ALLOC_MORE, + &conf->cache_state); + } if (noblock && sh == NULL) break; if (!sh) { @@ -5761,6 +5766,8 @@ static void raid5d(struct md_thread *thread) int batch_size, released; released = release_stripe_list(conf, conf->temp_inactive_list); + if (released) + clear_bit(R5_DID_ALLOC, &conf->cache_state); if ( !list_empty(&conf->bitmap_list)) { @@ -5799,6 +5806,13 @@ static void raid5d(struct md_thread *thread) pr_debug("%d stripes handled\n", handled); spin_unlock_irq(&conf->device_lock); + if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state)) { + grow_one_stripe(conf, __GFP_NOWARN); + /* Set flag even if allocation failed. This helps + * slow down allocation requests when mem is short + */ + set_bit(R5_DID_ALLOC, &conf->cache_state); + } async_tx_issue_pending_all(); blk_finish_plug(&plug); @@ -5814,7 +5828,7 @@ raid5_show_stripe_cache_size(struct mddev *mddev, char *page) spin_lock(&mddev->lock); conf = mddev->private; if (conf) - ret = sprintf(page, "%d\n", conf->max_nr_stripes); + ret = sprintf(page, "%d\n", conf->min_nr_stripes); spin_unlock(&mddev->lock); return ret; } @@ -5828,10 +5842,12 @@ raid5_set_cache_size(struct mddev *mddev, int size) if (size <= 16 || size > 32768) return -EINVAL; + conf->min_nr_stripes = size; while (size < conf->max_nr_stripes && drop_one_stripe(conf)) ; + err = md_allow_write(mddev); if (err) return err; @@ -5947,7 +5963,7 @@ raid5_store_preread_threshold(struct mddev *mddev, const char *page, size_t len) conf = mddev->private; if (!conf) err = -ENODEV; - else if (new > conf->max_nr_stripes) + else if (new > conf->min_nr_stripes) err = -EINVAL; else conf->bypass_threshold = new; @@ -6228,6 +6244,8 @@ static void raid5_free_percpu(struct r5conf *conf) static void free_conf(struct r5conf *conf) { + if (conf->shrinker.seeks) + unregister_shrinker(&conf->shrinker); free_thread_groups(conf); shrink_stripes(conf); raid5_free_percpu(conf); @@ -6295,6 +6313,30 @@ static int raid5_alloc_percpu(struct r5conf *conf) return err; } +static unsigned long raid5_cache_scan(struct shrinker *shrink, + struct shrink_control *sc) +{ + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); + int ret = 0; + while (ret < sc->nr_to_scan) { + if (drop_one_stripe(conf) == 0) + return SHRINK_STOP; + ret++; + } + return ret; +} + +static unsigned long raid5_cache_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); + + if (conf->max_nr_stripes < conf->min_nr_stripes) + /* unlikely, but not impossible */ + return 0; + return conf->max_nr_stripes - conf->min_nr_stripes; +} + static struct r5conf *setup_conf(struct mddev *mddev) { struct r5conf *conf; @@ -6445,10 +6487,11 @@ static struct r5conf *setup_conf(struct mddev *mddev) conf->prev_algo = mddev->layout; } - memory = NR_STRIPES * (sizeof(struct stripe_head) + + conf->min_nr_stripes = NR_STRIPES; + memory = conf->min_nr_stripes * (sizeof(struct stripe_head) + max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS); - if (grow_stripes(conf, NR_STRIPES)) { + if (grow_stripes(conf, conf->min_nr_stripes)) { printk(KERN_ERR "md/raid:%s: couldn't allocate %dkB for buffers\n", mdname(mddev), memory); @@ -6456,6 +6499,17 @@ static struct r5conf *setup_conf(struct mddev *mddev) } else printk(KERN_INFO "md/raid:%s: allocated %dkB\n", mdname(mddev), memory); + /* + * Losing a stripe head costs more than the time to refill it, + * it reduces the queue depth and so can hurt throughput. + * So set it rather large, scaled by number of devices. + */ + conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4; + conf->shrinker.scan_objects = raid5_cache_scan; + conf->shrinker.count_objects = raid5_cache_count; + conf->shrinker.batch = 128; + conf->shrinker.flags = 0; + register_shrinker(&conf->shrinker); sprintf(pers_name, "raid%d", mddev->new_level); conf->thread = md_register_thread(raid5d, mddev, pers_name); @@ -7097,9 +7151,9 @@ static int check_stripe_cache(struct mddev *mddev) */ struct r5conf *conf = mddev->private; if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4 - > conf->max_nr_stripes || + > conf->min_nr_stripes || ((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4 - > conf->max_nr_stripes) { + > conf->min_nr_stripes) { printk(KERN_WARNING "md/raid:%s: reshape: not enough stripes. Needed %lu\n", mdname(mddev), ((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9) diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h index ebe4e24bc14d..7dc0dd86074b 100644 --- a/drivers/md/raid5.h +++ b/drivers/md/raid5.h @@ -433,6 +433,7 @@ struct r5conf { int max_degraded; int raid_disks; int max_nr_stripes; + int min_nr_stripes; /* reshape_progress is the leading edge of a 'reshape' * It has value MaxSector when no reshape is happening @@ -513,7 +514,15 @@ struct r5conf { #define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked, * waiting for 25% to be free */ - +#define R5_ALLOC_MORE 2 /* It might help to allocate another + * stripe. + */ +#define R5_DID_ALLOC 4 /* A stripe was allocated, don't allocate + * more until at least one has been + * released. This avoids flooding + * the cache. + */ + struct shrinker shrinker; int pool_size; /* number of disks in stripeheads in pool */ spinlock_t device_lock; struct disk_info *disks; From 9ffc8f7cb9647b13dfe4d1ad0d5e1427bb8b46d6 Mon Sep 17 00:00:00 2001 From: Eric Mei Date: Wed, 18 Mar 2015 23:39:11 -0600 Subject: [PATCH 58/58] md/raid5: don't do chunk aligned read on degraded array. When array is degraded, read data landed on failed drives will result in reading rest of data in a stripe. So a single sequential read would result in same data being read twice. This patch is to avoid chunk aligned read for degraded array. The downside is to involve stripe cache which means associated CPU overhead and extra memory copy. Test Results: Following test are done on a enterprise storage node with Seagate 6T SAS drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2, chunk size 128 KiB. I use FIO, using direct-io with various bs size, enough queue depth, tested sequential and 100% random read against 3 array config: 1) optimal, as baseline; 2) degraded; 3) degraded with this patch. Kernel version is 4.0-rc3. Each individual test I only did once so there might be some variations, but we just focus on big trend. Sequential Read: bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s) 1024 1608 656 995 512 1624 710 956 256 1635 728 980 128 1636 771 983 64 1612 1119 1000 32 1580 1420 1004 16 1368 688 986 8 768 647 953 4 411 413 850 Random Read: bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS) 1024 163 160 156 512 274 273 272 256 426 428 424 128 576 592 591 64 726 724 726 32 849 848 837 16 900 970 971 8 927 940 929 4 948 940 955 Some notes: * In sequential + optimal, as bs size getting smaller, the FIO thread become CPU bound. * In sequential + degraded, there's big increase when bs is 64K and 32K, I don't have explanation. * In sequential + degraded-with-patch, the MD thread mostly become CPU bound. If you want to we can discuss specific data point in those data. But in general it seems with this patch, we have more predictable and in most cases significant better sequential read performance when array is degraded, and almost no noticeable impact on random read. Performance is a complicated thing, the patch works well for this particular configuration, but may not be universal. For example I imagine testing on all SSD array may have very different result. But I personally think in most cases IO bandwidth is more scarce resource than CPU. Signed-off-by: Eric Mei Signed-off-by: NeilBrown --- drivers/md/raid5.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 9716319cc477..77dfd720aaa0 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4632,8 +4632,12 @@ static int raid5_mergeable_bvec(struct mddev *mddev, unsigned int chunk_sectors = mddev->chunk_sectors; unsigned int bio_sectors = bvm->bi_size >> 9; - if ((bvm->bi_rw & 1) == WRITE) - return biovec->bv_len; /* always allow writes to be mergeable */ + /* + * always allow writes to be mergeable, read as well if array + * is degraded as we'll go through stripe cache anyway. + */ + if ((bvm->bi_rw & 1) == WRITE || mddev->degraded) + return biovec->bv_len; if (mddev->new_chunk_sectors < mddev->chunk_sectors) chunk_sectors = mddev->new_chunk_sectors; @@ -5110,7 +5114,12 @@ static void make_request(struct mddev *mddev, struct bio * bi) md_write_start(mddev, bi); - if (rw == READ && + /* + * If array is degraded, better not do chunk aligned read because + * later we might have to read it again in order to reconstruct + * data on failed drives. + */ + if (rw == READ && mddev->degraded == 0 && mddev->reshape_position == MaxSector && chunk_aligned_read(mddev,bi)) return;