1
0
Fork 0
Commit Graph

649 Commits (1eff9d322a444245c67515edb52bc0eb68374aa8)

Author SHA1 Message Date
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Shaohua Li 3f35e210ed Merge branch 'mymd/for-next' into mymd/for-linus 2016-07-28 09:34:14 -07:00
Christoph Hellwig 70246286e9 block: get rid of bio_rw and READA
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces.  For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense.  Any check for READA is replaced with an
explicit check for REQ_RAHEAD.  Also remove the READA alias for
REQ_RAHEAD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:37:01 -06:00
NeilBrown d787be4092 md: reduce the number of synchronize_rcu() calls when multiple devices fail.
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.

As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices.  Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.

fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:22 -07:00
NeilBrown f5b67ae86e md: be extra careful not to take a reference to a Faulty device.
It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.

Some places take a new reference after only check In_sync.  This should
be safe as the two are changed together.  However to make the code more
obviously safe, add checks for 'Faulty' as well.

Note: the actual rule is:
  Never increment nr_pending if  Faulty is set and Blocked is clear,
  never clear Faulty, and never set Blocked without holding a reference
  through nr_pending.

fix build error (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:21 -07:00
NeilBrown 5fd133511d md/raid5: add rcu protection to rdev accesses in raid5_status.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:20 -07:00
NeilBrown 3f232d6a95 md/raid5: add rcu protection to rdev accesses in want_replace
Being in the middle of resync is no longer protection against failed
rdevs disappearing.  So add rcu protection.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:19 -07:00
NeilBrown e50d399232 md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-06-13 11:54:19 -07:00
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 6296b9604f block, drivers, fs: shrink bi_rw from long to int
We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 796a5cf083 md: use bio op accessors
Separate the op from the rq_flag_bits and have md
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Song Liu 4125758074 right meaning of PARITY_ENABLE_RMW and PARITY_PREFER_RMW
In current handle_stripe_dirtying, the code prefers rmw with
PARITY_ENABLE_RMW; while prefers rcw with PARITY_PREFER_RMW.

This patch reverses this behavior.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-25 21:26:07 -07:00
Guoqing Jiang 85ad1d13ee md: set MD_CHANGE_PENDING in a atomic region
Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are atomicial, and
bit_clear_unless (suggested by Neil) is introduced for the purpose.

Cc: Martin Kepplinger <martink@posteo.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: <linux-kernel@vger.kernel.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:02 -07:00
Heinz Mauelshagen fe67d19a2d md: raid5: add prerequisite to run underneath dm-raid
In case md runs underneath the dm-raid target, the mddev does not have
a request queue or gendisk, thus avoid accesses.

This patch adds a missing conditional to the raid5 personality.

Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-05-09 09:24:02 -07:00
Shaohua Li b8a0b8e946 raid5: delete unnecessary warnning
If device has R5_LOCKED set, it's legit device has R5_SkipCopy set and page !=
orig_page. After R5_LOCKED is clear, handle_stripe_clean_event will clear the
SkipCopy flag and set page to orig_page. So the warning is unnecessary.

Reported-by: Joey Liao <joeyliao@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-04-29 14:18:03 -07:00
Anna-Maria Gleixner 1d034e68e2 md/raid5: Cleanup cpu hotplug notifier
The raid456_cpu_notify() hotplug callback lacks handling of the
CPU_UP_CANCELED case. That means if CPU_UP_PREPARE fails, the scratch
buffer is leaked.

Add handling for CPU_UP_CANCELED[_FROZEN] hotplug notifier transitions
to free the scratch buffer.

CC: Shaohua Li <shli@kernel.org>
CC: linux-raid@vger.kernel.org
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-17 14:30:15 -07:00
Shaohua Li fb3229d5cd md/raid5: output stripe state for debug
Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be
quite convenient if we know the stripe state. This is what this patch does.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 10:08:38 -08:00
NeilBrown 550da24f8d md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list
break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.

It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.

However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another.  This does occasionally happen.

md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this could becomes incorrect and will never again return to zero.

md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero.  So when the above mention race
happens, those stripe_heads become blocked and never progress,
resulting is write to the array handing.

So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
Tested-by: Tom Weber <linux@junkyard.4t2.com>
Fixes: 1b956f7a8f ("md/raid5: be more selective about distributing flags across batch.")
Cc: stable@vger.kernel.org (v4.1 and later)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-03-09 09:31:41 -08:00
Shaohua Li 6ab2a4b806 RAID5: revert e9e4c377e2 to fix a livelock
Revert commit
e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe)

The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
  max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
  wait_for_stripe[0]
The system live locks. The problem is active_stripes isn't a per-hash
count. Revert the patch makes the live lock go away.

Cc: stable@vger.kernel.org (v4.2+)
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:56 -08:00
Shaohua Li 27a353c026 RAID5: check_reshape() shouldn't call mddev_suspend
check_reshape() is called from raid5d thread. raid5d thread shouldn't
call mddev_suspend(), because mddev_suspend() waits for all IO finish
but IO is handled in raid5d thread, we could easily deadlock here.

This issue is introduced by
738a273 ("md/raid5: fix allocation of 'scribble' array.")

Cc: stable@vger.kernel.org (v4.1+)
Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-26 09:44:11 -08:00
Jes Sorensen e7597e69de md/raid5: Compare apples to apples (or sectors to sectors)
'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.

Fixes: 620125f2bf ("MD: raid5 trim support")
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-02-25 16:38:53 -08:00
Shaohua Li 849674e4fb MD: rename some functions
These short function names are hard to search. Rename them to make vim happy.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-20 13:52:20 -08:00
Shaohua Li f6b6ec5cfa raid5-cache: add journal hot add/remove support
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:57 +11:00
Roman Gushchin b46020aa3a md/raid5: remove redundant check in stripe_add_to_batch_list()
The stripe_add_to_batch_list() function is called only if
stripe_can_batch() returned true, so there is no need for double check.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:22 +11:00
Linus Torvalds ac322de6bf md updates for 4.4.
Two major components to this update.
 
 1/ the clustered-raid1 support from SUSE is nearly
   complete.  There are a few outstanding issues being
   worked on.  Maybe half a dozen patches will bring
   this to a usable state.
 
 2/ The first stage of journalled-raid5 support from
    Facebook makes an appearance.  With a journal
    device configured (typically NVRAM or SSD), the
    "RAID5 write hole" should be closed - a crash
    during degraded operations cannot result in data
    corruption.
 
    The next stage will be to use the journal as a
    write-behind cache so that latency can be reduced
    and in some cases throughput increased by
    performing more full-stripe writes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU
 2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/
 lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1
 MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af
 oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o
 73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0
 /BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F
 rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L
 Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI
 hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG
 XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/
 Gh2/tKn7Xc25HuLAFEbs
 =TD5o
 -----END PGP SIGNATURE-----

Merge tag 'md/4.4' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Two major components to this update.

   1) The clustered-raid1 support from SUSE is nearly complete.  There
      are a few outstanding issues being worked on.  Maybe half a dozen
      patches will bring this to a usable state.

   2) The first stage of journalled-raid5 support from Facebook makes an
      appearance.  With a journal device configured (typically NVRAM or
      SSD), the "RAID5 write hole" should be closed - a crash during
      degraded operations cannot result in data corruption.

      The next stage will be to use the journal as a write-behind cache
      so that latency can be reduced and in some cases throughput
      increased by performing more full-stripe writes.

* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  MD: set journal disk ->raid_disk
  MD: kick out journal disk if it's not fresh
  raid5-cache: start raid5 readonly if journal is missing
  MD: add new bit to indicate raid array with journal
  raid5-cache: IO error handling
  raid5: journal disk can't be removed
  raid5-cache: add trim support for log
  MD: fix info output for journal disk
  raid5-cache: use bio chaining
  raid5-cache: small log->seq cleanup
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: refactor bio allocation
  raid5-cache: clean up r5l_get_meta
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: free I/O units earlier
  ...
2015-11-04 21:12:47 -08:00
Shaohua Li f2076e7d06 MD: set journal disk ->raid_disk
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 7dde2ad3c5 raid5-cache: start raid5 readonly if journal is missing
If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 6e74a9cfb5 raid5-cache: IO error handling
There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li c2bb6242ec raid5: journal disk can't be removed
raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places.
Don't allow journal disk disappear magically. On the other hand, we do
need to update superblock for other disks to bump up ->events, so next
time journal disk will be identified as stale.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li e6c033f79a raid5-cache: move reclaim stop to quiesce
Move reclaim stop to quiesce handling, where is safer for this stuff.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li 828cbe989e raid5-cache: optimize FLUSH IO with log enabled
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li a8c34f9159 raid5-cache: switching to state machine for log disk cache flush
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 5c7e81c3de raid5: enable log for raid array with cache disk
Now log is safe to enable for raid array with cache disk

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 713cf5a639 raid5: don't allow resize/reshape with cache(log) support
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 9c3e333d3f raid5: disable batch with log enabled
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Roman Gushchin b8a9d66d04 md/raid5: fix locking in handle_stripe_clean_event()
After commit 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
__find_stripe() is called under conf->hash_locks + hash.
But handle_stripe_clean_event() calls remove_hash() under
conf->device_lock.

Under some cirscumstances the hash chain can be circuited,
and we get an infinite loop with disabled interrupts and locked hash
lock in __find_stripe(). This leads to hard lockup on multiple CPUs
and following system crash.

I was able to reproduce this behavior on raid6 over 6 ssd disks.
The devices_handle_discard_safely option should be set to enable trim
support. The following script was used:

for i in `seq 1 32`; do
    dd if=/dev/zero of=large$i bs=10M count=100 &
done

neilb: original was against a 3.x kernel.  I forward-ported
  to 4.3-rc.  This verison is suitable for any kernel since
  Commit: 59fc630b8b ("RAID5: batch adjacent full stripe write")
  (v4.1+).  I'll post a version for earlier kernels to stable.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Fixes: 566c09c534 ("raid5: relieve lock contention in get_active_stripe()")
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org> # 3.13 - 4.2
2015-10-31 10:53:50 +11:00
Shaohua Li 0576b1c618 raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:

1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused

In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.

It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.

Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.

Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.

There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.

The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.

Includes fixes to problems which were...
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li f6bed0ef0a raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.

The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.

For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.

For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.

flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li b70abcb247 raid5: add a new state for stripe log handling
When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:19 +11:00
Shaohua Li 6d036f7d52 raid5: export some functions
Next several patches use some raid5 functions, rename them with raid5
prefix and export out.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24 17:16:18 +11:00
Goldwyn Rodrigues c40f341f1e md-cluster: Use a small window for resync
Suspending the entire device for resync could take too long. Resync
in small chunks.

cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
   list.

bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-12 01:32:05 -05:00
Julia Lawall 644df1a85f md: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:44 +10:00
NeilBrown 36707bb2e7 md/raid5: don't index beyond end of array in need_this_block().
When need_this_block probably shouldn't be called when there
are more than 2 failed devices, we really don't want it to try
indexing beyond the end of the failed_num[] of fdev[] arrays.

So limit the loops to at most 2 iterations.

Reported-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-02 17:23:43 +10:00
Shaohua Li ebda780bce raid5: update analysis state for failed stripe
handle_failed_stripe() makes the stripe fail, eg, all IO will return
with a failure, but it doesn't update stripe_head_state. Later
handle_stripe() has special handling for raid6 for handle_stripe_fill().
That check before handle_stripe_fill() doesn't skip the failed stripe
and we get a kernel crash in need_this_block.  This patch clear the
analysis state to make sure no functions wrongly called after
handle_failed_stripe()

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-02 17:23:43 +10:00
NeilBrown e89c6fdf9e Merge linux-block/for-4.3/core into md/for-linux
There were a few conflicts that are fairly easy to resolve.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-05 11:08:32 +02:00
Linus Torvalds 1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
NeilBrown c3cce6cda1 md/raid5: ensure device failure recorded before write request returns.
When a write to one of the devices of a RAID5/6 fails, the failure is
recorded in the metadata of the other devices so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that completed when MD_CHANGE_PENDING is set to
   only be processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.


Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:59 +02:00
NeilBrown 34a6f80e16 md/raid5: use bio_list for the list of bios to return.
This will make it easier to splice two lists together which will
be needed in future patch.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:43:50 +02:00
NeilBrown 6cbd81487f md/raid5: handle possible race as reshape completes.
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.

This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.

This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun.  But assuming MaxSector is a valid address only
leads to sorrow.

So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:59 +02:00
NeilBrown c5e19d906a md: be careful when testing resync_max against curr_resync_completed.
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
   curr_resync_completed > resync_max

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:33 +02:00