1
0
Fork 0
Commit Graph

4183 Commits (eaaa7ec71bff4cb34d9025ed89068d4b3cac3df0)

Author SHA1 Message Date
Mike Snitzer 2eff1924e1 dm mpath: cleanup 'struct dm_mpath_io' management code
Refactor and rename existing interfaces to be more specific and
self-documenting.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:39 -05:00
Mike Snitzer 8637a6bf14 dm mpath: use blk-mq pdu for per-request 'struct dm_mpath_io'
Allow the multipath target to avoid making small allocations for each
'struct dm_mpath_io' that is needed for each request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:38 -05:00
Mike Snitzer 591ddcfc4b dm: allow immutable request-based targets to use blk-mq pdu
This will allow DM multipath to use a portion of the blk-mq pdu space
for target data (e.g. struct dm_mpath_io).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer 30187e1d48 dm: rename target's per_bio_data_size to per_io_data_size
Request-based DM will also make use of per_bio_data_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:37 -05:00
Mike Snitzer eca7ee6dc0 dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DM
Rename various methods to have either a "dm_old" or "dm_mq" prefix.
Improve code comments to assist with understanding the duality of code
that handles both "dm_old" and "dm_mq" cases.

It is no much easier to quickly look at the code and _know_ that a given
method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:34:33 -05:00
Mike Snitzer c5248f79f3 dm: remove support for stacking dm-mq on .request_fn device(s)
Remove all fiddley code that propped up this support for a blk-mq
request-queue ontop of all .request_fn devices.

Testing has proven this niche request-based dm-mq mode to be buggy, when
testing fault tolerance with DM multipath, and there is no point trying
to preserve it.

Should help improve efficiency of pure dm-mq code and make code
maintenance less delicate.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:46 -05:00
Mike Snitzer 818c5f3bef dm: fix a couple locking issues with use of block interfaces
old_stop_queue() was checking blk_queue_stopped() without holding the
q->queue_lock.

dm_requeue_original_request() needed to check blk_queue_stopped(), with
q->queue_lock held, before calling blk_mq_kick_requeue_list().  And a
side-effect of that change is start_queue() must also call
blk_mq_kick_requeue_list().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 22:33:09 -05:00
Mike Snitzer 1c357a1e86 dm: allocate blk_mq_tag_set rather than embed in mapped_device
The blk_mq_tag_set is only needed for dm-mq support.  There is point
wasting space in 'struct mapped_device' for non-dm-mq devices.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return
2016-02-22 12:07:14 -05:00
Mike Snitzer faad87df4b dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params
Allow user to change these values via module params or sysfs.

'dm_mq_nr_hw_queues' defaults to 1 (max 32).

'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too
small under moderate sized workloads -- the dm-multipath device would
continuously block waiting for tags (requests) to become available).
The maximum is BLK_MQ_MAX_DEPTH (currently 10240).

Keep in mind the total number of pre-allocated requests per
request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth'
(currently 2048).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 12:07:10 -05:00
Mike Snitzer c91852ff08 dm: optimize dm_request_fn()
DM multipath is the only request-based DM target -- which only supports
tables with a single target that is immutable.  Leverage this fact in
dm_request_fn().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Mike Snitzer 16f122661d dm: optimize dm_mq_queue_rq()
DM multipath is the only dm-mq target.  But that aside, request-based DM
only supports tables with a single target that is immutable.  Leverage
this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
the mapped_device when the table was made active.  This saves the need
to even take the read-side of the SRCU via dm_{get,put}_live_table.

If the active DM table does not have an immutable target (e.g. "error"
target was swapped in) then fallback to the slow-path where the target
is looked up from the live table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:22 -05:00
Mike Snitzer f083b09b78 dm: set DM_TARGET_WILDCARD feature on "error" target
The DM_TARGET_WILDCARD feature indicates that the "error" target may
replace any target; even immutable targets.  This feature will be useful
to preserve the ability to replace the "multipath" target even once it
is formally converted over to having the DM_TARGET_IMMUTABLE feature.

Also, implicit in the DM_TARGET_WILDCARD feature flag being set is that
.map, .map_rq, .clone_and_map_rq and .release_clone_rq are all defined
in the target_type.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:21 -05:00
Mike Snitzer e522c03905 dm: cleanup dm_any_congested()
The request-based DM support for checking queue congestion doesn't
require access to the live DM table.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00
Mike Snitzer ae6ad75e5c dm: remove unused dm_get_rq_mapinfo()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-22 11:06:20 -05:00
Mike Snitzer 6acfe68bac dm: fix excessive dm-mq context switching
Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
2016-02-22 11:04:40 -05:00
Mike Snitzer 956a402580 dm: fix sparse "unexpected unlock" warnings in ioctl code
Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it
do the dm_{get,put}_live_table() rather than split those operations.

The dm_grab_bdev_for_ioctl() callers only care about the block_device
associated with a singleton DM device so there isn't any need to retain
a reference to the live DM table.  It is sufficient to:
1) dm_get_live_table()
2) bdgrab() the bdev associated with the singleton table's target
3) dm_put_live_table()
4) bdput() the bdev

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:51 -05:00
Mike Snitzer 664820265d dm: do not return target from dm_get_live_table_for_ioctl()
None of the callers actually used the returned target.
Also, just reuse bdev pointer passed to dm_blk_ioctl().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:51 -05:00
Mike Snitzer 4328daa2e7 dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths
Using request-based DM mpath configured with the following stacking
(.request_fn DM mpath ontop of scsi-mq paths):

echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq

'struct dm_rq_target_io' would leak if a request is requeued before a
blk-mq clone is allocated (or fails to allocate).  free_rq_tio()
wasn't being called.

kmemleak reported:

unreferenced object 0xffff8800b90b98c0 (size 112):
  comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s)
  hex dump (first 32 bytes):
    00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff  ..\,....@.......
    e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00  ...4............
  backtrace:
    [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0
    [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20
    [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170
    [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod]
    [<ffffffff812fbd78>] blk_peek_request+0x168/0x290
    [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod]
    [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40
    [<ffffffff812f9585>] blk_delay_work+0x25/0x40
    [<ffffffff81096fff>] process_one_work+0x14f/0x3d0
    [<ffffffff81097715>] worker_thread+0x125/0x4b0
    [<ffffffff8109ce88>] kthread+0xd8/0xf0
    [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff

crash> struct -o dm_rq_target_io
struct dm_rq_target_io {
    ...
}
SIZE: 112

Fixes: e5863d9ad7 ("dm: allocate requests in target when stacking on blk-mq devices")
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-02-21 20:27:50 -05:00
Shaohua Li 9ea064158f Merge branch 'mymd/for-next' into mymd/for-linus 2016-02-03 15:43:59 -08:00
Herbert Xu bbdb23b5d6 dm crypt: Use skcipher and ahash
This patch replaces uses of ablkcipher with skcipher, and the long
obsolete hash interface with ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27 20:35:48 +08:00
Shaohua Li fc2561ec0a md-cluster: delete useless code
page->index already considers node offset. The node_offset calculation
in write_sb_page is useless and confusion.

Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-24 18:13:37 -08:00
Shaohua Li 4ac7a65f80 md-cluster: fix missing memory free
There are several places we allocate dlm_lock_resource, but not free it.

leave() need free a lock resource too (from Guoqing)
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Guoqing Jiang <gqjiang@suse.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-24 18:13:18 -08:00
Linus Torvalds 641203549a Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This is the block driver pull request for 4.5, with the exception of
  NVMe, which is in a separate branch and will be posted after this one.

  This pull request contains:

   - A set of bcache stability fixes, which have been acked by Kent.
     These have been used and tested for more than a year by the
     community, so it's about time that they got in.

   - A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
     and Markus Elfring, Oleg Drokin.

   - A set of fixes for xen blkback/front from the usual suspects, (Bob,
     Konrad) as well as community based fixes from Kiri, Julien, and
     Peng.

   - A 2038 time fix for sx8 from Shraddha, with a fix from me.

   - A small mtip32xx cleanup from Zhu Yanjun.

   - A null_blk division fix from Arnd"

* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
  null_blk: use sector_div instead of do_div
  mtip32xx: restrict variables visible in current code module
  xen/blkfront: Fix crash if backend doesn't follow the right states.
  xen/blkback: Fix two memory leaks.
  xen/blkback: make st_ statistics per ring
  xen/blkfront: Handle non-indirect grant with 64KB pages
  xen-blkfront: Introduce blkif_ring_get_request
  xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
  xen/blkback: Free resources if connect_ring failed.
  xen/blocks: Return -EXX instead of -1
  xen/blkback: make pool of persistent grants and free pages per-queue
  xen/blkback: get the number of hardware queues/rings from blkfront
  xen/blkback: pseudo support for multi hardware queues/rings
  xen/blkback: separate ring information out of struct xen_blkif
  xen/blkfront: correct setting for xen_blkif_max_ring_order
  xen/blkfront: make persistent grants pool per-queue
  xen/blkfront: Remove duplicate setting of ->xbdev.
  xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
  xen/blkfront: negotiate number of queues/rings to be used with backend
  xen/blkfront: split per device io_lock
  ...
2016-01-21 18:19:38 -08:00
Shaohua Li 849674e4fb MD: rename some functions
These short function names are hard to search. Rename them to make vim happy.

Signed-off-by: Shaohua Li <shli@fb.com>
2016-01-20 13:52:20 -08:00
Linus Torvalds 3c28c9ccaf md updates for 4.5
Mostly clustered-raid1 and raid5 journal updates.
 one Y2038 fix and other minor stuff.
 
 One patch removes me from the MAINTAINERS file and adds a record of
 my md maintainership to Credits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWmJEhAAoJEDnsnt1WYoG5raQQAI9lBrHO+Q8C8RImPsemLX0X
 ypjH38XUwEwKNYYfsCVI7PKAqCl7r8ITzY054gKsU0iHAfqLQlEN8aMz0v0fJQhg
 Msb7utrEMQ0UERwNcc+3J78ffFAdWrkHVd64Ley0h/pizFPSlL0K2RuIGTBc9sGX
 Hz2Ci11Ch7FdK7C/Zl7I6tK1pkthu3hBXYEZyg1GngRRhZEJj2U7mBmy1E37NA72
 o66B5r5FSlnIA8MAo/EAViCxtMJKBPRWU/WnkMhOJ1Yyw/FwMpbM2prLBLtYFqwF
 OLZOLuDUHY5HxdX2U+3R0hBzF78aozcH6od60SWg7wOmI/IkXYiYFujlxMd132FE
 OT+aa+UHHDEkATTSyt98OmxIkQ8uqKiNsSYqBk9lpNAPtmEbhqRX4RAOdrqP0G83
 DX7iyZpAK4YhB4BkJxMtNdSIOnss1TwfOdKyvoBZYmY6bTKh7p+dpw4cvIjV4VDi
 p6+BUQdJQ7mHRLV9QI4IuG52AJO8cRGc1OVvqLEMzO8uZlpyxX9nJrSqeP/dKKfa
 pJ5pYssilXEeKCDnODGqSRdt9aU4ENDW/oIkAW2U3cnSHUwBoMLF1WJ+M3Atbm+s
 i3/iDp26SnSiHM+DVHije5v0OGOroYdJwKDIFWToElcfc9Q5IDHU+KP8oeuPqqOS
 WA08l+zj+ahfP7Yu1DUC
 =gl+r
 -----END PGP SIGNATURE-----

Merge tag 'md/4.5' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Mostly clustered-raid1 and raid5 journal updates.  one Y2038 fix and
  other minor stuff.

  One patch removes me from the MAINTAINERS file and adds a record of my
  md maintainership to Credits"

Many thanks to Neil, who has been around for a _looong_ time.

* tag 'md/4.5' of git://neil.brown.name/md: (26 commits)
  md/raid: only permit hot-add of compatible integrity profiles
  Remove myself as MD Maintainer, and add to Credits.
  raid5-cache: handle journal hotadd in quiesce
  MD: add journal with array suspended
  md: set MD_HAS_JOURNAL in correct places
  md: Remove 'ready' field from mddev.
  md: remove unnecesary md_new_event_inintr
  raid5: allow r5l_io_unit allocations to fail
  raid5-cache: use a mempool for the metadata block
  raid5-cache: use a bio_set
  raid5-cache: add journal hot add/remove support
  drivers: md: use ktime_get_real_seconds()
  md: avoid warning for 32-bit sector_t
  raid5-cache: free meta_page earlier
  raid5-cache: simplify r5l_move_io_unit_list
  md: update comment for md_allow_write
  md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
  md-cluster: Protect communication with mutexes
  md-cluster: Defer MD reloading to mddev->thread
  md-cluster: update the documentation
  ...
2016-01-15 12:28:00 -08:00
Linus Torvalds 7d1fc01afc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  floppy: make local variable non-static
  exynos: fixes an incorrect header guard
  dt-bindings: fixes some incorrect header guards
  cpufreq-dt: correct dead link in documentation
  cpufreq: ARM big LITTLE: correct dead link in documentation
  treewide: Fix typos in printk
  Documentation: filesystem: Fix typo in fs/eventfd.c
  fs/super.c: use && instead of & for warn_on condition
  Documentation: fix sysfs-ptp
  lib: scatterlist: fix Kconfig description
2016-01-14 17:04:19 -08:00
Linus Torvalds d080827f85 libnvdimm for 4.5
1/ Media error handling: The 'badblocks' implementation that originated
    in md-raid is up-levelled to a generic capability of a block device.
    This initial implementation is limited to being consulted in the pmem
    block-i/o path.  Later, 'badblocks' will be consulted when creating
    dax mappings.
 
 2/ Raw block device dax: For virtualization and other cases that want
    large contiguous mappings of persistent memory, add the capability to
    dax-mmap a block device directly.
 
 3/ Increased /dev/mem restrictions: Add an option to treat all io-memory
    as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is
    actively using an address range.  This behavior is controlled via the
    new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the
    existing "iomem=relaxed" kernel command line option.
 
 4/ Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
    block device shutdown crash fix, and other small libnvdimm fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWlrhjAAoJEB7SkWpmfYgCFbAQALKsQfFwT6JFS+zlPgiNpbqw
 2VMNKEH0AfGYGj96mT02j2q+vSUmXLMIDMTsbe0sDdtwFZtQbFmhmryzPWUVppSu
 KGTlLPW8vuEhQVs91+UI3BQKkvpi0+tbR8hPOh9W6QhjpRT+lyHFKnsNR5HZy5wB
 K4/VMaT5ffd5/pXRTjkYiPQYTwWyfcvNjICj0YtqhPvOwS031m77JpFsWJ8HSpEX
 K99VlzNUPMXd1pYkHmFNXWw52fhRGNhwAEomLeKMdQfKms+KnbKp8BOSA0aCqU8E
 kpujQcilDXJwykFQZOFI3Z5Dxvrv8lxFTU8HRMBvo3ESzfTWjfqcvyjGOjDUcruw
 ihESFSJtdZzhrBiMnf9RRqSpMFJvAT8MVT6Q4D3mZUHCMPbUqFJsQjMPt9hEH3ho
 4F0D2lesOCkubUKFTZmjMoDb+szuKbVhYK8TeFVVEhizinc/Aj0NKuazJqi+CXB/
 xh0ER4ZxD8wvzqFFWvS5UvR1G9I5fr7+3jGRUrqGLHlSdeXP9dkEg28ao3QbWk3x
 1dPOen6ZqQ9WJ/E7eGmXbVEz2R4Xd79hMXQzdQwmKDk/KbxRoAp7hyU8BslAyrBf
 HCdmVt+RAgrxZYfFRXuLhqwEBThJnNrgZA3qu74FUpkpFg6xRUu1bAYBiF7N+bFi
 82b5UbMkveBTtkXjJoiR
 =7V5r
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has appeared in -next and independently received a
  build success notification from the kbuild robot.  The 'for-4.5/block-
  dax' topic branch was rebased over the weekend to drop the "block
  device end-of-life" rework that Al would like to see re-implemented
  with a notifier, and to address bug reports against the badblocks
  integration.

  There is pending feedback against "libnvdimm: Add a poison list and
  export badblocks" received last week.  Linda identified some localized
  fixups that we will handle incrementally.

  Summary:

   - Media error handling: The 'badblocks' implementation that
     originated in md-raid is up-levelled to a generic capability of a
     block device.  This initial implementation is limited to being
     consulted in the pmem block-i/o path.  Later, 'badblocks' will be
     consulted when creating dax mappings.

   - Raw block device dax: For virtualization and other cases that want
     large contiguous mappings of persistent memory, add the capability
     to dax-mmap a block device directly.

   - Increased /dev/mem restrictions: Add an option to treat all
     io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access
     while a driver is actively using an address range.  This behavior
     is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be
     overridden by the existing "iomem=relaxed" kernel command line
     option.

   - Miscellaneous fixes include a 'pfn'-device huge page alignment fix,
     block device shutdown crash fix, and other small libnvdimm fixes"

* tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits)
  block: kill disk_{check|set|clear|alloc}_badblocks
  libnvdimm, pmem: nvdimm_read_bytes() badblocks support
  pmem, dax: disable dax in the presence of bad blocks
  pmem: fail io-requests to known bad blocks
  libnvdimm: convert to statically allocated badblocks
  libnvdimm: don't fail init for full badblocks list
  block, badblocks: introduce devm_init_badblocks
  block: clarify badblocks lifetime
  badblocks: rename badblocks_free to badblocks_exit
  libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
  libnvdimm: Add a poison list and export badblocks
  nfit_test: Enable DSMs for all test NFITs
  md: convert to use the generic badblocks code
  block: Add badblock management for gendisks
  badblocks: Add core badblock management code
  block: fix del_gendisk() vs blkdev_ioctl crash
  block: enable dax for raw block devices
  block: introduce bdev_file_inode()
  restrict /dev/mem to idle io memory ranges
  arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
  ...
2016-01-13 19:15:14 -08:00
Dan Williams 1501efadc5 md/raid: only permit hot-add of compatible integrity profiles
It is not safe for an integrity profile to be changed while i/o is
in-flight in the queue.  Prevent adding new disks or otherwise online
spares to an array if the device has an incompatible integrity profile.

The original change to the blk_integrity_unregister implementation in
md, commmit c7bfced9a6 "md: suspend i/o during runtime
blk_integrity_unregister" introduced an immediate hang regression.

This policy of disallowing changes the integrity profile once one has
been established is shared with DM.

Here is an abbreviated log from a test run that:
1/ Creates a degraded raid1 with an integrity-enabled device (pmem0s) [   59.076127]
2/ Tries to add an integrity-disabled device (pmem1m) [   90.489209]
3/ Retries with an integrity-enabled device (pmem1s) [  205.671277]

[   59.076127] md/raid1:md0: active with 1 out of 2 mirrors
[   59.078302] md: data integrity enabled on md0
[..]
[   90.489209] md0: incompatible integrity profile for pmem1m
[..]
[  205.671277] md: super_written gets error=-5
[  205.677386] md/raid1:md0: Disk failure on pmem1m, disabling device.
[  205.677386] md/raid1:md0: Operation continuing on 1 devices.
[  205.683037] RAID1 conf printout:
[  205.684699]  --- wd:1 rd:2
[  205.685972]  disk 0, wo:0, o:1, dev:pmem0s
[  205.687562]  disk 1, wo:1, o:1, dev:pmem1s
[  205.691717] md: recovery of RAID array md0

Fixes: c7bfced9a6 ("md: suspend i/o during runtime blk_integrity_unregister")
Cc: <stable@vger.kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:57 +11:00
Shaohua Li 16a43f6a65 raid5-cache: handle journal hotadd in quiesce
Handle journal hotadd in quiesce to avoid creating duplicated threads.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Shaohua Li 87d4d91616 MD: add journal with array suspended
Hot add journal disk in recovery thread context brings a lot of trouble
as IO could be running. Unlike spare disk hot add, adding journal disk
with array suspended makes more sense and implmentation is much easier.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Shaohua Li a62ab49eb5 md: set MD_HAS_JOURNAL in correct places
Set MD_HAS_JOURNAL when a array is loaded or journal is initialized.
This is to avoid the flags set too early in journal disk hotadd.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-14 11:49:43 +11:00
Linus Torvalds 33caf82acf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of stuff.  That probably should've been 5 or 6 separate
  branches, but by the time I'd realized how large and mixed that bag
  had become it had been too close to -final to play with rebasing.

  Some fs/namei.c cleanups there, memdup_user_nul() introduction and
  switching open-coded instances, burying long-dead code, whack-a-mole
  of various kinds, several new helpers for ->llseek(), assorted
  cleanups and fixes from various people, etc.

  One piece probably deserves special mention - Neil's
  lookup_one_len_unlocked().  Similar to lookup_one_len(), but gets
  called without ->i_mutex and tries to avoid ever taking it.  That, of
  course, means that it's not useful for any directory modifications,
  but things like getting inode attributes in nfds readdirplus are fine
  with that.  I really should've asked for moratorium on lookup-related
  changes this cycle, but since I hadn't done that early enough...  I
  *am* asking for that for the coming cycle, though - I'm going to try
  and get conversion of i_mutex to rwsem with ->lookup() done under lock
  taken shared.

  There will be a patch closer to the end of the window, along the lines
  of the one Linus had posted last May - mechanical conversion of
  ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
  inode_is_locked()/inode_lock_nested().  To quote Linus back then:

    -----
    |    This is an automated patch using
    |
    |        sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    |        sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    |        sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[     ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    |        sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    |        sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    |    with a very few manual fixups
    -----

  I'm going to send that once the ->i_mutex-affecting stuff in -next
  gets mostly merged (or when Linus says he's about to stop taking
  merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  nfsd: don't hold i_mutex over userspace upcalls
  fs:affs:Replace time_t with time64_t
  fs/9p: use fscache mutex rather than spinlock
  proc: add a reschedule point in proc_readfd_common()
  logfs: constify logfs_block_ops structures
  fcntl: allow to set O_DIRECT flag on pipe
  fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
  fs: xattr: Use kvfree()
  [s390] page_to_phys() always returns a multiple of PAGE_SIZE
  nbd: use ->compat_ioctl()
  fs: use block_device name vsprintf helper
  lib/vsprintf: add %*pg format specifier
  fs: use gendisk->disk_name where possible
  poll: plug an unused argument to do_poll
  amdkfd: don't open-code memdup_user()
  cdrom: don't open-code memdup_user()
  rsxx: don't open-code memdup_user()
  mtip32xx: don't open-code memdup_user()
  [um] mconsole: don't open-code memdup_user_nul()
  [um] hostaudio: don't open-code memdup_user()
  ...
2016-01-12 17:11:47 -08:00
Linus Torvalds 03891f9c85 - The most significant set of changes this cycle is the Forward Error
Correction (FEC) support that has been added to the DM verity target.
   Google uses DM verity on all Android devices and it is believed that
   this FEC support will enable DM verity to recover from storage
   failures seen since DM verity was first deployed as part of Android.
 
 - A stable fix for a race in the destruction of DM thin pool's workqueue
 
 - A stable fix for hung IO if a DM snapshot copy hit an error
 
 - A few small cleanups in DM core and DM persistent data.
 
 - A couple DM thinp range discard improvements (address atomicity of
   finding a range and the efficiency of discarding a partially mapped
   thin device)
 
 - Add ability to debug DM bufio leaks by recording stack trace when a
   buffer is allocated.  Upon detected leak the recorded stack is dumped.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWlAf0AAoJEMUj8QotnQNaHPIH/2rvnzg71RsP/7IRI/5DHETP
 ubxKhKd7tfqwJEjuQvhiYB1Ubo+gvXEuT51C2G1ug2QzHsjymmE14q/60ElB7+/U
 ++bGisWvqm4ZqWWM9yffqbESzNOfNTn7dLduaxGeLxVG3zVLfzQRfSPOqhk1FiIv
 H35v0Xx/j1NAHQtcocVYzG4P5BwfgmeyuYmUq8BklHNlwa3drBKnMZfIlF4u2216
 Z3K7d+5nLpSsPyejzpQlByHTUt/eVy1Y2ZBgudWITaP5DAcUQwHyLZI4k3skmMiK
 O/xLZ54aeKI9NhtEwH8s8jOd3b7Kvw/oAw5nfPj7jmIDF3if8U2HCU6KgfBVwwU=
 =fOsS
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - The most significant set of changes this cycle is the Forward Error
   Correction (FEC) support that has been added to the DM verity target.

   Google uses DM verity on all Android devices and it is believed that
   this FEC support will enable DM verity to recover from storage
   failures seen since DM verity was first deployed as part of Android.

 - A stable fix for a race in the destruction of DM thin pool's
   workqueue

 - A stable fix for hung IO if a DM snapshot copy hit an error

 - A few small cleanups in DM core and DM persistent data.

 - A couple DM thinp range discard improvements (address atomicity of
   finding a range and the efficiency of discarding a partially mapped
   thin device)

 - Add ability to debug DM bufio leaks by recording stack trace when a
   buffer is allocated.  Upon detected leak the recorded stack is
   dumped.

* tag 'dm-4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: fix hung bios when copy error occurs
  dm thin: bump thin and thin-pool target versions
  dm thin: fix race condition when destroying thin pool workqueue
  dm space map metadata: remove unused variable in brb_pop()
  dm verity: add ignore_zero_blocks feature
  dm verity: add support for forward error correction
  dm verity: factor out verity_for_bv_block()
  dm verity: factor out structures and functions useful to separate object
  dm verity: move dm-verity.c to dm-verity-target.c
  dm verity: separate function for parsing opt args
  dm verity: clean up duplicate hashing code
  dm btree: factor out need_insert() helper
  dm bufio: use BUG_ON instead of conditional call to BUG
  dm bufio: store stacktrace in buffers to help find buffer leaks
  dm bufio: return NULL to improve code clarity
  dm block manager: cleanup code that prints stacktrace
  dm: don't save and restore bi_private
  dm thin metadata: make dm_thin_find_mapped_range() atomic
  dm thin metadata: speed up discard of partially mapped volumes
2016-01-11 22:25:00 -08:00
Dan Williams d3b407fb3f badblocks: rename badblocks_free to badblocks_exit
For symmetry with badblocks_init() make it clear that this path only
destroys incremental allocations of a badblocks instance, and does not
free the badblocks instance itself.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Vishal Verma fc974ee2bf md: convert to use the generic badblocks code
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Mikulas Patocka 385277bfb5 dm snapshot: fix hung bios when copy error occurs
When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.

The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception.  complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).

The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.

If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not.  This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.

Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful.  A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store.  Also, remove commit_callback now that
it is merely a wrapper around pending_complete.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2016-01-08 20:03:05 -05:00
Mike Snitzer 1c2e54e1ed dm thin: bump thin and thin-pool target versions
Commit 3d5f6733 ("dm thin metadata: speed up discard of partially mapped
volumes"), or some other dm-thinp change during the Linux 4.5
development window, really should've bumped these target versions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-01-06 20:59:40 -05:00
NeilBrown 274d8cbde1 md: Remove 'ready' field from mddev.
This field is always set in tandem with ->pers, and when it is tested
->pers is also tested.  So ->ready is not needed.

It was needed once, but code rearrangement and locking changes have
removed that needed.

Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-07 11:01:14 +11:00
Guoqing Jiang bb9ef71646 md: remove unnecesary md_new_event_inintr
md_new_event had removed sysfs_notify since 'commit 72a23c211e
("Make sure all changes to md/sync_action are notified.")', so we
can use md_new_event and delete md_new_event_inintr.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-07 11:01:14 +11:00
Christoph Hellwig 5036c39020 raid5: allow r5l_io_unit allocations to fail
And propagate the error up the stack so we can add the stripe
to no_stripes_list and retry our log operation later.  This avoids
blocking raid5d due to reclaim, an it allows to get rid of the
deadlock-prone GFP_NOFAIL allocation.

shli: add missing mempool_destroy()

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:12 +11:00
Christoph Hellwig e8deb63810 raid5-cache: use a mempool for the metadata block
We only have a limited number in flight, so use a page based mempool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:08 +11:00
Christoph Hellwig c38d29b33b raid5-cache: use a bio_set
This allows us to make guaranteed forward progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:40:04 +11:00
Shaohua Li f6b6ec5cfa raid5-cache: add journal hot add/remove support
Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:57 +11:00
Deepa Dinamani 9ebc6ef188 drivers: md: use ktime_get_real_seconds()
get_seconds() API is not y2038 safe on 32 bit systems and the API
is deprecated. Replace it with calls to ktime_get_real_seconds()
API instead. Change mddev structure types to time64_t accordingly.

32 bit signed timestamps will overflow in the year 2038.

Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.

Clamp the tim64_t timestamps to positive values with a max of U32_MAX
when returning from GET_ARRAY_INFO ioctl to accommodate above changes
in the data type of timestamps to unsigned int.

v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.

Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types. Remove the truncation of the bits while
writing to or reading from these from the disk.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:53 +11:00
Arnd Bergmann 3312c951ef md: avoid warning for 32-bit sector_t
When CONFIG_LBDAF is not set, sector_t is only 32-bits wide, which
means we cannot have devices with more than 2TB, and the code that
is trying to handle compatibility support for large devices in
md version 0.90 is meaningless but also causes a compile-time warning:

drivers/md/md.c: In function 'super_90_load':
drivers/md/md.c:1029:19: warning: large integer implicitly truncated to unsigned type [-Woverflow]
drivers/md/md.c: In function 'super_90_rdev_size_change':
drivers/md/md.c:1323:17: warning: large integer implicitly truncated to unsigned type [-Woverflow]

This adds a check for CONFIG_LBDAF to avoid even getting into this
code path, and also adds an explicit cast to let the compiler know
it doesn't have to warn about the truncation.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:48 +11:00
Christoph Hellwig ad66d445ee raid5-cache: free meta_page earlier
Once the I/O completed we don't need the meta page anymore.  As the iounits
can live on for a long time this reduces memory pressure a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:43 +11:00
Christoph Hellwig 3848c0bcb0 raid5-cache: simplify r5l_move_io_unit_list
It's only used for one kind of move, so make that explicit.  Also clean
up the code a bit by using list_for_each_safe.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:34 +11:00
Guoqing Jiang abf3508d8f md: update comment for md_allow_write
MD_CHANGE_CLEAN had been replaced with MD_CHANGE_PENDING after
commit 070dc6 ("md: resolve confusion of MD_CHANGE_CLEAN"),
so make the change accordingly.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:26 +11:00
Guoqing Jiang e19508fa4d md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
1. fix unbalanced parentheses.
2. add more description about that MD_CLUSTER_SEND_LOCKED_ALREADY
   will be cleared after set it in add_new_disk.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:21 +11:00
Guoqing Jiang 8b9277c814 md-cluster: Protect communication with mutexes
Communication can happen through multiple threads. It is possible that
one thread steps over another threads sequence. So, we use mutexes to
protect both the send and receive sequences.

Send communication is locked through state bit, MD_CLUSTER_SEND_LOCK.
Communication is locked with bit manipulation in order to allow
"lock and hold" for the add operation. In case of an add operation,
if the lock is held, MD_CLUSTER_SEND_LOCKED_ALREADY is set.
When md_update_sb() calls metadata_update_start(), it checks
(in a single statement to avoid races), if the communication
is already locked. If yes, it merely returns zero, else it
locks the token lockresource.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:17 +11:00
Guoqing Jiang 15858fa5b0 md-cluster: Defer MD reloading to mddev->thread
Reloading of superblock must be performed under reconfig_mutex. However,
this cannot be done with md_reload_sb because it would deadlock with
the message DLM lock. So, we defer it in md_check_recovery() which is
executed by mddev->thread.

This introduces a new flag, MD_RELOAD_SB, which if set, will reload the
superblock. And good_device_nr is also added to 'struct mddev' which is
used to get the num of the good device within cluster raid.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:39:10 +11:00
Guoqing Jiang f6a2dc64ee md-cluster: append some actions when change bitmap from clustered to none
For clustered raid, we need to do extra actions when change
bitmap to none.

1. check if all the bitmap lock could be get or not, if yes then
   we can continue the change since cluster raid is only active
   in current node. Otherwise return fail and unlock the related
   bitmap locks
2. set nodes to 0 and then leave cluster environment.
3. release other nodes's bitmap lock.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:57 +11:00
Goldwyn Rodrigues 09afd2a8d6 md-cluster: Allow spare devices to be marked as faulty
If a spare device was marked faulty, it would not be reflected
in receiving nodes because it would mark it as activated and continue.
Continue the operation, so it may be set as faulty.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:51 +11:00
Goldwyn Rodrigues 54a88392cd md-cluster: Fix the remove sequence with the new MD reload code
The remove disk message does not need metadata_update_start(), but
can be an independent message.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:42 +11:00
Guoqing Jiang 659b254fa7 md-cluster: remove a disk asynchronously from cluster environment
For cluster raid, if one disk couldn't be reach in one node, then
other nodes would receive the REMOVE message for the disk.

In receiving node, we can't call md_kick_rdev_from_array to remove
the disk from array synchronously since the disk might still be busy
in this node. So let's set a ClusterRemove flag on the disk, then
let the thread to do the removal job eventually.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:36 +11:00
Goldwyn Rodrigues ac277c6a8a md-cluster: Avoid the resync ping-pong
If a RESYNCING message with (0,0) has been sent before, do not send it
again. This avoids a resync ping pong between the nodes. We read
the bitmap lockresource's LVB to figure out the previous value
of the RESYNCING message.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:27 +11:00
Roman Gushchin b46020aa3a md/raid5: remove redundant check in stripe_add_to_batch_list()
The stripe_add_to_batch_list() function is called only if
stripe_can_batch() returned true, so there is no need for double check.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Neil Brown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
2016-01-06 11:38:22 +11:00
Al Viro 93bbf5831d md: more open-coded offset_in_page()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:29:12 -05:00
Al Viro 756d097b95 dm-bufio: virt_to_phys() doesn't change remainder modulo PAGE_SIZE
... so virt_to_phys(p) & (PAGE_SIZE - 1) is a very odd way to
spell offset_in_page(p).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:29:07 -05:00
Kent Overstreet 627ccd20b4 bcache: Change refill_dirty() to always scan entire disk if necessary
Previously, it would only scan the entire disk if it was starting from
the very start of the disk - i.e. if the previous scan got to the end.

This was broken by refill_full_stripes(), which updates last_scanned so
that refill_dirty was never triggering the searched_from_start path.

But if we change refill_dirty() to always scan the entire disk if
necessary, regardless of what last_scanned was, the code gets cleaner
and we fix that bug too.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:16 -07:00
Stefan Bader 8d16ce540c bcache: prevent crash on changing writeback_running
Added a safeguard in the shutdown case. At least while not being
attached it is also possible to trigger a kernel bug by writing into
writeback_running. This change  adds the same check before trying to
wake up the thread for that case.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:14 -07:00
Gabriel de Perthuis d7076f2162 bcache: allows use of register in udev to avoid "device_busy" error.
Allows to use register, not register_quiet in udev to avoid "device_busy" error.
The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis
<g2p.code@gmail.com> does not unlock the mutex and hangs the kernel.

See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion.

Cc: Denis Bychkov <manover@gmail.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Gabriel de Perthuis <g2p.code@gmail.com>
Cc: stable@vger.kernel.org

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:13 -07:00
Zheng Liu 2ecf0cdb2b bcache: unregister reboot notifier if bcache fails to unregister device
In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device.  This commit fixes this.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:11 -07:00
Al Viro 4d4d8573a8 bcache: fix a leak in bch_cached_dev_run()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:10 -07:00
Zheng Liu fecaee6f20 bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing device
This bug can be reproduced by the following script:

  #!/bin/bash

  bcache_sysfs="/sys/fs/bcache"

  function clear_cache()
  {
  	if [ ! -e $bcache_sysfs ]; then
  		echo "no bcache sysfs"
  		exit
  	fi

  	cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}')
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach"
  	sleep 5
  	sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach"
  }

  for ((i=0;i<10;i++)); do
  	clear_cache
  done

The warning messages look like below:
[  275.948611] ------------[ cut here ]------------
[  275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P        W
---------------   )
[  275.979253] Hardware name: Tecal RH2285
[  275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache'
[  276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.072643] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.089315] Call Trace:
[  276.105801]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.122650]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.139361]  [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0
[  276.156012]  [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170
[  276.172682]  [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20
[  276.189282]  [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache]
[  276.205993]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.222794]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.239680]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.256594]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.273364]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.290133]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.306368]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]---
[  276.338241] ------------[ cut here ]------------
[  276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720
bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P        W  ---------------   )
[  276.386017] Hardware name: Tecal RH2285
[  276.401430] Couldn't create device <-> cache set symlinks
[  276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler
bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801
i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas
pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
[  276.465477] Pid: 2765, comm: sh Tainted: P        W  ---------------    2.6.32 #1
[  276.482169] Call Trace:
[  276.498610]  [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0
[  276.515405]  [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50
[  276.532059]  [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache]
[  276.548808]  [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache]
[  276.565569]  [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache]
[  276.582418]  [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110
[  276.599341]  [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170
[  276.616142]  [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0
[  276.632607]  [<ffffffff811890b1>] ? sys_write+0x51/0x90
[  276.648671]  [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b
[  276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]---

We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach()
function when we attach a backing device first time.  After detaching this
backing device, this flag will be true and sysfs_remove_link() isn't called in
bcache_device_unlink().  Then when we attach this backing device again,
sysfs_create_link() will return EEXIST error in bcache_device_link().

So the fix is trival and we clear this flag in bcache_device_link().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:08 -07:00
Kent Overstreet c5f1e5adf9 bcache: Add a cond_resched() call to gc
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:06 -07:00
Zheng Liu 2ef9ccbfcb bcache: fix a livelock when we cause a huge number of cache misses
Subject :	[PATCH v2] bcache: fix a livelock in btree lock
Date :	Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)

This commit tries to fix a livelock in bcache.  This livelock might
happen when we causes a huge number of cache misses simultaneously.

When we get a cache miss, bcache will execute the following path.

->cached_dev_make_request()
  ->cached_dev_read()
    ->cached_lookup()
      ->bch->btree_map_keys()
        ->btree_root()  <------------------------
          ->bch_btree_map_keys_recurse()        |
            ->cache_lookup_fn()                 |
              ->cached_dev_cache_miss()         |
                ->bch_btree_insert_check_key() -|
                  [If btree->seq is not equal to seq + 1, we should return
                   EINTR and traverse btree again.]

In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock.  During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others.  Thus no
one can make progress.

This commit will try to take write btree->lock if it encounters a race
when we traverse btree.  Although it sacrifice the scalability but we
can ensure that only one can modify the btree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Joshua Schmid <jschmid@suse.com>
Cc: Zhu Yanhai <zhu.yanhai@gmail.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-30 20:23:05 -07:00
NeilBrown 312045eef9 md: remove check for MD_RECOVERY_NEEDED in action_store.
md currently doesn't allow a 'sync_action' such as 'reshape' to be set
while MD_RECOVERY_NEEDED is set.

This s a problem, particularly since commit 738a273806 as that can
cause ->check_shape to call mddev_resume() which sets
MD_RECOVERY_NEEDED.  So by the time we come to start 'reshape' it is
very likely that MD_RECOVERY_NEEDED is still set.

Testing for this flag is not really needed and is in any case very
racy as it can be set at any moment - asynchronously.  Any race
between setting a sync_action and setting MD_RECOVERY_NEEDED must
already be handled properly in some locked code, probably
md_check_recovery(), so remove the test here.

The test on MD_RECOVERY_RUNNING is also racy in the 'reshape' case
so we should test it again after getting mddev_lock().

As this fixes a race and a regression which can cause 'reshape' to
fail, it is suitable for -stable kernels since 4.1

Reported-by: Xiao Ni <xni@redhat.com>
Fixes: 738a273806 ("md/raid5: fix allocation of 'scribble' array.")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-21 11:10:06 +11:00
Goldwyn Rodrigues cb01c5496d Fix remove_and_add_spares removes drive added as spare in slot_store
Commit 2910ff17d1
introduced a regression which would remove a recently added spare via
slot_store. Revert part of the patch which touches slot_store() and add
the disk directly using pers->hot_add_disk()

Fixes: 2910ff17d1 ("md: remove_and_add_spares() to activate specific
rdev")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18 15:19:16 +11:00
Mikulas Patocka 0dc10e50f2 md: fix bug due to nested suspend
The patch c7bfced9a6 committed to 4.4-rc
causes crash in LVM test shell/lvchange-raid.sh. The kernel crashes with
this BUG, the reason is that we attempt to suspend a device that is
already suspended. See also
https://bugzilla.redhat.com/show_bug.cgi?id=1283491

This patch fixes the bug by changing functions mddev_suspend and
mddev_resume to always nest.
The number of nested calls to mddev_nested_suspend is kept in the
variable mddev->suspended.
[neilb: made mddev_suspend() always nest instead of introduce mddev_nested_suspend]

kernel BUG at drivers/md/md.c:317!
CPU: 3 PID: 32754 Comm: lvm Not tainted 4.4.0-rc2 #1
task: 0000000047076040 ti: 0000000047014000 task.ti: 0000000047014000

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000000000001111 Not tainted
r00-03  000000000804000f 00000000102c5280 0000000010c7522c 000000007e3d1810
r04-07  0000000010c6f000 000000004ef37f20 000000007e3d1dd0 000000007e3d1810
r08-11  000000007c9f1600 0000000000000000 0000000000000001 ffffffffffffffff
r12-15  0000000010c1d000 0000000000000041 00000000f98d63c8 00000000f98e49e4
r16-19  00000000f98e49e4 00000000c138fd06 00000000f98d63c8 0000000000000001
r20-23  0000000000000002 000000004ef37f00 00000000000000b0 00000000000001d1
r24-27  00000000424783a0 000000007e3d1dd0 000000007e3d1810 00000000102b2000
r28-31  0000000000000001 0000000047014840 0000000047014930 0000000000000001
sr00-03  0000000007040800 0000000000000000 0000000000000000 0000000007040800
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000102c538c 00000000102c5390
 IIR: 03ffe01f    ISR: 0000000000000000  IOR: 00000000102b2748
 CPU:        3   CR30: 0000000047014000 CR31: 0000000000000000
 ORIG_R28: 00000000000000b0
 IAOQ[0]: mddev_suspend+0x10c/0x160 [md_mod]
 IAOQ[1]: mddev_suspend+0x110/0x160 [md_mod]
 RP(r2): raid1_add_disk+0xd4/0x2c0 [raid1]
Backtrace:
 [<0000000010c7522c>] raid1_add_disk+0xd4/0x2c0 [raid1]
 [<0000000010c20078>] raid_resume+0x390/0x418 [dm_raid]
 [<00000000105833e8>] dm_table_resume_targets+0xc0/0x188 [dm_mod]
 [<000000001057f784>] dm_resume+0x144/0x1e0 [dm_mod]
 [<0000000010587dd4>] dev_suspend+0x1e4/0x568 [dm_mod]
 [<0000000010589278>] ctl_ioctl+0x1e8/0x428 [dm_mod]
 [<0000000010589518>] dm_compat_ctl_ioctl+0x18/0x68 [dm_mod]
 [<0000000040377b88>] compat_SyS_ioctl+0xd0/0x1558

Fixes: c7bfced9a6 ("md: suspend i/o during runtime blk_integrity_unregister")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18 15:19:16 +11:00
Shaohua Li 9b15603dbd MD: change journal disk role to disk 0
Neil pointed out setting journal disk role to raid_disks will confuse
reshape if we support reshape eventually. Switching the role to 0 (we
should be fine as long as the value >=0) and skip sysfs file creation to
avoid error.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18 15:19:16 +11:00
Artur Paszkiewicz cc57858831 md/raid10: fix data corruption and crash during resync
The commit c31df25f20 ("md/raid10: make sync_request_write() call
bio_copy_data()") replaced manual data copying with bio_copy_data() but
it doesn't work as intended. The source bio (fbio) is already processed,
so its bvec_iter has bi_size == 0 and bi_idx == bi_vcnt.  Because of
this, bio_copy_data() either does not copy anything, or worse, copies
data from the ->bi_next bio if it is set.  This causes wrong data to be
written to drives during resync and sometimes lockups/crashes in
bio_copy_data():

[  517.338478] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [md126_raid10:3319]
[  517.347324] Modules linked in: raid10 xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul cryptd shpchp pcspkr ipmi_si ipmi_msghandler tpm_crb acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sr_mod cdrom sd_mod e1000e ax88179_178a usbnet mii ahci ata_generic crc32c_intel libahci ptp pata_acpi libata pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[  517.440555] CPU: 0 PID: 3319 Comm: md126_raid10 Not tainted 4.3.0-rc6+ #1
[  517.448384] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0055.D14.1509221924 09/22/2015
[  517.459768] task: ffff880153773980 ti: ffff880150df8000 task.ti: ffff880150df8000
[  517.468529] RIP: 0010:[<ffffffff812e1888>]  [<ffffffff812e1888>] bio_copy_data+0xc8/0x3c0
[  517.478164] RSP: 0018:ffff880150dfbc98  EFLAGS: 00000246
[  517.484341] RAX: ffff880169356688 RBX: 0000000000001000 RCX: 0000000000000000
[  517.492558] RDX: 0000000000000000 RSI: ffffea0001ac2980 RDI: ffffea0000d835c0
[  517.500773] RBP: ffff880150dfbd08 R08: 0000000000000001 R09: ffff880153773980
[  517.508987] R10: ffff880169356600 R11: 0000000000001000 R12: 0000000000010000
[  517.517199] R13: 000000000000e000 R14: 0000000000000000 R15: 0000000000001000
[  517.525412] FS:  0000000000000000(0000) GS:ffff880174a00000(0000) knlGS:0000000000000000
[  517.534844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  517.541507] CR2: 00007f8a044d5fed CR3: 0000000169504000 CR4: 00000000001406f0
[  517.549722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  517.557929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  517.566144] Stack:
[  517.568626]  ffff880174a16bc0 ffff880153773980 ffff880169356600 0000000000000000
[  517.577659]  0000000000000001 0000000000000001 ffff880153773980 ffff88016a61a800
[  517.586715]  ffff880150dfbcf8 0000000000000001 ffff88016dd209e0 0000000000001000
[  517.595773] Call Trace:
[  517.598747]  [<ffffffffa043ef95>] raid10d+0xfc5/0x1690 [raid10]
[  517.605610]  [<ffffffff816697ae>] ? __schedule+0x29e/0x8e2
[  517.611987]  [<ffffffff814ff206>] md_thread+0x106/0x140
[  517.618072]  [<ffffffff810c1d80>] ? wait_woken+0x80/0x80
[  517.624252]  [<ffffffff814ff100>] ? super_1_load+0x520/0x520
[  517.630817]  [<ffffffff8109ef89>] kthread+0xc9/0xe0
[  517.636506]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70
[  517.643653]  [<ffffffff8166d99f>] ret_from_fork+0x3f/0x70
[  517.649929]  [<ffffffff8109eec0>] ? flush_kthread_worker+0x70/0x70

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-by: Shaohua Li <shli@kernel.org>
Cc: stable@vger.kernel.org (v4.2+)
Fixes: c31df25f20 ("md/raid10: make sync_request_write() call bio_copy_data()")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-12-18 15:19:16 +11:00
Nikolay Borisov 18d03e8c25 dm thin: fix race condition when destroying thin pool workqueue
When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running.  This can cause the work item to
requeue itself on an already destroyed workqueue.  Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.

Fixes: 905e51b39a ("dm thin: commit outstanding data every second")
Fixes: 85ad643b7e ("dm thin: add timeout to stop out-of-data-space mode holding IO forever")
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-17 15:47:20 -05:00
Mike Snitzer 512167788a dm space map metadata: remove unused variable in brb_pop()
Remove the unused struct block_op pointer that was inadvertantly
introduced, via cut-and-paste of previous brb_op() code, as part of
commit 50dd842ad.

(Cc'ing stable@ because commit 50dd842ad did)

Fixes: 50dd842ad ("dm space map metadata: fix ref counting bug when bootstrapping a new space map")
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-14 09:26:01 -05:00
Sami Tolvanen 0cc37c2df4 dm verity: add ignore_zero_blocks feature
If ignore_zero_blocks is enabled dm-verity will return zeroes for blocks
matching a zero hash without validating the content.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:03 -05:00
Sami Tolvanen a739ff3f54 dm verity: add support for forward error correction
Add support for correcting corrupted blocks using Reed-Solomon.

This code uses RS(255, N) interleaved across data and hash
blocks. Each error-correcting block covers N bytes evenly
distributed across the combined total data, so that each byte is a
maximum distance away from the others. This makes it possible to
recover from several consecutive corrupted blocks with relatively
small space overhead.

In addition, using verity hashes to locate erasures nearly doubles
the effectiveness of error correction. Being able to detect
corrupted blocks also improves performance, because only corrupted
blocks need to corrected.

For a 2 GiB partition, RS(255, 253) (two parity bytes for each
253-byte block) can correct up to 16 MiB of consecutive corrupted
blocks if erasures can be located, and 8 MiB if they cannot, with
16 MiB space overhead.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:03 -05:00
Sami Tolvanen bb4d73ac5e dm verity: factor out verity_for_bv_block()
verity_for_bv_block() will be re-used by optional dm-verity object.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:02 -05:00
Sami Tolvanen ffa393807c dm verity: factor out structures and functions useful to separate object
Prepare for an optional verity object to make use of existing dm-verity
structures and functions.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:01 -05:00
Sami Tolvanen 03045cbafa dm verity: move dm-verity.c to dm-verity-target.c
Prepare for extending dm-verity with an optional object.  Follows the
naming convention used by other DM targets (e.g. dm-cache and dm-era).

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:01 -05:00
Sami Tolvanen 753c1fd028 dm verity: separate function for parsing opt args
Move optional argument parsing into a separate function to make it
easier to add more of them without making verity_ctr even longer.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:00 -05:00
Sami Tolvanen 6dbeda3469 dm verity: clean up duplicate hashing code
Handle dm-verity salting in one place to simplify the code.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:39:00 -05:00
Mike Snitzer ba503835ad dm btree: factor out need_insert() helper
Eliminates code duplication within insert().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:59 -05:00
Anup Limbu 86a49e2dac dm bufio: use BUG_ON instead of conditional call to BUG
Signed-off-by: Anup Limbu <anuplimbu14@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:58 -05:00
Mikulas Patocka 86bad0c707 dm bufio: store stacktrace in buffers to help find buffer leaks
The option DM_DEBUG_BLOCK_STACK_TRACING is moved from persistent-data
directory to device mapper directory because it will now be used by
persistent-data and bufio.  When the option is enabled, each bufio buffer
stores the stacktrace of the last dm_bufio_get(), dm_bufio_read() or
dm_bufio_new() call that increased the hold count to 1.  The buffer's
stacktrace is printed if the buffer was not released before the bufio
client is destroyed.

When DM_DEBUG_BLOCK_STACK_TRACING is enabled, any bufio buffer leaks are
considered warnings - i.e. the kernel continues afterwards.  If not
enabled, buffer leaks are considered BUGs and the kernel with crash.
Reasoning on this disposition is: if we only ever warned on buffer leaks
users would generally ignore them and the problematic code would never
get fixed.

Successfully used to find source of bufio leaks fixed with commit
fce079f63c3 ("dm btree: fix bufio buffer leaks in dm_btree_del() error
path").

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:58 -05:00
Mikulas Patocka f98c8f7970 dm bufio: return NULL to improve code clarity
A small code cleanup in new_read() - return NULL instead of b (although
b is NULL at this point).  This function is not returning pointer to the
buffer, it is returning a pointer to the bufffer's data, thus it makes
no sense to return the variable b.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:57 -05:00
Mikulas Patocka 313c9b9736 dm block manager: cleanup code that prints stacktrace
There is no need to record stack trace and immediately print it.  Just
use dump_stack() to print the current stack.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:56 -05:00
Mikulas Patocka fe3265b180 dm: don't save and restore bi_private
Device mapper used the field bi_private to point to dm_target_io. However,
since kernel 3.15, the bi_private field is unused, and so the targets do
not need to save and restore this field.

This patch removes code that saves and restores bi_private from dm-cache,
dm-snapshot and dm-verity.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:56 -05:00
Joe Thornber 086fbbbda9 dm thin metadata: make dm_thin_find_mapped_range() atomic
Refactor dm_thin_find_mapped_range() so that it takes the read lock on
the metadata's lock; rather than relying on finer grained locking that
is pushed down inside dm_thin_find_next_mapped_block() and
dm_thin_find_block().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:38:55 -05:00
Joe Thornber 3d5f67332a dm thin metadata: speed up discard of partially mapped volumes
Use dm_btree_lookup_next() to more quickly discard partially mapped
volumes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-12-10 10:30:56 -05:00
Joe Thornber ed8b45a367 dm btree: fix bufio buffer leaks in dm_btree_del() error path
If dm_btree_del()'s call to push_frame() fails, e.g. due to
btree_node_validator finding invalid metadata, the dm_btree_del() error
path must unlock all frames (which have active dm-bufio buffers) that
were pushed onto the del_stack.

Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
buffers have leaked, e.g.:
  device-mapper: bufio: leaked buffer 3, hold count 1, list 0

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-10 10:30:18 -05:00
Joe Thornber 50dd842ad8 dm space map metadata: fix ref counting bug when bootstrapping a new space map
When applying block operations (BOPs) do not remove them from the
uncommitted BOP ring-buffer until after they've been applied -- in case
we recurse.

Also, perform BOP_INC operation, in dm_sm_metadata_create() and
sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather
than using direct calls to sm_ll_inc().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-09 13:27:25 -05:00
Joe Thornber 49e99fc717 dm thin metadata: fix bug when taking a metadata snapshot
When you take a metadata snapshot the btree roots for the mapping and
details tree need to have their reference counts incremented so they
persist for the lifetime of the metadata snap.

The roots being incremented were those currently written in the
superblock, which could possibly be out of date if concurrent IO is
triggering new mappings, breaking of sharing, etc.

Fix this by performing a commit with the metadata lock held while taking
a metadata snapshot.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-09 13:18:12 -05:00
Masanari Iida e3d132d123 treewide: Fix typos in printk
This patch fix multiple spelling typos found in
various part of kernel.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-08 14:59:19 +01:00
Joe Thornber 993ceab919 dm thin metadata: fix bug in dm_thin_remove_range()
dm_btree_remove_leaves() only unmaps a contiguous region so we need a
loop, in __remove_range(), to handle ranges that contain multiple
regions.

A new btree function, dm_btree_lookup_next(), is introduced which is
more efficiently able to skip over regions of the thin device which
aren't mapped.  __remove_range() uses dm_btree_lookup_next() for each
iteration of __remove_range()'s loop.

Also, improve description of dm_btree_remove_leaves().

Fixes: 6550f075 ("dm thin metadata: add dm_thin_remove_range()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-12-02 13:26:49 -05:00
Mike Snitzer 30ce6e1cc5 dm btree: fix leak of bufio-backed block in btree_split_sibling error path
The block allocated at the start of btree_split_sibling() is never
released if later insert_at() fails.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-02 13:20:34 -05:00
Mike Snitzer 0fcb04d593 dm thin: fix regression in advertised discard limits
When establishing a thin device's discard limits we cannot rely on the
underlying thin-pool device's discard capabilities (which are inherited
from the thin-pool's underlying data device) given that DM thin devices
must provide discard support even when the thin-pool's underlying data
device doesn't support discards.

Users were exposed to this thin device discard limits regression if
their thin-pool's underlying data device does _not_ support discards.
This regression caused all upper-layers that called the
blkdev_issue_discard() interface to not be able to issue discards to
thin devices (because discard_granularity was 0).  This regression
wasn't caught earlier because the device-mapper-test-suite's extensive
'thin-provisioning' discard tests are only ever performed against
thin-pool's with data devices that support discards.

Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
rather than inferring whether or not a thin device's discard support
should be enabled by looking at the thin-pool's discard_granularity.

Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled")
Reported-by: Mike Gerber <mike@sprachgewalt.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-11-23 14:54:46 -05:00
Mikulas Patocka bcbd94ff48 dm crypt: fix a possible hang due to race condition on exit
A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
__add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
It is possible that the processor reorders memory accesses so that
kthread_should_stop() is executed before __set_current_state().  If such
reordering happens, there is a possible race on thread termination:

CPU 0:
calls kthread_should_stop()
	it tests KTHREAD_SHOULD_STOP bit, returns false
CPU 1:
calls kthread_stop(cc->write_thread)
	sets the KTHREAD_SHOULD_STOP bit
	calls wake_up_process on the kernel thread, that sets the thread
	state to TASK_RUNNING
CPU 0:
sets __set_current_state(TASK_INTERRUPTIBLE)
spin_unlock_irq(&cc->write_thread_wait.lock)
schedule() - and the process is stuck and never terminates, because the
	state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already
	terminated

Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to
signal that the kernel thread should exit.  The flag is set and tested
while holding cc->write_thread_wait.lock, so there is no possibility of
racy access to the flag.

Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
following the schedule() call.  When the process was woken up, its state
was already set to TASK_RUNNING.  Other kernel code also doesn't set the
state to TASK_RUNNING following schedule() (for example,
do_wait_for_common in completion.c doesn't do it).

Fixes: dc2676210c ("dm crypt: offload writes to thread")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-19 13:38:30 -05:00
Junichi Nomura 43e43c9ea6 dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
In multipath_prepare_ioctl(),
  - pgpath is a path selected from available paths
  - m->queue_io is true if we cannot send a request immediately to
    paths, either because:
      * there is no available path
      * the path group needs activation (pg_init)
          - pg_init is not started
          - pg_init is still running
  - m->queue_if_no_path is true if the device is configured to queue
    I/O if there are no available paths

If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case.  Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.

You could reproduce the problem like this:

  # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
  # sg_inq /dev/mapper/mp
  <crash>
  [  172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
  [  172.662843] PGD 19dd067 PUD 0
  [  172.666269] Thread overran stack, or stack corrupted
  [  172.671808] Oops: 0000 [#1] SMP
  ...

Fix the condition check with some clarifications.

Fixes: e56f81e0b0 ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-17 14:19:00 -05:00
Mike Snitzer 647a20d5ca dm: do not reuse dm_blk_ioctl block_device input as local variable
(Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for
targets' .prepare_ioctl to fail if they go on to check the bdev for
!NULL.

Fixes: e56f81e0b0 ("dm: refactor ioctl handling")
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-11-17 14:18:49 -05:00
Junichi Nomura 5bbbfdf685 dm: fix ioctl retry termination with signal
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.

However the check of fatal signal has not carried along when commit
6c182cd88d ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.

Easy reproducer of the situation is:

  # dmsetup create mp --table '0 1024 multipath 0 0 0 0'
  # dmsetup message mp 0 'queue_if_no_path'
  # sg_inq /dev/mapper/mp

then you should be able to terminate sg_inq by pressing Ctrl+C.

Fixes: 6c182cd88d ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-11-17 14:04:32 -05:00
Mike Snitzer 172c238612 dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition
A thin-pool that is in out-of-data-space (OODS) mode may transition back
to write mode -- without the admin adding more space to the thin-pool --
if/when blocks are released (either by deleting thin devices or
discarding provisioned blocks).

But as part of the thin-pool's earlier transition to out-of-data-space
mode the thin-pool may have set the 'error_if_no_space' flag to true if
the no_space_timeout expires without more space having been made
available.  That implementation detail, of changing the pool's
error_if_no_space setting, needs to be reset back to the default that
the user specified when the thin-pool's table was loaded.

Otherwise we'll drop the user requested behaviour on the floor when this
out-of-data-space to write mode transition occurs.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
Cc: stable@vger.kernel.org
2015-11-16 09:36:08 -05:00
Linus Torvalds 3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Linus Torvalds 3934bbc044 config fix for md
config dependency needed as md/raid5 now uses crc32c
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWP8lRAAoJEDnsnt1WYoG5AQ0P/REvKfJxP870gS6p5gowMYXN
 1pwOdq9t2MeVkQk0Q5xBOZFGQI2TL2VZjaZdiSEKqaHgd3IOD/aGpl2exLN8nZM4
 mNx+iD3QvEpmSA9mwlCe+V8vTuE0JeTpod9pXk+aQ0DIUx60dSmWh0Lp8ctr33oQ
 inlJ7kXFws6rG0xU0pOaSDM1hI0sC06Nyi2tvRSyZlZbBIMjZWorzJFUQuVX43rD
 7cQJSQ8z2e2x3V7KXYtZf6Kxe+NzEltnq0OAfnDvjz3iw+a9qU6Qg7diAbcwOdyP
 m34/MHJGIYX9GBlTtVo3+j+h3ppaPpLfT4emeYUPTQCkMXg7J9Zbjkwso7OSkvnL
 kovtgIWRVzxFx/k7jhoWzNOsacOhTFz0E4aKDpOeH2cfU7k5H/siIjeIYcqfW3fU
 q8fJXMplHz3XRYp0JhR5DRJZSw87eQnkIRkvSy8wHx8KVoRi7KNm7fv/3Zi7FpaU
 XnbxQa6FrIoqbqeBQ666Rlkn+r2Ftmj50eudAhbj4/PBesRmt6otvr9uCK+b+adh
 ZI748BC2/IOOK34WNktRQCyS2C4b3VLwRVMHReD1xir34rGGcKO04Hci8T7Qb0nP
 uJBAxmE2zue0oE6d+nnrJS3BlEC2ZFKJGzRxKiLCU7nXXoGCznd++IIPRHAkGhZc
 MSMpaS2mqtdTNp4n+9y8
 =dN6h
 -----END PGP SIGNATURE-----

Merge tag 'md/4.4-rc0-fix' of git://neil.brown.name/md

Pull config fix for md from Neil Brown:
 "New config dependency needed as md/raid5 now uses crc32c"

* tag 'md/4.4-rc0-fix' of git://neil.brown.name/md:
  raid5-cache: add crc32c Kconfig dependency
2015-11-10 12:13:00 -08:00
Arnd Bergmann 14f09e2f9b raid5-cache: add crc32c Kconfig dependency
The recent change of the raid5-cache code to use crc32c instead
of crc32 causes link errors when CONFIG_LIBCRC32C is disabled:

drivers/built-in.o: In function crc32c'
core.c:(.text+0x1c6060): undefined reference to `crc32c'

This adds an explicit 'select' statement like all other users
of this function do.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5cb2fbd6ea ("raid5-cache: use crc32c checksum")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-09 09:09:52 +11:00
Linus Torvalds ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Linus Torvalds 75021d2859 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "Trivial stuff from trivial tree that can be trivially summed up as:

   - treewide drop of spurious unlikely() before IS_ERR() from Viresh
     Kumar

   - cosmetic fixes (that don't really affect basic functionality of the
     driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

   - various comment / printk fixes and updates all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  bcache: Really show state of work pending bit
  hwmon: applesmc: fix comment typos
  Kconfig: remove comment about scsi_wait_scan module
  class_find_device: fix reference to argument "match"
  debugfs: document that debugfs_remove*() accepts NULL and error values
  net: Drop unlikely before IS_ERR(_OR_NULL)
  mm: Drop unlikely before IS_ERR(_OR_NULL)
  fs: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
  drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
  UBI: Update comments to reflect UBI_METAONLY flag
  pktcdvd: drop null test before destroy functions
2015-11-07 13:05:44 -08:00
Jens Axboe dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Mel Gorman d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Petr Mladek 8d090f4731 bcache: Really show state of work pending bit
WORK_STRUCT_PENDING is a mask for testing the pending bit.
test_bit() expects the number of the bit and we need to
use WORK_STRUCT_PENDING_BIT there.

Also work_data_bits() is defined in workqueues.h now.

I have noticed this just by chance when looking how
WORK_STRUCT_PENDING_BIT is used. The change is compile
tested.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-11-06 15:06:05 +01:00
Linus Torvalds e0700ce709 - Revert a dm-multipath change that caused a regression for unprivledged
users (e.g. kvm guests) that issued ioctls when a multipath device had
   no available paths.
 
 - Include Christoph's refactoring of DM's ioctl handling and add support
   for passing through persistent reservations with DM multipath.
 
 - All other changes are very simple cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWOp04AAoJEMUj8QotnQNaFLsH/AhMEH/jI1ObOfy4J1Wy4rOx
 ujJT91uS/s0H3pc9cGKQYnuGpFkX6WWU4wMiabIyiTn4sAsoXaflfIGutivLiDJr
 HfecrMrGZgnP4ZlpPPB02BmlxFbcPW8yzAU4ma38xBgQ+Pu30RO/HkvX/2vKOppG
 qwPop/XsNxq3KXgFGM44ToytM6c/MPGluhuvOwbaacAO1HviMuen9qsVjk4kwcf3
 jGYTbEPHATxyu5/6oKDTkQTYhzdwg3B2qHCiKMGw3l1kXhaQLFcaOivOLV8Sf3xh
 bj1070pkGe9OpqaVzMnwDtJ8rnsBl/Nt4wj9oiQPxbX71GYZAmcMIYn9WEkcKFI=
 =AR2D
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:
 "Smaller set of DM changes for this merge.  I've based these changes on
  Jens' for-4.4/reservations branch because the associated DM changes
  required it.

   - Revert a dm-multipath change that caused a regression for
     unprivledged users (e.g. kvm guests) that issued ioctls when a
     multipath device had no available paths.

   - Include Christoph's refactoring of DM's ioctl handling and add
     support for passing through persistent reservations with DM
     multipath.

   - All other changes are very simple cleanups"

* tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm switch: simplify conditional in alloc_region_table()
  dm delay: document that offsets are specified in sectors
  dm delay: capitalize the start of an delay_ctr() error message
  dm delay: Use DM_MAPIO macros instead of open-coded equivalents
  dm linear: remove redundant target name from error messages
  dm persistent data: eliminate unnecessary return values
  dm: eliminate unused "bioset" process for each bio-based DM device
  dm: convert ffs to __ffs
  dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()
  dm: add support for passing through persistent reservations
  dm: refactor ioctl handling
  Revert "dm mpath: fix stalls when handling invalid ioctls"
  dm: initialize non-blk-mq queue data before queue is used
2015-11-04 21:19:53 -08:00
Linus Torvalds ac322de6bf md updates for 4.4.
Two major components to this update.
 
 1/ the clustered-raid1 support from SUSE is nearly
   complete.  There are a few outstanding issues being
   worked on.  Maybe half a dozen patches will bring
   this to a usable state.
 
 2/ The first stage of journalled-raid5 support from
    Facebook makes an appearance.  With a journal
    device configured (typically NVRAM or SSD), the
    "RAID5 write hole" should be closed - a crash
    during degraded operations cannot result in data
    corruption.
 
    The next stage will be to use the journal as a
    write-behind cache so that latency can be reduced
    and in some cases throughput increased by
    performing more full-stripe writes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWNX9RAAoJEDnsnt1WYoG5bYMP/jI0pV3wcbs7mZQAa8S/V0lU
 2l25x4MdwDvqVKMfjIc/C5J08QNgcrgSvhiVPCEOK0w18q395vep9f6gFKbMHhu/
 lWU3PLHGw8XBHp5yEnxrpQkN0pRrNjh5NqIdlVMBNyL6u+RZPS2ZuzxJ8wiNAFg1
 MypNkgoUu6s+nBp4DWWnMGYhBc+szBR+gTYAzGiZ8vqOH9uiSJ2SsGG5aRVUN/af
 oMYvJAf9aA6uj+xSzNlXIaLfWJIrshQYS1jU/W4gTm0DwK9yqbTxvubJaE0SGu/o
 73FGU8tmQ6ELYfsp3D/jmfUkE7weiNEQhdVb/4wy1A/SGc+W7Ju9pxfhm8ra57s0
 /BCkfwWZXEvx1flegXfK1mC6EMpMIcGAD2FQEhmQbW6wTdDwtNyEhIePDVGJwD/F
 rhEThFa+Dg9+xnBGnS6OUK3EpXgml2hAeAC7uA3TVSAnWd/9/Mpim6fZhqrB/v9L
 Ik0tZt+H4nxYaheZjKlKhuXUQYcUWGiMb67bGMem/YAlMa4y9C9qF+9mPXxyjVlI
 hBsd5SfZNz99DyB/bO8BumQeIWlTfzLeFzWW67eQ864LRKO6k0/VIbPZHCfn2oVG
 XvyC2fUhNOIURP3IMxcyHYxOA7Mu6EDsVVDTpuqLVbZQ5IPjDEfQ54yB/BLUvbX/
 Gh2/tKn7Xc25HuLAFEbs
 =TD5o
 -----END PGP SIGNATURE-----

Merge tag 'md/4.4' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Two major components to this update.

   1) The clustered-raid1 support from SUSE is nearly complete.  There
      are a few outstanding issues being worked on.  Maybe half a dozen
      patches will bring this to a usable state.

   2) The first stage of journalled-raid5 support from Facebook makes an
      appearance.  With a journal device configured (typically NVRAM or
      SSD), the "RAID5 write hole" should be closed - a crash during
      degraded operations cannot result in data corruption.

      The next stage will be to use the journal as a write-behind cache
      so that latency can be reduced and in some cases throughput
      increased by performing more full-stripe writes.

* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  MD: set journal disk ->raid_disk
  MD: kick out journal disk if it's not fresh
  raid5-cache: start raid5 readonly if journal is missing
  MD: add new bit to indicate raid array with journal
  raid5-cache: IO error handling
  raid5: journal disk can't be removed
  raid5-cache: add trim support for log
  MD: fix info output for journal disk
  raid5-cache: use bio chaining
  raid5-cache: small log->seq cleanup
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: refactor bio allocation
  raid5-cache: clean up r5l_get_meta
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: free I/O units earlier
  ...
2015-11-04 21:12:47 -08:00
Linus Torvalds 527d1529e3 Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
2015-11-04 20:51:48 -08:00
Linus Torvalds af7eba0158 two more bug fixes for md.
One bugfix for a list corruption in raid5 because of incorrect
 locking.
 
 Other for possible data corruption when a recovering device is failed,
 removed, and re-added.
 
 Both tagged for -stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWNAnCAAoJEDnsnt1WYoG5VzwP/jRfljaxLKUicIGEd9qeaei5
 vVtmzPugthzYTfcJRm54ZufQOnjse/uXEBqmvoMJhBeEMbL2VIWbn1i7sMWjgq8W
 HVz/N/N8iT5DFWgzXmQQCaxIu17njbEmO8IelcNr3i6tE5wbHJ9WhV5UDGdQgwca
 6XjQ8r8QjmN3uVaiNL4JcrEtZfpE8PqgBCJzsQYRZzOp9m3jJlgyJ9YWQA/VQ2Cj
 cVXs7tTGTdpPQdgRRFHF/Z/7H+KUcp9XV5ahDRwqDryQkZHuFvwyZFfpLspqvc5P
 OJio9WY0y93QmRRsuIC/ig0+CnxDXeqHiRwbprIMvKbPxxaGn4s24VioRDa0PRzT
 v9HcjtUhs9q8iTo4TZJrggD+rPm523a9iiDU6SbM2qlM3XUpBis/fjbLN82Nnas+
 ADa0/DAsOE0I1WltoaaNUbweuflGw2NnFxnUueqq8dK23/Vabu3vw4tohiNp+/3m
 km8Is3j7lGKobQ3AKYIFlbeAqdsyASZkSDfA9IZr3SIeYBWgPljd9n+6BIPOqPNq
 S2HtOLke/XX2KEM0BHzgu4XliE9P9+B9lQETI4MehP0rMzTURGKh87du41YHckp1
 beX22aeOuv//ok11JTs6M+StREKeTrl+dTStn0U1jt6HyMzseGaNoy3Eib5ePBCb
 C2ZZRz4OR2MWvWUFxKms
 =QYvv
 -----END PGP SIGNATURE-----

Merge tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md

Pull md bug fixes from Neil Brown:
 "Two more bug fixes for md.

  One bugfix for a list corruption in raid5 because of incorrect
  locking.

  Other for possible data corruption when a recovering device is failed,
  removed, and re-added.

  Both tagged for -stable"

* tag 'md/4.3-rc7-fixes' of git://neil.brown.name/md:
  Revert "md: allow a partially recovered device to be hot-added to an array."
  md/raid5: fix locking in handle_stripe_clean_event()
2015-10-31 21:20:49 -07:00
Song Liu 339421def5 MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
When RAID-4/5/6 array suffers from missing journal device, we put
the array in read only state. We should not allow trasition to
read-write states (clean and active) before replacing journal device.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li f2076e7d06 MD: set journal disk ->raid_disk
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Song Liu a3dfbdaadb MD: kick out journal disk if it's not fresh
When journal disk is faulty and we are reassemabling the raid array, the
journal disk is old. We don't allow the journal disk added to the raid
array. Since journal disk is missing in the array, the raid5 will mark
the array readonly.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 7dde2ad3c5 raid5-cache: start raid5 readonly if journal is missing
If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Song Liu a97b789644 MD: add new bit to indicate raid array with journal
If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 6e74a9cfb5 raid5-cache: IO error handling
There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li c2bb6242ec raid5: journal disk can't be removed
raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places.
Don't allow journal disk disappear magically. On the other hand, we do
need to update superblock for other disks to bump up ->events, so next
time journal disk will be identified as stale.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 4b482044d2 raid5-cache: add trim support for log
Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Shaohua Li 9efdca16e0 MD: fix info output for journal disk
journal disk can be faulty. The Journal and Faulty aren't exclusive with
each other.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:29 +11:00
Christoph Hellwig 6143e2cecb raid5-cache: use bio chaining
Simplify the bio completion handler by using bio chaining and submitting
bios as soon as they are full.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig 2b8ef16ec4 raid5-cache: small log->seq cleanup
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig c1b9919849 raid5-cache: new helper: r5_reserve_log_entry
Factor out code to reserve log space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig 51039cd066 raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
This is the only user, and keeping all code initializing the io_unit
structure together improves readbility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig 1e932a37cc raid5-cache: take rdev->data_offset into account early on
Set up bi_sector properly when we allocate an bio instead of updating it
at submission time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig b349feb36c raid5-cache: refactor bio allocation
Split out a helper to allocate a bio for log writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig 22581f58ed raid5-cache: clean up r5l_get_meta
Remove the only partially used local 'io' variable to simplify the code
flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig 56fef7c6e0 raid5-cache: simplify state machine when caches flushes are not needed
For devices without a volatile write cache we don't need to send a FLUSH
command to ensure writes are stable on disk, and thus can avoid the whole
step of batching up bios for processing by the MD thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:28 +11:00
Christoph Hellwig d8858f4321 raid5-cache: factor out a helper to run all stripes for an I/O unit
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Christoph Hellwig 04732f741d raid5-cache: rename flushed_ios to finished_ios
After this series we won't nessecarily have flushed the cache for these
I/Os, so give the list a more neutral name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Christoph Hellwig 170364619a raid5-cache: free I/O units earlier
There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array.  The only information
we need is the log sequence number, and the checkpoint offset of the
highest successfull writeback.  Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.

Besides simplifying the code this also avoid having to keep the allocation
for the I/O unit around for a potentially long time as superblock updates
that checkpoint the log do not happen very often.

This also fixes the previously incorrect calculation of 'free' in
r5l_do_reclaim as a side effect: previous if took the last unit which
isn't checkpointed into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li e6c033f79a raid5-cache: move reclaim stop to quiesce
Move reclaim stop to quiesce handling, where is safer for this stuff.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li ac6096e9d5 md: show journal for journal disk in disk state sysfs
Journal disk state sysfs entry should indicate it's journal

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Song Liu 0b020e85bd skip match_mddev_units check for special roles
match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not paticipate in resync.

In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li 253f9fd41a raid5-cache: don't delay stripe captured in log
There is a case a stripe gets delayed forever.
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set
since construction can't run for new bio (the stripe is locked since
step 1)

Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked and the stripe will restart and run construction
for the new bio. With log, ops_run_io need to run two times. If the
DELAYED bit set, the stripe can't enter into the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:27 +11:00
Shaohua Li 85f2f9a4f4 raid5-cache: check stripe finish out of order
stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave
stripe_end_ios list empty.

This applies on top of http://marc.info/?l=linux-raid&m=144122700510667

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li bd18f6462f md: skip resync for raid array with journal
If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 828cbe989e raid5-cache: optimize FLUSH IO with log enabled
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Christoph Hellwig 509ffec708 raid5-cache: move functionality out of __r5l_set_io_unit_state
Just keep __r5l_set_io_unit_state as a small set the state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 0fd22b45b2 raid5-cache: fix a user-after-free bug
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption only reclaimer can free io_unit. We can add a reference count
based io_unit free, but since only reclaim can wait io_unit becoming to
STRIPE_END state, we use a simple global wait queue here.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li a8c34f9159 raid5-cache: switching to state machine for log disk cache flush
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 5c7e81c3de raid5: enable log for raid array with cache disk
Now log is safe to enable for raid array with cache disk

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 713cf5a639 raid5: don't allow resize/reshape with cache(log) support
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 9c3e333d3f raid5: disable batch with log enabled
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-11-01 13:48:26 +11:00
Shaohua Li 5cb2fbd6ea raid5-cache: use crc32c checksum
crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in first post, sorry. This changes disk format, but we are still
ok in current stage.

V2: delete unnecessary type conversion as pointed out by Bart

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2015-11-01 13:45:39 +11:00
Tomohiro Kusumi aad9ae4550 dm switch: simplify conditional in alloc_region_table()
The variable sctx->nr_regions has type unsigned long and the variable
nr_regions has type sector_t.

Thus the variables may be different when overflow happens.
Changed the conditional to "if (nr_regions >= ULONG_MAX)".
Also move the assignment of nr_regions after sector_div()
and the sanity check which looks more sane.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:06 -04:00
Tomohiro Kusumi f49e869a61 dm delay: document that offsets are specified in sectors
Only delay params are mentioned in delay.txt.
Mention offsets just like documents for linear and flakey do.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:05 -04:00
Tomohiro Kusumi e213f33e4d dm delay: capitalize the start of an delay_ctr() error message
All other error messages start capitalized.

Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:04 -04:00