1
0
Fork 0
Commit Graph

561 Commits (63a4cc24867de73626e16767ce616c50dc5438d3)

Author SHA1 Message Date
Mike Christie 469e3216e2 block discard: use bio set op accessor
This converts the block issue discard helper and users to use
the bio_set_op_attrs accessor and only pass in the operation flags
like REQ_SEQURE.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie f21508211d block: add REQ_OP definitions and helpers
The following patches separate the operation (WRITE, READ, DISCARD,
etc) from the rq_flag_bits flags. This patch adds definitions for
request/bio operations (REQ_OPs) and adds request/bio accessors to
get/set the op.

In this patch the REQ_OPs match the REQ rq_flag_bits ones
for compat reasons while all the code is converted to use the
op accessors in the set. In the last patches the op will become a
number and the accessors and helpers in this patch will be dropped
or updated.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Linus Torvalds 315227f6da DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on
   any device. This enables the use of DAX in the presence of these
   errors by making all sector-aligned zeroing go through the driver.
 - The driver (already) has the ability to clear errors on writes that
   are sent through the block layer using 'DSMs' defined in ACPI 6.1.
 
 Other misc changes:
 
 - When mounting DAX filesystems, check to make sure the partition
   is page aligned. This is a requirement for DAX, and previously, we
   allowed such unaligned mounts to succeed, but subsequent reads/writes
   would fail.
 
 - Misc/cleanup fixes from Jan that remove unused code from DAX related to
   zeroing, writeback, and some size checks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
 GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
 lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
 aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
 vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
 9R6JdrAj0Sc+FBa+cGzH
 =v6Of
 -----END PGP SIGNATURE-----

Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Dan Williams 0a70bd4305 dax: enable dax in the presence of known media errors (badblocks)
1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 12:16:56 -06:00
Linus Torvalds 24b9f0cf00 Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "On top of the core pull request, this is the drivers pull request for
  this merge window.  This contains:

   - Switch drivers to the new write back cache API, and kill off the
     flush flags.  From me.

   - Kill the discard support for the STEC pci-e flash driver.  It's
     trivially broken, and apparently unmaintained, so it's safer to
     just remove it.  From Jeff Moyer.

   - A set of lightnvm updates from the usual suspects (Matias/Javier,
     and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
     Tao.

   - A set of updates for NVMe:

        - Turn the controller state management into a proper state
          machine.  From Christoph.

        - Shuffling of code in preparation for NVMe-over-fabrics, also
          from Christoph.

        - Cleanup of the command prep part from Ming Lin.

        - Rewrite of the discard support from Ming Lin.

        - Deadlock fix for namespace removal from Ming Lin.

        - Use the now exported blk-mq tag helper for IO termination.
          From Sagi.

        - Various little fixes from Christoph, Guilherme, Keith, Ming
          Lin, Wang Sheng-Hui.

   - Convert mtip32xx to use the now exported blk-mq tag iter function,
     from Keith"

* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
  lightnvm: reserved space calculation incorrect
  lightnvm: rename nr_pages to nr_ppas on nvm_rq
  lightnvm: add is_cached entry to struct ppa_addr
  lightnvm: expose gennvm_mark_blk to targets
  lightnvm: remove mgt targets on mgt removal
  lightnvm: pass dma address to hardware rather than pointer
  lightnvm: do not assume sequential lun alloc.
  nvme/lightnvm: Log using the ctrl named device
  lightnvm: rename dma helper functions
  lightnvm: enable metadata to be sent to device
  lightnvm: do not free unused metadata on rrpc
  lightnvm: fix out of bound ppa lun id on bb tbl
  lightnvm: refactor set_bb_tbl for accepting ppa list
  lightnvm: move responsibility for bad blk mgmt to target
  lightnvm: make nvm_set_rqd_ppalist() aware of vblks
  lightnvm: remove struct factory_blks
  lightnvm: refactor device ops->get_bb_tbl()
  lightnvm: introduce nvm_for_each_lun_ppa() macro
  lightnvm: refactor dev->online_target to global nvm_targets
  lightnvm: rename nvm_targets to nvm_tgt_type
  ...
2016-05-17 16:03:32 -07:00
Toshi Kani a8078b1fc6 block: Update blkdev_dax_capable() for consistency
blkdev_dax_capable() is similar to bdev_dax_supported(), but needs
to remain as a separate interface for checking dax capability of
a raw block device.

Rename and relocate blkdev_dax_capable() to keep them maintained
consistently, and call bdev_direct_access() for the dax capability
check.

There is no change in the behavior.

Link: https://lkml.org/lkml/2016/5/9/950
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:13 -06:00
Toshi Kani 2d96afc8f7 block: Add bdev_dax_supported() for dax mount checks
DAX imposes additional requirements to a device.  Add
bdev_dax_supported() which performs all the precondition checks
necessary for filesystem to mount the device with dax option.

Also add a new check to verify if a partition is aligned by 4KB.
When a partition is unaligned, any dax read/write access fails,
except for metadata update.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:11 -06:00
Toshi Kani 2af3a8159c block: Add vfs_msg() interface
In preparation of moving DAX capability checks to the block layer
from filesystem code, add a VFS message interface that aligns with
filesystem's message format.

For instance, a vfs_msg() message followed by XFS messages in case
of a dax mount error may look like:

  VFS (pmem0p1): error: unaligned partition for dax
  XFS (pmem0p1): DAX unsupported by block device. Turning off DAX.
  XFS (pmem0p1): Mounting V5 Filesystem
   :

vfs_msg() is largely based on ext4_msg().

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@fb.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:10 -06:00
Christoph Hellwig 38f2525533 block: add __blkdev_issue_discard
This is a version of blkdev_issue_discard which doesn't wait for
the I/O to complete, but instead allows the caller to submit
the final bio and/or chain it to others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagig@grimberg.me>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-02 09:19:46 -06:00
Jens Axboe c888a8f95a block: kill off q->flush_flags
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-13 13:33:19 -06:00
Jens Axboe 2245f6de6c block: kill blk_queue_flush()
We don't have any drivers left using it, so kill it off. Update
documentation to use the newer blk_queue_write_cache().

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 16:00:39 -06:00
Jens Axboe 93e9d8e836 block: add ability to flag write back caching on a device
Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.

This will replace the (awkward) blk_queue_flush() interface that
drivers currently use to inform the block layer of write cache state
and capabilities.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-04-12 15:46:27 -06:00
Ming Lin 37e58237a1 block: add offset in blk_add_request_payload()
We could kmalloc() the payload, so need the offset in page.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-12 13:13:23 -06:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Linus Torvalds 3c2de27d79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - Preparations of parallel lookups (the remaining main obstacle is the
   need to move security_d_instantiate(); once that becomes safe, the
   rest will be a matter of rather short series local to fs/*.c

 - preadv2/pwritev2 series from Christoph

 - assorted fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  splice: handle zero nr_pages in splice_to_pipe()
  vfs: show_vfsstat: do not ignore errors from show_devname method
  dcache.c: new helper: __d_add()
  don't bother with __d_instantiate(dentry, NULL)
  untangle fsnotify_d_instantiate() a bit
  uninline d_add()
  replace d_add_unique() with saner primitive
  quota: use lookup_one_len_unlocked()
  cifs_get_root(): use lookup_one_len_unlocked()
  nfs_lookup: don't bother with d_instantiate(dentry, NULL)
  kill dentry_unhash()
  ceph_fill_trace(): don't bother with d_instantiate(dn, NULL)
  autofs4: don't bother with d_instantiate(dentry, NULL) in ->lookup()
  configfs: move d_rehash() into configfs_create() for regular files
  ceph: don't bother with d_rehash() in splice_dentry()
  namei: teach lookup_slow() to skip revalidate
  namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()
  lookup_one_len_unlocked(): use lookup_dcache()
  namei: simplify invalidation logics in lookup_dcache()
  namei: change calling conventions for lookup_{fast,slow} and follow_managed()
  ...
2016-03-19 18:52:29 -07:00
Linus Torvalds fcab86add7 Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:

 - ahci grew runtime power management support so that the controller can
   be turned off if no devices are attached.

 - sata_via isn't dead yet.  It got hotplug support and more refined
   workaround for certain WD drives.

 - Misc cleanups.  There's a merge from for-4.5-fixes to avoid confusing
   conflicts in ahci PCI ID table.

* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ata: ahci_xgene: dereferencing uninitialized pointer in probe
  AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
  ata: sata_rcar: Use ARCH_RENESAS
  sata_via: Implement hotplug for VT6421
  sata_via: Apply WD workaround only when needed on VT6421
  ahci: Add runtime PM support for the host controller
  ahci: Add functions to manage runtime PM of AHCI ports
  ahci: Convert driver to use modern PM hooks
  ahci: Cache host controller version
  scsi: Drop runtime PM usage count after host is added
  scsi: Set request queue runtime PM status back to active on resume
  block: Add blk_set_runtime_active()
  ata: ahci_mvebu: add support for Armada 3700 variant
  libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
  libata: support AHCI on OCTEON platform
2016-03-18 20:06:46 -07:00
Al Viro 8b23a8ce10 Merge branches 'work.lookups', 'work.misc' and 'work.preadv2' into for-next 2016-03-18 16:07:38 -04:00
Christoph Hellwig 8e0b60b96b blk-mq: enable polling support by default
Now that applications need to explicitly ask for polling we can enable it
by default in blk-mq drivers.  Note that this will only have an affect
on driver that supply a poll function, which currently only includes nvme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-04 12:20:10 -05:00
Christoph Hellwig f21018427c block: fix blk_rq_get_max_sectors for driver private requests
Driver private request types should not get the artifical cap for the
FS requests.  This is important to use the full device capabilities
for internal command or NVMe pass through commands.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jeff Lien <Jeff.Lien@hgst.com>
Tested-by: Jeff Lien <Jeff.Lien@hgst.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>

Updated by me to use an explicit check for the one command type that
does support extended checking, instead of relying on the ordering
of the enum command values - as suggested by Keith.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:43:45 -07:00
Ming Lei 25e71a99f1 block: get the 1st and last bvec via helpers
This patch applies the two introduced helpers to
figure out the 1st and last bvec, and fixes the
original way after bio splitting.

Cc: stable@vger.kernel.org
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Ming Lei e0af29171a block: check virt boundary in bio_will_gap()
In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:42:49 -07:00
Mika Westerberg d07ab6d114 block: Add blk_set_runtime_active()
If block device is left runtime suspended during system suspend, resume
hook of the driver typically corrects runtime PM status of the device back
to "active" after it is resumed. However, this is not enough as queue's
runtime PM status is still "suspended". As long as it is in this state
blk_pm_peek_request() returns NULL and thus prevents new requests to be
processed.

Add new function blk_set_runtime_active() that can be used to force the
queue status back to "active" as needed.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-19 10:52:45 -05:00
James Bottomley 12ffbbe94d Merge remote-tracking branch 'mkp-scsi/4.5/scsi-fixes' into fixes 2016-02-04 21:37:52 -08:00
Martin K. Petersen 0fb5b1fb30 block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled
When a storage device rejects a WRITE SAME command we will disable write
same functionality for the device and return -EREMOTEIO to the block
layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
failing the path.

Yiwen Jiang discovered a small race where WRITE SAME requests issued
simultaneously would cause -EIO to be returned. This happened because
any requests being prepared after WRITE SAME had been disabled for the
device caused us to return BLKPREP_KILL. The latter caused the block
layer to return -EIO upon completion.

To overcome this we introduce BLKPREP_INVALID which indicates that this
is an invalid request for the device. blk_peek_request() is modified to
return -EREMOTEIO in that case.

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Suggested-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Reviewed-by: Ewan Milne <emilne@redhat.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-02-04 22:42:58 -05:00
Linus Torvalds 3e1e21c7bf Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-block
Pull NVMe updates from Jens Axboe:
 "Last branch for this series is the nvme changes.  It's in a separate
  branch to avoid splitting too much between core and NVMe changes,
  since NVMe is still helping drive some blk-mq changes.  That said, not
  a huge amount of core changes in here.  The grunt of the work is the
  continued split of the code"

* 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
  uapi: update install list after nvme.h rename
  NVMe: Export NVMe attributes to sysfs group
  NVMe: Shutdown controller only for power-off
  NVMe: IO queue deletion re-write
  NVMe: Remove queue freezing on resets
  NVMe: Use a retryable error code on reset
  NVMe: Fix admin queue ring wrap
  nvme: make SG_IO support optional
  nvme: fixes for NVME_IOCTL_IO_CMD on the char device
  nvme: synchronize access to ctrl->namespaces
  nvme: Move nvme_freeze/unfreeze_queues to nvme core
  PCI/AER: include header file
  NVMe: Export namespace attributes to sysfs
  NVMe: Add pci error handlers
  block: remove REQ_NO_TIMEOUT flag
  nvme: merge iod and cmd_info
  nvme: meta_sg doesn't have to be an array
  nvme: properly free resources for cancelled command
  nvme: simplify completion handling
  nvme: special case AEN requests
  ...
2016-01-21 19:58:02 -08:00
Linus Torvalds 7c24d9f3b2 Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "We don't have a lot of core changes this time around, it's mostly in
  drivers, which will come in a subsequent pull.

  The cores changes include:

   - blk-mq
        - Prep patch from Christoph, changing blk_mq_alloc_request() to
          take flags instead of just using gfp_t for sleep/nosleep.
        - Doc patch from me, clarifying the difference between legacy
          and blk-mq for timer usage.
        - Fixes from Raghavendra for memory-less numa nodes, and a reuse
          of CPU masks.

   - Cleanup from Geliang Tang, using offset_in_page() instead of open
     coding it.

   - From Ilya, rename request_queue slab to it reflects what it holds,
     and a fix for proper use of bdgrab/put.

   - A real fix for the split across stripe boundaries from Keith.  We
     yanked a broken version of this from 4.4-rc final, this one works.

   - From Mike Krinkin, emit a trace message when we split.

   - From Wei Tang, two small cleanups, not explicitly clearing memory
     that is already cleared"

* 'for-4.5/core' of git://git.kernel.dk/linux-block:
  block: use bd{grab,put}() instead of open-coding
  block: split bios to max possible length
  block: add call to split trace point
  blk-mq: Avoid memoryless numa node encoded in hctx numa_node
  blk-mq: Reuse hardware context cpumask for tags
  blk-mq: add a flags parameter to blk_mq_alloc_request
  Revert "blk-flush: Queue through IO scheduler when flush not required"
  block: clarify blk_add_timer() use case for blk-mq
  bio: use offset_in_page macro
  block: do not initialise statics to 0 or NULL
  block: do not initialise globals to 0 or NULL
  block: rename request_queue slab cache
2016-01-19 15:03:34 -08:00
Dan Williams 34c0fd540e mm, dax, pmem: introduce pfn_t
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams b2e0d1625e dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled.  Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.

This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().

[willy@linux.intel.com: fix read() of a hole]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Jens Axboe 21491412f2 block: add blk_start_queue_async()
We currently only have an inline/sync helper to restart a stopped
queue. If drivers need an async version, they have to roll their
own. Add a generic helper instead.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-28 13:07:07 -07:00
Christoph Hellwig 287922eb0b block: defer timeouts to a workqueue
Timer context is not very useful for drivers to perform any meaningful abort
action from.  So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals.  But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Bush:

"This patch removes synchronizing the timeout work so that the timer can
 start a freeze on its own queue. The timer enters the queue, so timer
 context can only start a freeze, but not wait for frozen."

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 09:38:16 -07:00
Linus Torvalds 19190f5ea9 SCSI fixes on 20151205
This is quite a bumper crop of fixes: Three from Arnd correcting various build
 issues in some configurations, a lock recursion in qla2xxx.  Two potentially
 exploitable issues in hpsa and mvsas, a potential null deref in st, A revert
 of a bdi registration fix that turned out to cause even more problems, a set
 of fixes to allow people who only defined MPT2SAS to still work after the
 mpt2/mpt3sas merger and a couple of fixes for issues turned up by the hyper-v
 storvsc driver.
 
 Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJWY8fDAAoJEDeqqVYsXL0MPhwH/Rs05LUU2vnlaZDdDpH5zY56
 YQgKh5duF+ZH+Y4NxX5kkLLo05wpE6xD5xp2yzzmnjTA0Uf/yLVHNdb5D6tRZgSo
 mZjAX+/wGDb/ErwvwTk/K2mhEvB0iZJJVMyWcG3F9dKgciRCF/p1Gn5EarGmc+vM
 w/9xrs1j24Pw7ipHgBj9zU13w+SPMI7LunR0oYL9CJg24jgXG9sAbrwLkox5kHLo
 FFBCrZhev1mzFKa1C+Ln3s0iSf0yEQMd4khzPJAUElkw812PZ7I6r4bCP0ZPKDed
 JR8zex9jo77RyWZwA7fIathA0/ujv0AeIRXgvzb0/io1Yk577r98vt+S3koQVK8=
 =ptgb
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is quite a bumper crop of fixes: three from Arnd correcting
  various build issues in some configurations, a lock recursion in
  qla2xxx.  Two potentially exploitable issues in hpsa and mvsas, a
  potential null deref in st, a revert of a bdi registration fix that
  turned out to cause even more problems, a set of fixes to allow people
  who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
  and a couple of fixes for issues turned up by the hyper-v storvsc
  driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
  Revert "scsi: Fix a bdi reregistration race"
  mpt3sas: Add dummy Kconfig option for backwards compatibility
  Fix a memory leak in scsi_host_dev_release()
  block/sd: Fix device-imposed transfer length limits
  scsi_debug: fix prevent_allow+verify regressions
  MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
  sd: Make discard granularity match logical block size when LBPRZ=1
  scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
  scsi: advansys needs ISA dma api for ISA support
  scsi_sysfs: protect against double execution of __scsi_remove_device()
  st: fix potential null pointer dereference.
  scsi: report 'INQUIRY result too short' once per host
  advansys: fix big-endian builds
  qla2xxx: Fix rwlock recursion
  hpsa: logical vs bitwise AND typo
  mvsas: don't allow negative timeouts
  mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag
2015-12-06 08:02:25 -08:00
James Bottomley be9e2f775f Merge branch 'mkp-fixes' into fixes 2015-12-03 09:32:33 -08:00
Christoph Hellwig 6f3b0e8bcf blk-mq: add a flags parameter to blk_mq_alloc_request
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-01 10:53:59 -07:00
Hannes Reinecke bf4e6b4e75 block: Always check queue limits for cloned requests
When a cloned request is retried on other queues it always needs
to be checked against the queue limits of that queue.
Otherwise the calculations for nr_phys_segments might be wrong,
leading to a crash in scsi_init_sgtable().

To clarify this the patch renames blk_rq_check_limits()
to blk_cloned_rq_check_limits() and removes the symbol
export, as the new function should only be used for
cloned requests and never exported.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Fixes: e2a60da74 ("block: Clean up special command handling logic")
Cc: stable@vger.kernel.org # 3.7+
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-29 14:37:27 -07:00
Martin K. Petersen ca369d51b3 block/sd: Fix device-imposed transfer length limits
Commit 4f258a4634 ("sd: Fix maximum I/O size for BLOCK_PC requests")
had the unfortunate side-effect of removing an implicit clamp to
BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
code. This caused problems for some SMR drives.

Debugging this issue revealed a few problems with the existing
infrastructure since the block layer didn't know how to deal with
device-imposed limits, only limits set by the I/O controller.

 - Introduce a new queue limit, max_dev_sectors, which is used by the
   ULD to signal the maximum sectors for a REQ_TYPE_FS request.

 - Ensure that max_dev_sectors is correctly stacked and taken into
   account when overriding max_sectors through sysfs.

 - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
   values for later processing.

 - In sd_revalidate() set the queue's max_dev_sectors based on the
   MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
   is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
   field size.

 - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
   VPD--if reported and sane--to signal the preferred device transfer
   size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

 - blk_limits_max_hw_sectors() is no longer used and can be removed.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: sweeneygj@gmx.com
Tested-by: Arzeets <anatol.pomozov@gmail.com>
Tested-by: David Eisner <david.eisner@oriel.oxon.org>
Tested-by: Mario Kicherer <dev@kicherer.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2015-11-25 21:38:58 -05:00
Dan Williams 2e6edc9538 block: protect rw_page against device teardown
Fix use after free crashes like the following:

 general protection fault: 0000 [#1] SMP
 Call Trace:
  [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
  [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem]
  [<ffffffff8128fd90>] bdev_read_page+0x50/0x60
  [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50
  [<ffffffff81297657>] mpage_readpages+0x107/0x170
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20
  [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20
  [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310
  [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310
  [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0
  [<ffffffff811c76f6>] filemap_fault+0x396/0x530
  [<ffffffff811f816e>] __do_fault+0x4e/0xf0
  [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50

Cc: <stable@vger.kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
[willy: symmetry fixups]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-19 13:47:10 -08:00
Jens Axboe 05229beedd block: add block polling support
Add basic support for polling for specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.

This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:47 -07:00
Jens Axboe dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Linus Torvalds ccf21b69a8 Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-block
Pull block reservation support from Jens Axboe:
 "This adds support for persistent reservations, both at the core level,
  as well as for sd and NVMe"

[ Background from the docs: "Persistent Reservations allow restricting
  access to block devices to specific initiators in a shared storage
  setup.  All implementations are expected to ensure the reservations
  survive a power loss and cover all connections in a multi path
  environment" ]

* 'for-4.4/reservations' of git://git.kernel.dk/linux-block:
  NVMe: Precedence error in nvme_pr_clear()
  nvme: add missing endianess annotations in nvme_pr_command
  NVMe: Add persistent reservation ops
  sd: implement the Persistent Reservation API
  block: add an API for Persistent Reservations
  block: cleanup blkdev_ioctl
2015-11-04 21:01:27 -08:00
Christoph Hellwig bbd3e06436 block: add an API for Persistent Reservations
This commits adds a driver API and ioctls for controlling Persistent
Reservations s/genericly/generically/ at the block layer.  Persistent
Reservations are supported by SCSI and NVMe and allow controlling who gets
access to a device in a shared storage setup.

Note that we add a pr_ops structure to struct block_device_operations
instead of adding the members directly to avoid bloating all instances
of devices that will never support Persistent Reservations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:46:56 -06:00
Dan Williams ac6fc48c9f block: move blk_integrity to request_queue
A trace like the following proceeds a crash in bio_integrity_process()
when it goes to use an already freed blk_integrity profile.

 BUG: unable to handle kernel paging request at ffff8800d31b10d8
 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8
 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3
 Oops: 0011 [#1] SMP
 Dumping ftrace buffer:
 ---------------------------------
    ndctl-2222    2.... 44526245us : disk_release: pmem1s
 systemd--2223    4.... 44573945us : bio_integrity_endio: pmem1s
    <...>-409     4.... 44574005us : bio_integrity_process: pmem1s
 ---------------------------------
[..]
  Call Trace:
  [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0
  [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60
  [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0

Given that a request_queue is pinned while i/o is in flight and that a
gendisk is allowed to have a shorter lifetime, move blk_integrity to
request_queue to satisfy requests arriving after the gendisk has been
torn down.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
[martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:42 -06:00
Dan Williams 3ef28e83ab block: generic request_queue reference counting
Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:41 -06:00
Martin K. Petersen 25520d55cd block: Inline blk_integrity in struct gendisk
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:42 -06:00
Martin K. Petersen a48f041d91 block: Reduce the size of struct blk_integrity
The per-device properties in the blk_integrity structure were previously
unsigned short. However, most of the values fit inside a char. The only
exception is the data interval size and we can work around that by
storing it as a power of two.

This cuts the size of the dynamic portion of blk_integrity in half.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:39 -06:00
Martin K. Petersen 0f8087ecde block: Consolidate static integrity profile properties
We previously made a complete copy of a device's data integrity profile
even though several of the fields inside the blk_integrity struct are
pointers to fixed template entries in t10-pi.c.

Split the static and per-device portions so that we can reference the
template directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:38 -06:00
Martin K. Petersen aff34e192e block: Move integrity kobject to struct gendisk
The integrity kobject purely exists to support the integrity
subdirectory in sysfs and doesn't really have anything to do with the
blk_integrity data structure. Move the kobject to struct gendisk where
it belongs.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:36 -06:00
Akinobu Mita 4593fdbe7a blk-mq: fix sysfs registration/unregistration race
There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
    --> blk_mq_init_queue
        --> blk_mq_init_allocated_queue
            --> add to 'all_q_list' (*)
    --> add_disk
        --> blk_register_queue
            --> blk_mq_register_disk (++)

null_del_dev
    --> del_gendisk
        --> blk_unregister_queue
            --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
        --> blk_mq_free_queue
            --> del from 'all_q_list' (*)

blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

While the request queue is added to 'all_q_list' (*),
blk_mq_queue_reinit() can be called for the queue anytime by CPU
hotplug callback.  But blk_mq_sysfs_unregister (-) and
blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
is finished.  Because '/sys/block/*/mq/' is not exists.

There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
be used to track these sysfs stuff, but it is only fixing this issue
partially.

In order to fix it completely, we just need per-queue flag instead of
per-hctx flag with appropriate locking.  So this introduces
q->mq_sysfs_init_done which is properly protected with all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex is held.  Since hctx->nr_ctx is reset temporarily and
updated in blk_mq_map_swqueue(), so we should avoid
blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
in CPU hotplug handling or adding/deleting gendisk .

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:45 -06:00
Linus Torvalds 133bb59585 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "This is a bit bigger than it should be, but I could (did) not want to
  send it off last week due to both wanting extra testing, and expecting
  a fix for the bounce regression as well.  In any case, this contains:

   - Fix for the blk-merge.c compilation warning on gcc 5.x from me.

   - A set of back/front SG gap merge fixes, from me and from Sagi.
     This ensures that we honor SG gapping for integrity payloads as
     well.

   - Two small fixes for null_blk from Matias, fixing a leak and a
     capacity propagation issue.

   - A blkcg fix from Tejun, fixing a NULL dereference.

   - A fast clone optimization from Ming, fixing a performance
     regression since the arbitrarily sized bio's were introduced.

   - Also from Ming, a regression fix for bouncing IOs"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix bounce_end_io
  block: blk-merge: fast-clone bio when splitting rw bios
  block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
  block: Copy a user iovec if it includes gaps
  block: Refuse adding appending a gapped integrity page to a bio
  block: Refuse request/bio merges with gaps in the integrity payload
  block: Check for gaps on front and back merges
  null_blk: fix wrong capacity when bs is not 512 bytes
  null_blk: fix memory leak on cleanup
  block: fix bogus compiler warnings in blk-merge.c
2015-09-19 18:57:09 -07:00
Linus Torvalds 10fbd36e36 blk: rq_data_dir() should not return a boolean
rq_data_dir() returns either READ or WRITE (0 == READ, 1 == WRITE), not
a boolean value.

Now, admittedly the "!= 0" doesn't really change the value (0 stays as
zero, 1 stays as one), but it's not only redundant, it confuses gcc, and
causes gcc to warn about the construct

    switch (rq_data_dir(req)) {
        case READ:
            ...
        case WRITE:
            ...

that we have in a few drivers.

Now, the gcc warning is silly and stupid (it seems to warn not about the
switch value having a different type from the case statements, but about
_any_ boolean switch value), but in this case the code itself is silly
and stupid too, so let's just change it, and get rid of warnings like
this:

  drivers/block/hd.c: In function ‘hd_request’:
  drivers/block/hd.c:630:11: warning: switch condition has boolean value [-Wswitch-bool]
     switch (rq_data_dir(req)) {

The odd '!= 0' came in when "cmd_flags" got turned into a "u64" in
commit 5953316dbf ("block: make rq->cmd_flags be 64-bit") and is
presumably because the old code (that just did a logical 'and' with 1)
would then end up making the type of rq_data_dir() be u64 too.

But if we want to retain the old regular integer type, let's just cast
the result to 'int' rather than use that rather odd '!= 0'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-12 12:03:30 -07:00
Sagi Grimberg 7f39add3b0 block: Refuse request/bio merges with gaps in the integrity payload
If a driver sets the block queue virtual boundary mask, it means that
it cannot handle gaps so we must not allow those in the integrity
payload as well.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>

Fixed up by me to have duplicate integrity merge functions, depending
on whether block integrity is enabled or not. Fixes a compilations
issue with CONFIG_BLK_DEV_INTEGRITY unset.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-11 09:03:04 -06:00