1
0
Fork 0
Commit Graph

2323 Commits (f70ced09170761acb69840cafaace4abc72cba4b)

Author SHA1 Message Date
Linus Torvalds d429a3639c Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:
 "Nothing out of the ordinary here, this pull request contains:

   - A big round of fixes for bcache from Kent Overstreet, Slava Pestov,
     and Surbhi Palande.  No new features, just a lot of fixes.

   - The usual round of drbd updates from Andreas Gruenbacher, Lars
     Ellenberg, and Philipp Reisner.

   - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei
     has taken it one step further and added support for actually using
     more than one queue.

   - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to
     compliment the the default behavior of adding to the tail of the
     queue.  From Douglas Gilbert"

* 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits)
  bcache: Drop unneeded blk_sync_queue() calls
  bcache: add mutex lock for bch_is_open
  bcache: Correct printing of btree_gc_max_duration_ms
  bcache: try to set b->parent properly
  bcache: fix memory corruption in init error path
  bcache: fix crash with incomplete cache set
  bcache: Fix more early shutdown bugs
  bcache: fix use-after-free in btree_gc_coalesce()
  bcache: Fix an infinite loop in journal replay
  bcache: fix crash in bcache_btree_node_alloc_fail tracepoint
  bcache: bcache_write tracepoint was crashing
  bcache: fix typo in bch_bkey_equal_header
  bcache: Allocate bounce buffers with GFP_NOWAIT
  bcache: Make sure to pass GFP_WAIT to mempool_alloc()
  bcache: fix uninterruptible sleep in writeback thread
  bcache: wait for buckets when allocating new btree root
  bcache: fix crash on shutdown in passthrough mode
  bcache: fix lockdep warnings on shutdown
  bcache allocator: send discards with correct size
  bcache: Fix to remove the rcu_sched stalls.
  ...
2014-08-14 09:10:21 -06:00
Linus Torvalds 4a319a490c Merge branch 'for-3.17/core' of git://git.kernel.dk/linux-block
Pull block core bits from Jens Axboe:
 "Small round this time, after the massive blk-mq dump for 3.16.  This
  pull request contains:

   - Fixes for max_sectors overflow in ioctls from Akinoby Mita.

   - Partition off-by-one bug fix in aix partitions from Dan Carpenter.

   - Various small partition cleanups from Fabian Frederick.

   - Fix for the block integrity code sometimes returning the wrong
     vector count from Gu Zheng.

   - Cleanup an re-org of the blk-mq queue enter/exit percpu counters
     from Tejun.  Dependent on the percpu pull for 3.17 (which was in
     the block tree too), that you have already pulled in.

   - A blkcg oops fix, also from Tejun"

* 'for-3.17/core' of git://git.kernel.dk/linux-block:
  partitions: aix.c: off by one bug
  blkcg: don't call into policy draining if root_blkg is already gone
  Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
  bio: modify __bio_add_page() to accept pages that don't start a new segment
  block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
  block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
  block/partitions/efi.c: kerneldoc fixing
  block/partitions/msdos.c: code clean-up
  block/partitions/amiga.c: replace nolevel printk by pr_err
  block/partitions/aix.c: replace count*size kzalloc by kcalloc
  bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
  blk-mq: use percpu_ref for mq usage count
  blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
  blk-mq: decouble blk-mq freezing from generic bypassing
  block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
  blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
2014-08-14 09:07:02 -06:00
Dan Carpenter d97a86c170 partitions: aix.c: off by one bug
The lvip[] array has "state->limit" elements so the condition here
should be >= instead of >.

Fixes: 6ceea22bbb ('partitions: add aix lvm partition support files')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-08-05 13:13:24 -06:00
Linus Torvalds 47dfe4037e Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Mostly changes to get the v2 interface ready.  The core features are
  mostly ready now and I think it's reasonable to expect to drop the
  devel mask in one or two devel cycles at least for a subset of
  controllers.

   - cgroup added a controller dependency mechanism so that block cgroup
     can depend on memory cgroup.  This will be used to finally support
     IO provisioning on the writeback traffic, which is currently being
     implemented.

   - The v2 interface now uses a separate table so that the interface
     files for the new interface are explicitly declared in one place.
     Each controller will explicitly review and add the files for the
     new interface.

   - cpuset is getting ready for the hierarchical behavior which is in
     the similar style with other controllers so that an ancestor's
     configuration change doesn't change the descendants' configurations
     irreversibly and processes aren't silently migrated when a CPU or
     node goes down.

  All the changes are to the new interface and no behavior changed for
  the multiple hierarchies"

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
  cpuset: fix the WARN_ON() in update_nodemasks_hier()
  cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
  cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
  cgroup: distinguish the default and legacy hierarchies when handling cftypes
  cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
  cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
  cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
  cpuset: export effective masks to userspace
  cpuset: allow writing offlined masks to cpuset.cpus/mems
  cpuset: enable onlined cpu/node in effective masks
  cpuset: refactor cpuset_hotplug_update_tasks()
  cpuset: make cs->{cpus, mems}_allowed as user-configured masks
  cpuset: apply cs->effective_{cpus,mems}
  cpuset: initialize top_cpuset's configured masks at mount
  cpuset: use effective cpumask to build sched domains
  cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
  cpuset: update cs->effective_{cpus, mems} when config changes
  cpuset: update cpuset->effective_{cpus,mems} at hotplug
  cpuset: add cs->effective_cpus and cs->effective_mems
  cgroup: clean up sane_behavior handling
  ...
2014-08-04 10:11:28 -07:00
Mikulas Patocka 6a24148361 block: use kmalloc alignment for bio slab
Various subsystems can ask the bio subsystem to create a bio slab cache
with some free space before the bio.  This free space can be used for any
purpose.  Device mapper uses this per-bio-data feature to place some
target-specific and device-mapper specific data before the bio, so that
the target-specific data doesn't have to be allocated separately.

This per-bio-data mechanism is used in place of kmalloc, so we need the
allocated slab to have the same memory alignment as memory allocated
with kmalloc.

Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
alignment when creating the slab cache.  This is needed so that dm-crypt
can use per-bio-data for encryption - the crypto subsystem assumes this
data will have the same alignment as kmalloc'ed memory.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@fb.com>
2014-08-01 12:30:34 -04:00
Tejun Heo 2a1b4cf233 blkcg: don't call into policy draining if root_blkg is already gone
While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL.  If someone else starts to drain
while the queue is in this state, the following oops happens.

  NULL pointer dereference at 0000000000000028
  IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  PGD e4a1067 PUD b773067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
  CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
  RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
  R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
  FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
  Stack:
   ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
   ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
   ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
  Call Trace:
   [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
   [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
   [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
   [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
   [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
   [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
   [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff81427505>] blk_put_queue+0x15/0x20
   [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
   [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
   [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
   [<ffffffff817930e2>] device_release+0x32/0xa0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff817934d7>] put_device+0x17/0x20
   [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
   [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
   [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
   [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
   [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
   [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
   [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
   [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b

776687bce4 ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.

Fix it by skippping calling into policy draining if all the blkgs are
already gone.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Cc: stable@vger.kernel.org
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-16 09:52:03 +02:00
Tejun Heo 2cf669a58d cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
Currently, cftypes added by cgroup_add_cftypes() are used for both the
unified default hierarchy and legacy ones and subsystems can mark each
file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
appear only on one of them.  This is quite hairy and error-prone.
Also, we may end up exposing interface files to the default hierarchy
without thinking it through.

cgroup_subsys will grow two separate cftype addition functions and
apply each only on the hierarchies of the matching type.  This will
allow organizing cftypes in a lot clearer way and encourage subsystems
to scrutinize the interface which is being exposed in the new default
hierarchy.

In preparation, this patch adds cgroup_add_legacy_cftypes() which
currently is a simple wrapper around cgroup_add_cftypes() and replaces
all cgroup_add_cftypes() usages with it.

While at it, this patch drops a completely spurious return from
__hugetlb_cgroup_file_init().

This patch doesn't introduce any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-07-15 11:05:09 -04:00
Tejun Heo 5577964e64 cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them.  This is quite hairy and error-prone.  Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.

cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type.  This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.

In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes.  This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-07-15 11:05:09 -04:00
Jens Axboe 26a337944e Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
This reverts commit 254c4407cb.

It causes crashes with cryptsetup, even after a few iterations and
updates. Drop it for now.
2014-07-14 22:04:47 +02:00
Mikulas Patocka 3b3a1814d1 block: provide compat ioctl for BLKZEROOUT
This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer
to two uint64_t values, so there is no need to translate it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# 3.7+
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-14 12:27:20 +02:00
Tejun Heo 0b462c89e3 blkcg: don't call into policy draining if root_blkg is already gone
While a queue is being destroyed, all the blkgs are destroyed and its
->root_blkg pointer is set to NULL.  If someone else starts to drain
while the queue is in this state, the following oops happens.

  NULL pointer dereference at 0000000000000028
  IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  PGD e4a1067 PUD b773067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
  CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
  RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
  RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
  RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
  R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
  FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
  Stack:
   ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
   ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
   ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
  Call Trace:
   [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
   [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
   [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
   [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
   [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
   [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
   [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff81427505>] blk_put_queue+0x15/0x20
   [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
   [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
   [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
   [<ffffffff817930e2>] device_release+0x32/0xa0
   [<ffffffff81454968>] kobject_cleanup+0x38/0x70
   [<ffffffff81454848>] kobject_put+0x28/0x60
   [<ffffffff817934d7>] put_device+0x17/0x20
   [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
   [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
   [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
   [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
   [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
   [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
   [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
   [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
   [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b

776687bce4 ("block, blk-mq: draining can't be skipped even if
bypass_depth was non-zero") made it easier to trigger this bug by
making blk_queue_bypass_start() drain even when it loses the first
bypass test to blk_cleanup_queue(); however, the bug has always been
there even before the commit as blk_queue_bypass_start() could race
against queue destruction, win the initial bypass test but perform the
actual draining after blk_cleanup_queue() already destroyed all blkgs.

Fix it by skippping calling into policy draining if all the blkgs are
already gone.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Cc: stable@vger.kernel.org
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-12 17:55:10 +02:00
Tejun Heo aa6ec29bee cgroup: remove sane_behavior support on non-default hierarchies
sane_behavior has been used as a development vehicle for the default
unified hierarchy.  Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies.  There are gonna be either the default hierarchy or
legacy ones.  Let's make that clear by removing sane_behavior support
on non-default hierarchies.

This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.

On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
2014-07-09 10:08:08 -04:00
Tejun Heo 1ced953b17 blkcg, memcg: make blkcg depend on memcg on the default hierarchy
Currently, the blkio subsystem attributes all of writeback IOs to the
root.  One of the issues is that there's no way to tell who originated
a writeback IO from block layer.  Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages.  The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

cgroup now has a mechanism to express such dependency -
cgroup_subsys->depends_on.  This patch declares that blkcg depends on
memcg so that memcg is enabled automatically on the default hierarchy
when available.  Future changes will make blkcg map the memcg tag to
find out the cgroup to blame for writeback IOs.

As this means that a memcg may be made invisible, this patch also
implements css_reset() for memcg which resets its basic
configurations.  This implementation will probably need to be expanded
to cover other states which are used in the default hierarchy.

v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure.  Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
2014-07-08 18:02:57 -04:00
Christoph Hellwig d45b3279a5 block: don't assume last put of shared tags is for the host
There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods.  Merge blk_free_tags and __blk_free_tags into a single
funtion that just release a references and get rid of the BUG() when the
host reference wasn't the last.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-08 12:25:28 +02:00
Maurizio Lombardi 254c4407cb bio: modify __bio_add_page() to accept pages that don't start a new segment
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of the fact the page we
are going to add can be merged into the last segment or not.

Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.

This patch modifies the function so it refuses to add a page only in case
the latter starts a new segment and the maximum number of segments has
already been reached.

The bug can be easily reproduced with the st driver:

1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE  to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
   dd: error writing `/dev/st0': Device or resource busy

[ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:55:15 -06:00
Douglas Gilbert d15156138d block SG_IO: add SG_FLAG_Q_AT_HEAD flag
After the SG_IO ioctl was copied into the block layer and
later into the bsg driver, subtle differences emerged.

One difference is the way injected commands are queued through
the block layer (i.e. this is not SCSI device queueing nor SATA
NCQ). Summarizing:
  - SG_IO on block layer device: blk_exec*(at_head=false)
  - sg device SG_IO: at_head=true
  - bsg device SG_IO: at_head=true

Some time ago Boaz Harrosh introduced a sg v4 flag called
BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A
recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag"
allowed the sg driver default to be overridden. This patch
allows a SG_IO ioctl sent to a block layer device to have
its default overridden.

ChangeLog:
    - introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause
      commands that are injected via a block layer
      device SG_IO ioctl to set at_head=true
    - make comments clearer about queueing in sg.h since the
      header is used both by the sg device and block layer
      device implementations of the SG_IO ioctl.
    - introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility
      (it does nothing) and update comments.

Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:48:05 -06:00
Akinobu Mita 9b4231bf99 block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved
buffer in bytes as int type.  The value needs to be capped at the request
queue's max_sectors.  But integer overflow is not correctly handled in
the calculation when converting max_sectors from sectors to bytes.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:43:09 -06:00
Akinobu Mita 63f2649659 block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
BLKSECTGET ioctl loads the request queue's max_sectors as unsigned
short value to the argument pointer.  So if the max_sector is greater
than USHRT_MAX, the upper 16 bits of that is just discarded.

In such case, USHRT_MAX is more preferable than the lower 16 bits of
max_sectors.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:43:07 -06:00
Fabian Frederick 16e1556526 block/partitions/efi.c: kerneldoc fixing
Adding function documentation and fixing kerneldoc warnings
('field: description' uniformization).

Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:40:05 -06:00
Fabian Frederick dce14c239a block/partitions/msdos.c: code clean-up
checkpatch fixing:
WARNING: Missing a blank line after declarations
WARNING: space prohibited between function name and open parenthesis '('
ERROR: spaces required around that '<' (ctx:VxV)

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:40:03 -06:00
Fabian Frederick 600ffc5ead block/partitions/amiga.c: replace nolevel printk by pr_err
Also add no prefix pr_fmt to avoid any future default format update

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:40:02 -06:00
Fabian Frederick 472d5e2af2 block/partitions/aix.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:40:00 -06:00
Gu Zheng cbcd1054a1 bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
Martin introduces the function bip_integrity_vecs(get the useful vectors)
to fix the issue about nr_vecs for inline integrity vectors that reported
by David Milburn.

But it seems that bip_integrity_vecs() will return the wrong number if the
bio is not based on any bio_set for some reason(bio->bi_pool == NULL),
because in that case, the bip_inline_vecs[0] is malloced directly.  So
here we add the bip_max_vcnt to record the count of vector slots, and
cleanup the function bip_integrity_vecs().

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:36:47 -06:00
Tejun Heo add703fda9 blk-mq: use percpu_ref for mq usage count
Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight.  The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.

This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq.  percpu_ref can serve the same purpose after the recent
changes.  This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.

blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put.  blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:34:38 -06:00
Tejun Heo 72d6f02a8d blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified.  Let's flatten it into
its caller.

This patch doesn't make any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:33:02 -06:00
Tejun Heo 780db2071a blk-mq: decouble blk-mq freezing from generic bypassing
blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order.  This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.

However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones.  This causes problems as bypassing assumes that request
processing is online.  blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.

Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones.  This shouldn't break
anything but again highlights that something isn't quite right here.

The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms.  The only intersecting purpose
that they serve is during queue cleanup.  Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.

* request_queue->mq_freeze_depth is added and
  blk_mq_[un]freeze_queue() now operate on this counter instead of
  ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
  but the counter is tested directly.  This will be further updated by
  later changes.

* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
  blk_mq_freeze_queue().  Queue cleanup path now calls
  blk_mq_freeze_queue() directly.

* blk_queue_enter()'s fast path condition is simplified to simply
  check @q->mq_freeze_depth.  Previously, the condition was

	!blk_queue_dying(q) &&
	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

  mq_freeze_depth is incremented right after dying is set and
  blk_queue_init_done() exception isn't necessary as blk-mq doesn't
  start frozen, which only leaves the blk_queue_bypass() test which
  can be replaced by @q->mq_freeze_depth test.

This change simplifies the code and reduces confusion in the area.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:31:13 -06:00
Tejun Heo 776687bce4 block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero.  The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining.  The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.

Fix it by draining regardless of bypass_depth.  We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot.  INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip inbetween.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:29:17 -06:00
Tejun Heo 531ed6261e blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
blk-mq uses a percpu_counter to keep track of how many usages are in
flight.  The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.

blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
if freezing happens inbetween the caller can slip through and freezing
can be complete while there are active users.

Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-07-01 10:27:06 -06:00
Linus Torvalds 3493860c76 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of fixes/changes for the current series.  This
  contains:

   - Removal of dead code from Gu Zheng.

   - Revert of two bad fixes that went in earlier in this round, marking
     things as __init that were not purely used from init.

   - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(),
     which could place us wrongly.  Make it use the non __ variant,
     which handles cases where we are called from the wrong CPU set.
     From me.

   - A fix for drbd, which allocates discard requests without room for
     the SCSI payload.  From Lars Ellenberg.

   - A fix for user-after-free in the blkcg code from Tejun.

   - Addition of limiting gaps in SG lists, if the hardware needs it.
     This is the last pre-req patch for blk-mq to enable the full NVMe
     conversion.  Could wait until 3.17, but it's simple enough so would
     be nice to have everything we need for the NVMe port in the 3.17
     release.  From me"

* 'for-linus' of git://git.kernel.dk/linux-block:
  drbd: fix NULL pointer deref in blk_add_request_payload
  blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
  block: add support for limiting gaps in SG lists
  bio: remove unused macro bip_vec_idx()
  Revert "block: add __init to elv_register"
  Revert "block: add __init to blkcg_policy_register"
  blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  floppy: format block0 read error message properly
2014-06-26 13:06:13 -07:00
Jens Axboe 0ffbce80c2 blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
Currently it calls __blk_mq_run_hw_queue(), which depends on the
CPU placement being correct. This means it's not possible to call
blk_mq_start_hw_queues(q) from a context that is correct for all
queues, leading to triggering the

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));

in __blk_mq_run_hw_queue().

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-25 08:22:34 -06:00
Jens Axboe 66cb45aa41 block: add support for limiting gaps in SG lists
Another restriction inherited for NVMe - those devices don't support
SG lists that have "gaps" in them. Gaps refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-24 16:22:24 -06:00
Jens Axboe e567bf7112 Revert "block: add __init to elv_register"
This reverts commit b5097e956a.

The original commit is buggy, we do use the registration functions
at runtime, for instance when loading IO schedulers through sysfs.

Reported-by: Damien Wyart <damien.wyart@gmail.com>
2014-06-22 16:34:11 -06:00
Jens Axboe d5bf02914e Revert "block: add __init to blkcg_policy_register"
This reverts commit a2d445d440.

The original commit is buggy, we do use the registration functions
at runtime for modular builds.
2014-06-22 16:34:11 -06:00
Tejun Heo a5049a8ae3 blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
Hello,

So, this patch should do.  Joe, Vivek, can one of you guys please
verify that the oops goes away with this patch?

Jens, the original thread can be read at

  http://thread.gmane.org/gmane.linux.kernel/1720729

The fix converts blkg->refcnt from int to atomic_t.  It does some
overhead but it should be minute compared to everything else which is
going on and the involved cacheline bouncing, so I think it's highly
unlikely to cause any noticeable difference.  Also, the refcnt in
question should be converted to a perpcu_ref for blk-mq anyway, so the
atomic_t is likely to go away pretty soon anyway.

Thanks.

------- 8< -------
__blkg_release_rcu() may be invoked after the associated request_queue
is released with a RCU grace period inbetween.  As such, the function
and callbacks invoked from it must not dereference the associated
request_queue.  This is clearly indicated in the comment above the
function.

Unfortunately, while trying to fix a different issue, 2a4fd070ee
("blkcg: move bulk of blkcg_gq release operations to the RCU
callback") ignored this and added [un]locking of @blkg->q->queue_lock
to __blkg_release_rcu().  This of course can cause oops as the
request_queue may be long gone by the time this code gets executed.

  general protection fault: 0000 [#1] SMP
  CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
  Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
  task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
  RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
  RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
  RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
  RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
  RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
  R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
  R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
  Stack:
   ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
   ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
   ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
  Call Trace:
   [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
   [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
   [<ffffffff81091d81>] kthread+0xe1/0x100
   [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
  Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
  +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
  +b7
  RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
   RSP <ffff88085403fdf0>

The request_queue locking was added because blkcg_gq->refcnt is an int
protected with the queue lock and __blkg_release_rcu() needs to put
the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
dropping queue locking in the function.

Given the general heavy weight of the current request_queue and blkcg
operations, this is unlikely to cause any noticeable overhead.
Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
the near future, so whatever (most likely negligible) overhead it may
add is temporary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-22 16:30:52 -06:00
Linus Torvalds f1d702487b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A smaller collection of fixes for the block core that would be nice to
  have in -rc2.  This pull request contains:

   - Fixes for races in the wait/wakeup logic used in blk-mq from
     Alexander.  No issues have been observed, but it is definitely a
     bit flakey currently.  Alternatively, we may drop the cyclic
     wakeups going forward, but that needs more testing.

   - Some cleanups from Christoph.

   - Fix for an oops in null_blk if queue_mode=1 and softirq completions
     are used.  From me.

   - A fix for a regression caused by the chunk size setting.  It
     inadvertently used max_hw_sectors instead of max_sectors, which is
     incorrect, and causes hangs on btrfs multi-disk setups (where hw
     sectors apparently isn't set).  From me.

   - Removal of WQ_POWER_EFFICIENT in the kblockd creation.  This was a
     recent addition as well, but it actually breaks blk-mq which relies
     on strict scheduling.  If the workqueue power_efficient mode is
     turned on, this breaks blk-mq.  From Matias.

   - null_blk module parameter description fix from Mike"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: bitmap tag: fix races in bt_get() function
  blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
  blk-mq: bitmap tag: fix races on shared ::wake_index fields
  block: blk_max_size_offset() should check ->max_sectors
  null_blk: fix softirq completions for queue_mode == 1
  blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
  blk-mq: properly drain stopped queues
  block: remove WQ_POWER_EFFICIENT from kblockd
  null_blk: fix name and description of 'queue_mode' module parameter
  block: remove elv_abort_queue and blk_abort_flushes
2014-06-19 17:56:43 -10:00
Alexander Gordeev 86fb5c56cf blk-mq: bitmap tag: fix races in bt_get() function
This update fixes few issues in bt_get() function:

- list_empty(&wait.task_list) check is not protected;

- was_empty check is always true which results in *every* thread
  entering the loop resets bt_wait_state::wait_cnt counter rather
  than every bt->wake_cnt'th thread;

- 'bt_wait_state::wait_cnt' counter update is redundant, since
  it also gets reset in bt_clear_tag() function;

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-17 22:13:08 -07:00
Alexander Gordeev 2971c35f35 blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
This piece of code in bt_clear_tag() function is racy:

	bs = bt_wake_ptr(bt);
	if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
		atomic_set(&bs->wait_cnt, bt->wake_cnt);
 		wake_up(&bs->wait);
	}

Since nothing prevents bt_wake_ptr() from returning the very
same 'bs' address on multiple CPUs, the following scenario is
possible:

    CPU1                                CPU2
    ----                                ----

0.  bs = bt_wake_ptr(bt);               bs = bt_wake_ptr(bt);
1.  atomic_dec_and_test(&bs->wait_cnt)
2.                                      atomic_dec_and_test(&bs->wait_cnt)
3.  atomic_set(&bs->wait_cnt, bt->wake_cnt);

If the decrement in [1] yields zero then for some amount of time
the decrement in [2] results in a negative/overflow value, which
is not expected. The follow-up assignment in [3] overwrites the
invalid value with the batch value (and likely prevents the issue
from being severe) which is still incorrect and should be a lesser.

Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-17 22:13:05 -07:00
Alexander Gordeev 8537b12034 blk-mq: bitmap tag: fix races on shared ::wake_index fields
Fix racy updates of shared blk_mq_bitmap_tags::wake_index
and blk_mq_hw_ctx::wake_index fields.

Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-17 22:12:35 -07:00
Linus Torvalds b55b390202 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe update from Matthew Wilcox:
 "Mostly bugfixes again for the NVMe driver.  I'd like to call out the
  exported tracepoint in the block layer; I believe Keith has cleared
  this with Jens.

  We've had a few reports from people who're really pounding on NVMe
  devices at scale, hence the timeout changes (and new module
  parameters), hotplug cpu deadlock, tracepoints, and minor performance
  tweaks"

[ Jens hadn't seen that tracepoint thing, but is ok with it - it will
  end up going away when mq conversion happens ]

* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
  NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
  NVMe: Use Log Page constants in SCSI emulation
  NVMe: Define Log Page constants
  NVMe: Fix hot cpu notification dead lock
  NVMe: Rename io_timeout to nvme_io_timeout
  NVMe: Use last bytes of f/w rev SCSI Inquiry
  NVMe: Adhere to request queue block accounting enable/disable
  NVMe: Fix nvme get/put queue semantics
  NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
  NVMe: Make admin timeout a module parameter
  NVMe: Make iod bio timeout a parameter
  NVMe: Prevent possible NULL pointer dereference
  NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
  NVMe: Update data structures for NVMe 1.2
  NVMe: Enable BUILD_BUG_ON checks
  NVMe: Update namespace and controller identify structures to the 1.1a spec
  NVMe: Flush with data support
  NVMe: Configure support for block flush
  NVMe: Add tracepoints
  NVMe: Protect against badly formatted CQEs
  ...
2014-06-15 15:58:03 -10:00
Christoph Hellwig 95ed068165 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-13 12:17:40 -06:00
Christoph Hellwig 8f5280f4ee blk-mq: properly drain stopped queues
If we need to drain a queue we need to run all queues, even if they
are marked stopped to make sure the driver has a chance to error out
on all queued requests.

This fixes surprise removal with scsi-mq.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-13 12:17:38 -06:00
Matias Bjørling 28747fcd22 block: remove WQ_POWER_EFFICIENT from kblockd
blk-mq issues async requests through kblockd. To issue a work request on
a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
specific CPU choice may not be honored, if the power_efficient option
for workqueues is set. blk-mq requires that we have strict per-cpu
scheduling, so it wont work properly if kblockd is marked
POWER_EFFICIENT and power_efficient is set.

Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
This essentially reverts part of commit 695588f945, which added
the WQ_POWER_EFFICIENT marker to kblockd.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-11 15:53:39 -06:00
Christoph Hellwig 2940474af7 block: remove elv_abort_queue and blk_abort_flushes
elv_abort_queue has no callers, and blk_abort_flushes is only called by
elv_abort_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-11 15:31:21 -06:00
Linus Torvalds 23d4ed53b7 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Final small batch of fixes to be included before -rc1.  Some general
  cleanups in here as well, but some of the blk-mq fixes we need for the
  NVMe conversion and/or scsi-mq.  The pull request contains:

   - Support for not merging across a specified "chunk size", if set by
     the driver.  Some NVMe devices perform poorly for IO that crosses
     such a chunk, so we need to support it generically as part of
     request merging avoid having to do complicated split logic.  From
     me.

   - Bump max tag depth to 10Ki tags.  Some scsi devices have a huge
     shared tag space.  Before we failed with EINVAL if a too large tag
     depth was specified, now we truncate it and pass back the actual
     value.  From me.

   - Various blk-mq rq init fixes from me and others.

   - A fix for enter on a dying queue for blk-mq from Keith.  This is
     needed to prevent oopsing on hot device removal.

   - Fixup for blk-mq timer addition from Ming Lei.

   - Small round of performance fixes for mtip32xx from Sam Bradshaw.

   - Minor stack leak fix from Rickard Strandqvist.

   - Two __init annotations from Fabian Frederick"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: add __init to blkcg_policy_register
  block: add __init to elv_register
  block: ensure that bio_add_page() always accepts a page for an empty bio
  blk-mq: add timer in blk_mq_start_request
  blk-mq: always initialize request->start_time
  block: blk-exec.c: Cleaning up local variable address returnd
  mtip32xx: minor performance enhancements
  blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
  blk-mq: don't allow queue entering for a dying queue
  blk-mq: bump max tag depth to 10K tags
  block: add blk_rq_set_block_pc()
  block: add notion of a chunk size for request merging
2014-06-11 08:41:17 -07:00
Fabian Frederick a2d445d440 block: add __init to blkcg_policy_register
blkcg_policy_register is only called by
__init functions:

__init cfq_init
__init throtl_init

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-10 13:13:12 -06:00
Fabian Frederick b5097e956a block: add __init to elv_register
elv_register is only called by elevator init functions:

__init cfq_init
__init deadline_init
__init noop_init

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-10 13:13:11 -06:00
Jens Axboe 58a4915ad2 block: ensure that bio_add_page() always accepts a page for an empty bio
With commit 762380ad93 added support for chunk sizes and no merging
across them, it broke the rule of always allowing adding of a single
page to an empty bio. So relax the restriction a bit to allow for that,
similarly to what we have always done.

This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe.

Reported-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-10 12:53:56 -06:00
Linus Torvalds 14208b0ec5 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot of activities on cgroup side.  Heavy restructuring including
  locking simplification took place to improve the code base and enable
  implementation of the unified hierarchy, which currently exists behind
  a __DEVEL__ mount option.  The core support is mostly complete but
  individual controllers need further work.  To explain the design and
  rationales of the the unified hierarchy

        Documentation/cgroups/unified-hierarchy.txt

  is added.

  Another notable change is css (cgroup_subsys_state - what each
  controller uses to identify and interact with a cgroup) iteration
  update.  This is part of continuing updates on css object lifetime and
  visibility.  cgroup started with reference count draining on removal
  way back and is now reaching a point where csses behave and are
  iterated like normal refcnted objects albeit with some complexities to
  allow distinguishing the state where they're being deleted.  The css
  iteration update isn't taken advantage of yet but is planned to be
  used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
  cgroup: disallow disabled controllers on the default hierarchy
  cgroup: don't destroy the default root
  cgroup: disallow debug controller on the default hierarchy
  cgroup: clean up MAINTAINERS entries
  cgroup: implement css_tryget()
  device_cgroup: use css_has_online_children() instead of has_children()
  cgroup: convert cgroup_has_live_children() into css_has_online_children()
  cgroup: use CSS_ONLINE instead of CGRP_DEAD
  cgroup: iterate cgroup_subsys_states directly
  cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
  cgroup: move cgroup->serial_nr into cgroup_subsys_state
  cgroup: link all cgroup_subsys_states in their sibling lists
  cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
  cgroup: remove cgroup->parent
  device_cgroup: remove direct access to cgroup->children
  memcg: update memcg_has_children() to use css_next_child()
  memcg: remove tasks/children test from mem_cgroup_force_empty()
  cgroup: remove css_parent()
  cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
  cgroup: use cgroup->self.refcnt for cgroup refcnting
  ...
2014-06-09 15:03:33 -07:00
Ming Lei 2b8393b43e blk-mq: add timer in blk_mq_start_request
This way will become consistent with non-mq case, also
avoid to update rq->deadline twice for mq.

The comment said: "We do this early, to ensure we are on
the right CPU.", but no percpu stuff is used in blk_add_timer(),
so it isn't necessary. Even when inserting from plug list, there
is no such guarantee at all.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-09 10:20:06 -06:00
Jens Axboe 3ee3237239 blk-mq: always initialize request->start_time
The blk-mq core only initializes this if io stats are enabled, since
blk-mq only reads the field in that case. But drivers could
potentially use it internally, so ensure that we always set it to
the current time when the request is allocated.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-09 09:36:53 -06:00
Rickard Strandqvist de83953f9d block: blk-exec.c: Cleaning up local variable address returnd
Address of local variable assigned to a function parameter

This was partly found using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-08 19:51:31 -06:00
Mitchel Humpherys b1de0d139c mm: convert some level-less printks to pr_*
printk is meant to be used with an associated log level.  There are some
instances of printk scattered around the mm code where the log level is
missing.  Add a log level and adhere to suggestions by
scripts/checkpatch.pl by moving to the pr_* macros.

Also add the typical pr_fmt definition so that print statements can be
easily traced back to the modules where they occur, correlated one with
another, etc.  This will require the removal of some (now redundant)
prefixes on a few print statements.

Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:18 -07:00
Jens Axboe f6be4fb4bc blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
It'll be used in blk_mq_start_request() to set a potential timeout
for the request, so clear it to zero at alloc time to ensure that
we know if someone has set it or not.

Fixes random early timeouts on NVMe testing.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06 11:05:25 -06:00
Keith Busch 3b632cf0ea blk-mq: don't allow queue entering for a dying queue
If the queue is going away, don't let new allocs or queueing
happen on it. Go through the normal wait process, and exit with
ENODEV in that case.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06 10:40:03 -06:00
Jens Axboe a4391c6465 blk-mq: bump max tag depth to 10K tags
For some scsi-mq cases, the tag map can be huge. So increase the
max number of tags we support.

Additionally, don't fail with EINVAL if a user requests too many
tags. Warn that the tag depth has been adjusted down, and store
the new value inside the tag_set passed in.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06 08:04:46 -06:00
Jens Axboe f27b087b81 block: add blk_rq_set_block_pc()
With the optimizations around not clearing the full request at alloc
time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
up to the user allocating the request.

Add a blk_rq_set_block_pc() that sets the command type to
REQ_TYPE_BLOCK_PC, and properly initializes the members associated
with this type of request. Update callers to use this function instead
of manipulating rq->cmd_type directly.

Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
attempt.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-06 07:57:37 -06:00
Jens Axboe 762380ad93 block: add notion of a chunk size for request merging
Some drivers have different limits on what size a request should
optimally be, depending on the offset of the request. Similar to
dividing a device into chunks. Add a setting that allows the driver
to inform the block layer of such a chunk size. The block layer will
then prevent merging across the chunks.

This is needed to optimally support NVMe with a non-zero stripe size.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-05 13:38:39 -06:00
Ming Lei 14b83e172f block: mq flush: clear flush_rq's tag in flush_end_io()
blk_mq_tag_to_rq() needs to be able to tell if it should return
the original request, or the flush request if we are doing a flush
sequence. Clear the flush tag when IO completes for a flush, since
that is what we are comparing against.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-04 10:40:16 -06:00
Jens Axboe 0e62f51f87 blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter
We currently pass in the hardware queue, and get the tags from there.
But from scsi-mq, with a shared tag space, it's a lot more convenient
to pass in the blk_mq_tags instead as the hardware queue isn't always
directly available. So instead of having to re-map to a given
hardware queue from rq->mq_ctx, just pass in the tags structure.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-04 10:23:49 -06:00
Jens Axboe f899fed442 blk-mq: fix regression from commit 624dbe4754
When the code was collapsed to avoid duplication, the recent patch
for ensuring that a queue is idled before free was dropped, which was
added by commit 19c5d84f14.

Add back the blk_mq_tag_idle(), to ensure we don't leak a reference
to an active queue when it is freed.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-04 09:11:53 -06:00
Jens Axboe ff87bcec19 blk-mq: handle NULL req return from blk_map_request in single queue mode
blk_mq_map_request() can return NULL if we fail entering the queue
(dying, or removed), in which case it has already ended IO on the
bio. So nothing more to do, except just return.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-03 21:04:39 -06:00
Ming Lei e6cdb0929f blk-mq: fix sparse warning on missed __percpu annotation
'struct blk_mq_ctx' is  __percpu, so add the annotation
and fix the sparse warning reported from Fengguang:

	[block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect
	type in initializer (different address spaces)

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-03 21:04:39 -06:00
Ming Lei cb96a42cc1 blk-mq: fix schedule from atomic context
blk_mq_put_ctx() has to be called before io_schedule() in
bt_get().

This patch fixes the problem by taking similar approach from
percpu_ida allocation for the situation.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-03 21:04:39 -06:00
Ming Lei 1aecfe4887 blk-mq: move blk_mq_get_ctx/blk_mq_put_ctx to mq private header
The blk-mq tag code need these helpers.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-06-03 21:04:38 -06:00
Linus Torvalds 776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Linus Torvalds 80081ec309 Merge branch 'for-3.16/drivers' of git://git.kernel.dk/linux-block into next
Pull block driver changes from Jens Axboe:
 "Now that the core bits are in, here's the pull request for the driver
  related changes for 3.16.  Nothing out of the ordinary here, mostly
  business as usual.  There are a few pulls of for-3.16/core into this
  branch, which were done when the blk-mq was modified after the
  mtip32xx conversion was put in.

  The pull request contains:

   - skd and cciss converted to use pci_enable_msix_exact().  From
     Alexander Gordeev.

   - A few mtip32xx fixes from Asai @ Micron.

   - The conversion of mtip32xx from make_request_fn to blk-mq, and a
     later small fix for that conversion on quiescing for non-queued IO.
     From me.

   - A fix for bsg to use an exported function to check whether this
     driver is request based or not.  Needed updating for blk-mq, which
     is request based, but does not have a request_fn hook.  From me.

   - Small floppy bug fix from Jiri.

   - A series of cleanups for the cdrom uniform layer from Joe Perches.
     Gets rid of various old ugly macros, making the code conform more
     to the modern coding style.

   - A series of patches for drbd from the drbd crew (Lars Ellenberg and
     Philipp Reisner).

   - A use-after-free fix for null_blk from Ming Lei.

   - Also from Ming Lei is a performance patch for virtio-blk, which can
     net us a 3x win on kvm platforms where world notification is
     expensive.

   - Ming Lei also fixed a stall issue in virtio-blk, due to a race
     between queue start/stop and resource limits.

   - A small batch of fixes for xen-blk{back,front} from Olaf Hering and
     Valentin Priescu"

* 'for-3.16/drivers' of git://git.kernel.dk/linux-block: (54 commits)
  block: virtio_blk: don't hold spin lock during world switch
  xen-blkback: defer freeing blkif to avoid blocking xenwatch
  xen blkif.h: fix comment typo in discard-alignment
  xen/blkback: disable discard feature if requested by toolstack
  xen-blkfront: remove type check from blkfront_setup_discard
  floppy: do not corrupt bio.bi_flags when reading block 0
  mtip32xx: move error handling to service thread
  virtio_blk: fix race between start and stop queue
  mtip32xx: stop block hardware queues before quiescing IO
  mtip32xx: blk_mq_init_queue() returns an ERR_PTR
  mtip32xx: convert to use blk-mq
  cdrom: Remove unnecessary prototype for cdrom_get_disc_info
  cdrom: Remove unnecessary prototype for cdrom_mrw_exit
  cdrom: Remove cdrom_count_tracks prototype
  cdrom: Remove cdrom_get_next_writeable prototype
  cdrom: Remove cdrom_get_last_written prototype
  cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype
  cdrom: Remove unnecessary sanitize_format prototype
  cdrom: Remove unnecessary check_for_audio_disc prototype
  cdrom: Remove prototype for open_for_data
  ...
2014-06-02 13:57:01 -07:00
Linus Torvalds 681a289548 Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into next
Pull block core updates from Jens Axboe:
 "It's a big(ish) round this time, lots of development effort has gone
  into blk-mq in the last 3 months.  Generally we're heading to where
  3.16 will be a feature complete and performant blk-mq.  scsi-mq is
  progressing nicely and will hopefully be in 3.17.  A nvme port is in
  progress, and the Micron pci-e flash driver, mtip32xx, is converted
  and will be sent in with the driver pull request for 3.16.

  This pull request contains:

   - Lots of prep and support patches for scsi-mq have been integrated.
     All from Christoph.

   - API and code cleanups for blk-mq from Christoph.

   - Lots of good corner case and error handling cleanup fixes for
     blk-mq from Ming Lei.

   - A flew of blk-mq updates from me:

     * Provide strict mappings so that the driver can rely on the CPU
       to queue mapping.  This enables optimizations in the driver.

     * Provided a bitmap tagging instead of percpu_ida, which never
       really worked well for blk-mq.  percpu_ida relies on the fact
       that we have a lot more tags available than we really need, it
       fails miserably for cases where we exhaust (or are close to
       exhausting) the tag space.

     * Provide sane support for shared tag maps, as utilized by scsi-mq

     * Various fixes for IO timeouts.

     * API cleanups, and lots of perf tweaks and optimizations.

   - Remove 'buffer' from struct request.  This is ancient code, from
     when requests were always virtually mapped.  Kill it, to reclaim
     some space in struct request.  From me.

   - Remove 'magic' from blk_plug.  Since we store these on the stack
     and since we've never caught any actual bugs with this, lets just
     get rid of it.  From me.

   - Only call part_in_flight() once for IO completion, as includes two
     atomic reads.  Hopefully we'll get a better implementation soon, as
     the part IO stats are now one of the more expensive parts of doing
     IO on blk-mq.  From me.

   - File migration of block code from {mm,fs}/ to block/.  This
     includes bio.c, bio-integrity.c, bounce.c, and ioprio.c.  From me,
     from a discussion on lkml.

  That should describe the meat of the pull request.  Also has various
  little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong,
  Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam
  Bradshaw"

* 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits)
  blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()
  blk-mq: remember to start timeout handler for direct queue
  block: ensure that the timer is always added
  blk-mq: blk_mq_unregister_hctx() can be static
  blk-mq: make the sysfs mq/ layout reflect current mappings
  blk-mq: blk_mq_tag_to_rq should handle flush request
  block: remove dead code in scsi_ioctl:blk_verify_command
  blk-mq: request initialization optimizations
  block: add queue flag for disabling SG merging
  block: remove 'magic' from struct blk_plug
  blk-mq: remove alloc_hctx and free_hctx methods
  blk-mq: add file comments and update copyright notices
  blk-mq: remove blk_mq_alloc_request_pinned
  blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
  blk-mq: remove blk_mq_wait_for_tags
  blk-mq: initialize request in __blk_mq_alloc_request
  blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
  blk-mq: add helper to insert requests from irq context
  blk-mq: remove stale comment for blk_mq_complete_request()
  blk-mq: allow non-softirq completions
  ...
2014-06-02 09:29:34 -07:00
Jens Axboe ed851860b4 blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()
We have callers outside of the blk-mq proper (like timeouts) that
want to call __blk_mq_complete_request(), so rename the function
and put the decision code for whether to use ->softirq_done_fn
or blk_mq_endio() into __blk_mq_complete_request().

This also makes the interface more logical again.
blk_mq_complete_request() attempts to atomically mark the request
completed, and calls __blk_mq_complete_request() if successful.
__blk_mq_complete_request() then just ends the request.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 21:20:50 -06:00
Jens Axboe feff689412 blk-mq: remember to start timeout handler for direct queue
Commit 07068d5b8e added a direct-to-hw-queue mode, but this mode
needs to remember to add the request timeout handler as well.
Without it, we don't track timeouts for these requests.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 15:42:56 -06:00
Jens Axboe c7bca4183f block: ensure that the timer is always added
Commit f793aa5378 relaxed the timer addition a little too much.
If the timer isn't pending, we always need to add it.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 15:41:39 -06:00
Fengguang Wu ee3c5db089 blk-mq: blk_mq_unregister_hctx() can be static
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 10:31:13 -06:00
Jens Axboe 67aec14ce8 blk-mq: make the sysfs mq/ layout reflect current mappings
Currently blk-mq registers all the hardware queues in sysfs,
regardless of whether it uses them (e.g. they have CPU mappings)
or not. The unused hardware queues lack the cpux/ directories,
and the other sysfs entries (like active, pending, etc) are all
zeroes.

Change this so that sysfs correctly reflects the current mappings
of the hardware queues.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:25:36 -06:00
Jens Axboe f89ca16646 Merge branch 'for-3.16/core' into for-3.16/drivers
Pulled in for the blk_mq_tag_to_rq() change, which impacts
mtip32xx.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:11:50 -06:00
Shaohua Li 2230237500 blk-mq: blk_mq_tag_to_rq should handle flush request
flush request is special, which borrows the tag from the parent
request. Hence blk_mq_tag_to_rq needs special handling to return
the flush request from the tag.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-30 08:06:42 -06:00
Dave Jones da52f22fa9 block: remove dead code in scsi_ioctl:blk_verify_command
filter gets assigned the address of blk_default_cmd_filter on
entry to this function, so the !filter condition can never be true.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-29 13:38:50 -06:00
Jens Axboe 4b570521be blk-mq: request initialization optimizations
We currently clear a lot more than we need to, so make that a bit
more clever. Make some of the init dependent on features, like
only setting start_time if we are going to use it.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-29 11:00:11 -06:00
Jens Axboe 05f1dd5315 block: add queue flag for disabling SG merging
If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.

Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-29 09:53:32 -06:00
Jens Axboe 4d92a9beb3 block: remove 'magic' from struct blk_plug
I don't think we've ever caught any bugs with this, and there's the
list poisoning for the plug lists to catch uninitialized cases.
So remove the magic member and save 8 bytes in the struct.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-29 08:09:00 -06:00
Jens Axboe 0fb662e225 Merge branch 'for-3.16/core' into for-3.16/drivers
Pull in core changes (again), since we got rid of the alloc/free
hctx mq_ops hooks and mtip32xx then needed updating again.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 10:18:51 -06:00
Christoph Hellwig cdef54dd85 blk-mq: remove alloc_hctx and free_hctx methods
There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 10:18:31 -06:00
Jens Axboe 75bb4625bb blk-mq: add file comments and update copyright notices
None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 10:15:41 -06:00
Jens Axboe 6178976500 Merge branch 'for-3.16/core' into for-3.16/drivers
mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the
core changes so we have a properly merged end result.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:50:26 -06:00
Christoph Hellwig d852564f8c blk-mq: remove blk_mq_alloc_request_pinned
We now only have one caller left and can open code it there in a cleaner
way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:49:27 -06:00
Christoph Hellwig 793597a6a9 blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it.  Just call __blk_mq_alloc_request to wait directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:49:25 -06:00
Christoph Hellwig a3bd77567c blk-mq: remove blk_mq_wait_for_tags
The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.

Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:49:23 -06:00
Christoph Hellwig 5dee857720 blk-mq: initialize request in __blk_mq_alloc_request
Both callers if __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:49:21 -06:00
Christoph Hellwig 4ce01dd1a0 blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 09:49:19 -06:00
Christoph Hellwig 6fca6a611c blk-mq: add helper to insert requests from irq context
Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes.  Replace them with a per-request_queue list of requests to
be requeued.

Based on an earlier test by Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-28 08:08:02 -06:00
Jens Axboe 95f0968499 blk-mq: allow non-softirq completions
Right now we export two ways of completing a request:

1) blk_mq_complete_request(). This uses an IPI (if needed) and
   completes through q->softirq_done_fn(). It also works with
   timeouts.

2) blk_mq_end_io(). This completes inline, and ignores any timeout
   state of the request.

Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27 17:46:48 -06:00
Jens Axboe f14bbe77a9 blk-mq: pass in suggested NUMA node to ->alloc_hctx()
Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.

So just pass in the suggested node directly to the alloc
function.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27 12:06:53 -06:00
Ming Lei 3d2936f457 block: only allocate/free mq_usage_counter in blk-mq
The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27 09:37:08 -06:00
Ming Lei 624dbe4754 blk-mq: avoid code duplication
blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27 09:37:06 -06:00
Ming Lei 1f9f07e917 blk-mq: fix leak of hctx->ctx_map
hctx->ctx_map should have been freed inside blk_mq_free_queue().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-27 08:34:45 -06:00
Fabian Frederick 35086784ca block/blk-lib.c: make __blkdev_issue_zeroout static
__blkdev_issue_zeroout is only used in blk-lib.c

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-26 17:39:09 -06:00
Christoph Hellwig 19c5d84f14 blk-mq: idle all hardware contexts before freeing a queue
Without this we can leak the active_queues reference if a command is
freed while it is considered active.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-26 17:21:51 -06:00
Jens Axboe c22d9d8a60 blk-mq: allow setting of per-request timeouts
Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-23 14:14:57 -06:00
Sam Bradshaw edf866b380 blk-mq: export blk_mq_tag_busy_iter
Export the blk-mq in-flight tag iterator for driver consumption.
This is particularly useful in exception paths or SRSI where
in-flight IOs need to be cancelled and/or reissued. The NVMe driver
conversion will use this.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-23 13:30:16 -06:00
Jens Axboe 07068d5b8e blk-mq: split make request handler for multi and single queue
We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
  for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
  we want to go straight to hardware for SYNC IO.

Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-22 10:43:07 -06:00
Jens Axboe 484b4061e6 blk-mq: save memory by freeing requests on unused hardware queues
Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-21 14:01:15 -06:00
Jens Axboe e814e71ba4 blk-mq: allow the hctx cpu hotplug notifier to return errors
Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-21 13:59:08 -06:00