1
0
Fork 0
Commit Graph

740 Commits (redonkable)

Author SHA1 Message Date
Bart Van Assche 21e768b442 blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:33:56 -06:00
weiping zhang c2e82a2348 blk-mq: fix nr_requests wrong value when modify it from sysfs
if blk-mq use "none" io scheduler, nr_request get a wrong value when
input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
the smaller one min(nr, set->queue_depth), and then q->nr_request get a
wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:21:06 -06:00
Christoph Hellwig ea435e1b93 block: add a poll_fn callback to struct request_queue
That we we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Ming Lei 1f460b63d4 blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:34 -06:00
Ming Lei b347689ffb blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei de14829740 blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Shaohua Li 85acb3ba2f block: set request_list for request
Legacy queue sets request's request_list, mq doesn't. This makes mq does
the same thing, so we can find cgroup of a request. Note, we really
only use blkg field of request_list, it's pointless to allocate mempool
for request_list in mq case.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-10 13:48:16 -06:00
Jens Axboe fc13457f74 blk-mq: document the need to have STARTED and COMPLETED share a byte
For memory ordering guarantees on stores, we need to ensure that
these two bits share the same byte of storage in the unsigned
long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
we don't violate this requirement.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:22:24 -06:00
Peter Zijlstra a7af0af321 blk-mq: attempt to fix atomic flag memory ordering
Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

Also down-grade the barrier to smp_wmb(), this is cheaper for
PowerPC/ARM and doesn't cost anything extra on x86.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Fixes: 538b753418 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-04 11:20:11 -06:00
Jens Axboe 7beb2f845b blk-mq: wire up completion notifier for laptop mode
For some reason, the laptop mode IO completion notified was never wired
up for blk-mq. Ensure that we trigger the callback appropriately, to arm
the laptop mode flush timer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-03 08:38:17 -06:00
weiping zhang 547248736a blk-mq: remove unused function hctx_allow_merges
since 9bddeb2a5b "blk-mq: make per-sw-queue bio merge as default .bio_merge"
there is no caller for this function.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-30 02:17:37 -06:00
Jens Axboe 157f377beb block: directly insert blk-mq request from blk_insert_cloned_request()
A NULL pointer crash was reported for the case of having the BFQ IO
scheduler attached to the underlying blk-mq paths of a DM multipath
device.  The crash occured in blk_mq_sched_insert_request()'s call to
e->type->ops.mq.insert_requests().

Paolo Valente correctly summarized why the crash occured with:
"the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
blk_rq_prep_clone) creates a cloned request without invoking
e->type->ops.mq.prepare_request for the target elevator e.  The cloned
request is therefore not initialized for the scheduler, but it is
however inserted into the scheduler by blk_mq_sched_insert_request."

All said, a request-based DM multipath device's IO scheduler should be
the only one used -- when the original requests are issued to the
underlying paths as cloned requests they are inserted directly in the
underlying dispatch queue(s) rather than through an additional elevator.

But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
schedulers") switched blk_insert_cloned_request() from using
blk_mq_insert_request() to blk_mq_sched_insert_request().  Which
incorrectly added elevator machinery into a call chain that isn't
supposed to have any.

To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
that blk_insert_cloned_request() calls to insert the request without
involving any elevator that may be attached to the cloned request's
request_queue.

Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-11 16:43:57 -06:00
Linus Torvalds a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Bart Van Assche 4d6062193b block: Fix two comments that refer to .queue_rq() return values
Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
.queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
value. Hence refer to the former in comments about .queue_rq()
return values.

Fixes: commit 39a70c76b8 ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Keith Busch 3280d66a63 blk-mq: Fix queue usage on failed request allocation
blk_mq_get_request() does not release the callers queue usage counter
when allocation fails. The caller still needs to account for its own
queue usage when it is unable to allocate a request.

Fixes: 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

Reported-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-15 07:22:34 -06:00
Bart Van Assche d4acf3650c block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time
The blk_mq_delay_kick_requeue_list() function is used by the device
mapper and only by the device mapper to rerun the queue and requeue
list after a delay. This function is called once per request that
gets requeued. Modify this function such that the queue is run once
per path change event instead of once per request that is requeued.

Fixes: commit 2849450ad3 ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 20:24:38 -06:00
Jens Axboe b8d62b3a9c blk-mq: enable checking two part inflight counts at the same time
Modify blk_mq_in_flight() to count both a partition and root at
the same time. Then we only have to call it once, instead of
potentially looping the tags twice.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:33 -06:00
Jens Axboe f299b7c7a9 blk-mq: provide internal in-flight variant
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:28 -06:00
Ming Lei 1ad43c0078 blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
When blk_mq_get_request() failed, preempt counter isn't
released, and blk_mq_make_request() doesn't release the counter
too.

This patch fixes the issue, and makes sure that preempt counter
is only held if rq is allocated successfully. The same policy is
applied on .q_usage_counter too.

Signed-off-by: Ming Lei <minlei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-02 08:23:57 -06:00
Jens Axboe b7a71e66d4 blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled
We recently had a bug in the IPR SCSI driver, where it would end up
making the SCSI mid layer run the mq hardware queue with interrupts
disabled. This isn't legal, since the software queue locking relies
on never being grabbed from interrupt context. Additionally, drivers
that set BLK_MQ_F_BLOCKING may schedule from this context.

Add a WARN_ON_ONCE() to catch bad users up front.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-01 09:28:24 -06:00
Jens Axboe 18e9781d44 blk-mq: blk_mq_requeue_work() doesn't need to save IRQ flags
We know we're in process context, so don't bother using the
IRQ safe versions of the spin lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Linus Torvalds 130568d5ea Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
 "This is a followup for block changes, that didn't make the initial
  pull request. It's a bit of a mixed bag, this contains:

   - A followup pull request from Sagi for NVMe. Outside of fixups for
     NVMe, it also includes a series for ensuring that we properly
     quiesce hardware queues when browsing live tags.

   - Set of integrity fixes from Dmitry (mostly), fixing various issues
     for folks using DIF/DIX.

   - Fix for a bug introduced in cciss, with the req init changes. From
     Christoph.

   - Fix for a bug in BFQ, from Paolo.

   - Two followup fixes for lightnvm/pblk from Javier.

   - Depth fix from Ming for blk-mq-sched.

   - Also from Ming, performance fix for mtip32xx that was introduced
     with the dynamic initialization of commands"

* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
  block: call bio_uninit in bio_endio
  nvmet: avoid unneeded assignment of submit_bio return value
  nvme-pci: add module parameter for io queue depth
  nvme-pci: compile warnings in nvme_alloc_host_mem()
  nvmet_fc: Accept variable pad lengths on Create Association LS
  nvme_fc/nvmet_fc: revise Create Association descriptor length
  lightnvm: pblk: remove unnecessary checks
  lightnvm: pblk: control I/O flow also on tear down
  cciss: initialize struct scsi_req
  null_blk: fix error flow for shared tags during module_init
  block: Fix __blkdev_issue_zeroout loop
  nvme-rdma: unconditionally recycle the request mr
  nvme: split nvme_uninit_ctrl into stop and uninit
  virtio_blk: quiesce/unquiesce live IO when entering PM states
  mtip32xx: quiesce request queues to make sure no submissions are inflight
  nbd: quiesce request queues to make sure no submissions are inflight
  nvme: kick requeue list when requeueing a request instead of when starting the queues
  nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
  ...
2017-07-11 15:36:52 -07:00
Linus Torvalds 03ffbcdd78 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq department delivers:

   - Expand the generic infrastructure handling the irq migration on CPU
     hotplug and convert X86 over to it. (Thomas Gleixner)

     Aside of consolidating code this is a preparatory change for:

   - Finalizing the affinity management for multi-queue devices. The
     main change here is to shut down interrupts which are affine to a
     outgoing CPU and reenabling them when the CPU comes online again.
     That avoids moving interrupts pointlessly around and breaking and
     reestablishing affinities for no value. (Christoph Hellwig)

     Note: This contains also the BLOCK-MQ and NVME changes which depend
     on the rework of the irq core infrastructure. Jens acked them and
     agreed that they should go with the irq changes.

   - Consolidation of irq domain code (Marc Zyngier)

   - State tracking consolidation in the core code (Jeffy Chen)

   - Add debug infrastructure for hierarchical irq domains (Thomas
     Gleixner)

   - Infrastructure enhancement for managing generic interrupt chips via
     devmem (Bartosz Golaszewski)

   - Constification work all over the place (Tobias Klauser)

   - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

   - The usual set of fixes, updates and enhancements all over the
     place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
  irqchip/or1k-pic: Fix interrupt acknowledgement
  irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
  irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
  nvme: Allocate queues for all possible CPUs
  blk-mq: Create hctx for each present CPU
  blk-mq: Include all present CPUs in the default queue mapping
  genirq: Avoid unnecessary low level irq function calls
  genirq: Set irq masked state when initializing irq_desc
  genirq/timings: Add infrastructure for estimating the next interrupt arrival time
  genirq/timings: Add infrastructure to track the interrupt timings
  genirq/debugfs: Remove pointless NULL pointer check
  irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
  irqchip/gic-v3-its: Add ACPI NUMA node mapping
  irqchip/gic-v3-its-platform-msi: Make of_device_ids const
  irqchip/gic-v3-its: Make of_device_ids const
  irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
  irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
  dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
  genirq/irqdomain: Remove auto-recursive hierarchy support
  irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
  ...
2017-07-03 16:50:31 -07:00
Dmitry Monakhov e23947bd76 bio-integrity: fold bio_integrity_enabled to bio_integrity_prep
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.

In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
	return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:24 -06:00
Linus Torvalds 9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Christoph Hellwig 4b855ad371 blk-mq: Create hctx for each present CPU
Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations.  Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-28 23:00:07 +02:00
Christoph Hellwig 46685d1a95 blk-mq: don't bounce by default
For historical reasons we default to bouncing highmem pages for all block
queues.  But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 0b0bcacc3b block: don't bother with bounce limits for make_request drivers
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Ingo Molnar 1bc3cd4dfa Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:20 +02:00
Jens Axboe f95a0d6a95 Merge commit '8e8320c9315c' into for-4.13/block
Pull in the fix for shared tags, as it conflicts with the pending
changes in for-4.13/block. We already pulled in v4.12-rc5 to solve
other conflicts or get fixes that went into 4.12, so not a lot
of changes in this merge.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22 21:55:24 -06:00
weiping a9590fe148 blk-mq: remove double set queue_num
hwctx's queue_num has been set prior call blk_mq_init_hctx, so no need
set it again.

Signed-off-by: weiping <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22 09:18:25 -06:00
Bart Van Assche 852ec80983 blk-mq: Make it safe to quiesce and unquiesce from an interrupt handler
Since blk_mq_quiesce_queue_nowait() can be called from interrupt
context, make this safe. Since this function is not in the hot
path, uninline it.

Fixes: commit f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 12:01:15 -06:00
Jens Axboe 8e8320c931 blk-mq: fix performance regression with shared tags
If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.

Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues, if we don't have to.

Max reports that this restores performance. Before this patch, 4K
IOPS was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.

Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 10:17:49 -06:00
Bart Van Assche 5435c023b9 blk-mq: Warn when attempting to run a hardware queue that is not mapped
A queue must be frozen while the mapped state of a hardware queue
is changed. Additionally, any change of the mapped state is
followed by a call to blk_mq_map_swqueue() (see also
blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
Since blk_mq_map_swqueue() does not map any unmapped hardware
queue onto any software queue, no attempt will be made to run
an unmapped hardware queue. Hence issue a warning upon attempts
to run an unmapped hardware queue.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 7b6078146c blk-mq: Document locking assumptions
Document the locking assumptions in functions that modify
blk_mq_ctx.rq_list to make it easier for humans to verify
this code.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche c3a148d20a blk-mq: Initialize .rq_flags in blk_mq_rq_ctx_init()
Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
is called after a value has been assigned to .rq_flags and .rq_flags
is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
Moving the initialization of .rq_flags is fine because all changes
and tests of .rq_flags occur between blk_get_request() and finishing
a request.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche cd6ce1482f block: Make request operation type argument declarations consistent
Instead of declaring the second argument of blk_*_get_request()
as int and passing it to functions that expect an unsigned int,
declare that second argument as unsigned int. Also because of
consistency, rename that second argument from 'rw' into 'op'.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 0731967877 blk-mq: Reduce blk_mq_hw_ctx size
Since the srcu structure is rather large (184 bytes on an x86-64
system with kernel debugging disabled), only allocate it if needed.

Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Goldwyn Rodrigues 03a07c92a9 block: return on congested block device
A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Ingo Molnar 2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Ming Lei 641a9ed60f Revert "blk-mq: don't use sync workqueue flushing from drivers"
This patch reverts commit 2719aa217e0d02(blk-mq: don't use
sync workqueue flushing from drivers) because only
blk_mq_quiesce_queue() need the sync flush, and now
we don't need to stop queue any more, so revert it.

Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue().

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00
Ming Lei 39a70c76b8 blk-mq: clarify dispatch may not be drained/blocked by stopping queue
BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths,
we can't guarantee that dispatching won't happen after returning
from the APIs of stopping queue.

So clarify the fact and avoid potential misuse.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00
Ming Lei 1d9e9bc6b5 blk-mq: don't stop queue for quiescing
Queue can be started by other blk-mq APIs and can be used in
different cases, this limits uses of blk_mq_quiesce_queue()
if it is based on stopping queue, and make its usage very
difficult, especially users have to use the stop queue APIs
carefully for avoiding to break blk_mq_quiesce_queue().

We have applied the QUIESCED flag for draining and blocking
dispatch, so it isn't necessary to stop queue any more.

After stopping queue is removed, blk_mq_quiesce_queue() can
be used safely and easily, then users won't worry about queue
restarting during quiescing at all.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00
Ming Lei 69e07c4adb blk-mq: update comments on blk_mq_quiesce_queue()
Actually what we want to get from blk_mq_quiesce_queue()
isn't only to wait for completion of all ongoing .queue_rq().

In the typical context of canceling requests, we need to
make sure that the following is done in the dispatch path
before starting to cancel requests:

	- failed dispatched request is finished
	- busy dispatched request is requeued, and the STARTED
	flag is cleared

So update comment to keep code, doc and our expection consistent.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00
Ming Lei f4560ffe8c blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue
It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.

But blk_mq_quiesce_queue() still may not block/drain dispatch in the
the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.

Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses, and easy to cause race, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:27 -06:00
Ming Lei e4e739131a blk-mq: introduce blk_mq_unquiesce_queue
blk_mq_start_stopped_hw_queues() is used implictly
as counterpart of blk_mq_quiesce_queue() for unquiescing queue,
so we introduce blk_mq_unquiesce_queue() and make it
as counterpart of blk_mq_quiesce_queue() explicitly.

This function is for improving the current quiescing mechanism
in the following patches.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:20:04 -06:00
NeilBrown af67c31fba blk: remove bio_set arg from blk_queue_split()
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Christoph Hellwig e4cdf1a1cb blk-mq: remove __blk_mq_alloc_request
Move most code into blk_mq_rq_ctx_init, and the rest into
blk_mq_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 5bbf4e5a8e blk-mq-sched: unify request prepare methods
This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers.  For Kyber and additional limit_depth method
is added that is called before allocating the request.

Also because none of the intializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.

Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 44e8c2bff8 blk-mq: refactor blk_mq_sched_assign_ioc
blk_mq_sched_assign_ioc now only handles the assigned of the ioc if
the schedule needs it (bfq only at the moment).  The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 037cebb85b blk-mq: streamline blk_mq_get_request
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 6af54051a0 blk-mq: simplify blk_mq_free_request
Merge three functions only tail-called by blk_mq_free_request into
blk_mq_free_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 7b9e936163 blk-mq-sched: unify request finished methods
No need to have two different callouts of bfq vs kyber.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig ea511e3c28 blk-mq: remove blk_mq_sched_{get,put}_rq_priv
Having these as separate helpers in a header really does not help
readability, or my chances to refactor this code sanely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig d2c0d38324 blk-mq: move blk_mq_sched_{get,put}_request to blk-mq.c
Having them out of line in blk-mq-sched.c just makes the code flow
unnecessarily complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Christoph Hellwig 6e15cf2a0b blk-mq: mark blk_mq_rq_ctx_init static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 10:08:55 -06:00
Jens Axboe f06345add9 blk-mq: fixup type of 'ret' in __blk_mq_try_issue_directly()
Should be a blk_status_t type, not an integer.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 11:22:46 -06:00
Jens Axboe 8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Christoph Hellwig fc17b6534e blk-mq: switch ->queue_rq return value to blk_status_t
Use the same values for use for request completion errors as the return
value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig 2a842acab1 block: introduce new block status code type
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings.  This patch
instead introduces a new  blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning.  Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Ming Lei d964f04a8f blk-mq: fix direct issue
If queue is stopped, we shouldn't dispatch request into driver and
hardware, unfortunately the check is removed in bd166ef183c2(blk-mq-sched:
add framework for MQ capable IO schedulers).

This patch fixes the issue by moving the check back into
__blk_mq_try_issue_directly().

This patch fixes request use-after-free[1][2] during canceling requets
of NVMe in nvme_dev_disable(), which can be triggered easily during
NVMe reset & remove test.

[1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
[  103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[  103.412980] IP: bio_integrity_advance+0x48/0xf0
[  103.412981] PGD 275a88067
[  103.412981] P4D 275a88067
[  103.412982] PUD 276c43067
[  103.412983] PMD 0
[  103.412984]
[  103.412986] Oops: 0000 [#1] SMP
[  103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[  103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
[  103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[  103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
[  103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
[  103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
[  103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
[  103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
[  103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
[  103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
[  103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
[  103.413053] FS:  0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
[  103.413054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
[  103.413056] Call Trace:
[  103.413063]  bio_advance+0x2a/0xe0
[  103.413067]  blk_update_request+0x76/0x330
[  103.413072]  blk_mq_end_request+0x1a/0x70
[  103.413074]  blk_mq_dispatch_rq_list+0x370/0x410
[  103.413076]  ? blk_mq_flush_busy_ctxs+0x94/0xe0
[  103.413080]  blk_mq_sched_dispatch_requests+0x173/0x1a0
[  103.413083]  __blk_mq_run_hw_queue+0x8e/0xa0
[  103.413085]  __blk_mq_delay_run_hw_queue+0x9d/0xa0
[  103.413088]  blk_mq_start_hw_queue+0x17/0x20
[  103.413090]  blk_mq_start_hw_queues+0x32/0x50
[  103.413095]  nvme_kill_queues+0x54/0x80 [nvme_core]
[  103.413097]  nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
[  103.413103]  process_one_work+0x149/0x360
[  103.413105]  worker_thread+0x4d/0x3c0
[  103.413109]  kthread+0x109/0x140
[  103.413111]  ? rescuer_thread+0x380/0x380
[  103.413113]  ? kthread_park+0x60/0x60
[  103.413120]  ret_from_fork+0x2c/0x40
[  103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
[  103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
[  103.413146] CR2: 000000000000000a
[  103.413157] ---[ end trace cd6875d16eb5a11e ]---
[  103.455368] Kernel panic - not syncing: Fatal exception
[  103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  103.850916] ---[ end Kernel panic - not syncing: Fatal exception
[  103.857637] sched: Unexpected reschedule of offline CPU#1!
[  103.863762] ------------[ cut here ]------------

[2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
[  247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
[  247.137311]       Not tainted 4.12.0-rc2.upstream+ #4
[  247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.151704] Call Trace:
[  247.154445]  __schedule+0x28a/0x880
[  247.158341]  schedule+0x36/0x80
[  247.161850]  blk_mq_freeze_queue_wait+0x4b/0xb0
[  247.166913]  ? remove_wait_queue+0x60/0x60
[  247.171485]  blk_freeze_queue+0x1a/0x20
[  247.175770]  blk_cleanup_queue+0x7f/0x140
[  247.180252]  nvme_ns_remove+0xa3/0xb0 [nvme_core]
[  247.185503]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
[  247.191532]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[  247.196977]  nvme_remove+0x70/0x110 [nvme]
[  247.201545]  pci_device_remove+0x39/0xc0
[  247.205927]  device_release_driver_internal+0x141/0x200
[  247.211761]  device_release_driver+0x12/0x20
[  247.216531]  pci_stop_bus_device+0x8c/0xa0
[  247.221104]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  247.227420]  remove_store+0x7c/0x90
[  247.231320]  dev_attr_store+0x18/0x30
[  247.235409]  sysfs_kf_write+0x3a/0x50
[  247.239497]  kernfs_fop_write+0xff/0x180
[  247.243867]  __vfs_write+0x37/0x160
[  247.247757]  ? selinux_file_permission+0xe5/0x120
[  247.253011]  ? security_file_permission+0x3b/0xc0
[  247.258260]  vfs_write+0xb2/0x1b0
[  247.261964]  ? syscall_trace_enter+0x1d0/0x2b0
[  247.266924]  SyS_write+0x55/0xc0
[  247.270540]  do_syscall_64+0x67/0x150
[  247.274636]  entry_SYSCALL64_slow_path+0x25/0x25
[  247.279794] RIP: 0033:0x7f5c96740840
[  247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
[  247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
[  247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
[  247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
[  247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[  370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.

Fixes: 12d70958a2e8(blk-mq: don't fail allocating driver tag for stopped hw queue)
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-06 10:00:35 -06:00
Ming Lei dad7a3be49 blk-mq: pass correct hctx to blk_mq_try_issue_directly
When direct issue is done on request picked up from plug list,
the hctx need to be updated with the actual hw queue, otherwise
wrong hctx is used and may hurt performance, especially when
wrong SRCU readlock is acquired/released

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-06 10:00:33 -06:00
Keith Busch e4dc2b32df blk-mq: Take tagset lock when updating hw queues
The tagset lock needs to be held when iterating the tag_list, so a
lockdep assert was added when updating number of hardware queues. The
drivers calling this API, however, were unaware of the new requirement,
so are failing the assertion.

This patch takes the lock within the blk-mq function so the drivers do
not have to be modified in order to be safe.

Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-30 12:32:37 -06:00
Ming Lei 9bddeb2a5b blk-mq: make per-sw-queue bio merge as default .bio_merge
Because what the per-sw-queue bio merge does is basically same with
scheduler's .bio_merge(), this patch makes per-sw-queue bio merge
as the default .bio_merge if no scheduler is used or io scheduler
doesn't provide .bio_merge().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-26 14:12:04 -06:00
Ming Lei ab42f35d9c blk-mq: merge bio into sw queue before plugging
Before blk-mq is introduced, I/O is merged to elevator
before being putted into plug queue, but blk-mq changed the
order and makes merging to sw queue basically impossible.
Then it is observed that throughput of sequential I/O is degraded
about 10%~20% on virtio-blk in the test[1] if mq-deadline isn't used.

This patch moves the bio merging per sw queue before plugging,
like what blk_queue_bio() does, and the performance regression is
fixed under this situation.

[1]. test script:
sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json

RW=read or write

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-26 14:12:03 -06:00
Ming Lei 7254a50a5d blk-mq: remove blk_mq_abort_requeue_list()
No one uses it any more, so remove it.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-22 20:50:11 +02:00
Wen Xiong f36ea50ca0 blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.

Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error

We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.

1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
	dsmgmt=0x0, AT=0x0 & RT=0x505
	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828

2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
	AT=0x0 & RT=0x605  ==> This op fails and subsequent 5 retires..
	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828

With the fix, It showed us both of the first op and the second op have
correct bi and pi address.

1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
	dsmgmt=0x0, AT=0x0 & RT=0x505
	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
	0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
	AT=0x0 & RT=0x605
	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
	0x00003028

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-10 08:02:58 -06:00
Colin Ian King ebd7685795 blk-mq: make __blk_mq_stop_hw_queues static
Making __blk_mq_stop_hw_queues static fixes sparse warning:

  block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
  declared. Should it be static?

Fixes: 2719aa217e ("blk-mq: don't use sync workqueue flushing from drivers")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-08 08:09:39 -06:00
Wanpeng Li 51d638b1f5 block/mq: fix potential deadlock during cpu hotplug
This can be triggered by hot-unplug one cpu.

======================================================
 [ INFO: possible circular locking dependency detected ]
 4.11.0+ #17 Not tainted
 -------------------------------------------------------
 step_after_susp/2640 is trying to acquire lock:
  (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110

 but task is already holding lock:
  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (cpu_hotplug.lock){+.+.+.}:
        lock_acquire+0x11c/0x230
        __mutex_lock+0x92/0x990
        mutex_lock_nested+0x1b/0x20
        get_online_cpus+0x64/0x80
        blk_mq_init_allocated_queue+0x3a0/0x4e0
        blk_mq_init_queue+0x3a/0x60
        loop_add+0xe5/0x280
        loop_init+0x124/0x177
        do_one_initcall+0x53/0x1c0
        kernel_init_freeable+0x1e3/0x27f
        kernel_init+0xe/0x100
        ret_from_fork+0x31/0x40

 -> #0 (all_q_mutex){+.+...}:
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        __mutex_lock+0x92/0x990
        mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        do_syscall_64+0x8f/0x710
        return_from_SYSCALL_64+0x0/0x7a

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(cpu_hotplug.lock);
                                lock(all_q_mutex);
                                lock(cpu_hotplug.lock);
   lock(all_q_mutex);

  *** DEADLOCK ***

 8 locks held by step_after_susp/2640:
  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
  #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
  #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
  #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
  #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
  #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
  #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0

 stack backtrace:
 CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
 Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
 Call Trace:
  dump_stack+0x99/0xce
  print_circular_bug+0x1fa/0x270
  __lock_acquire+0x189a/0x18a0
  lock_acquire+0x11c/0x230
  ? lock_acquire+0x11c/0x230
  ? blk_mq_queue_reinit_work+0x18/0x110
  ? blk_mq_queue_reinit_work+0x18/0x110
  __mutex_lock+0x92/0x990
  ? blk_mq_queue_reinit_work+0x18/0x110
  ? kmem_cache_free+0x2cb/0x330
  ? anon_transport_class_unregister+0x20/0x20
  ? blk_mq_queue_reinit_work+0x110/0x110
  mutex_lock_nested+0x1b/0x20
  ? mutex_lock_nested+0x1b/0x20
  blk_mq_queue_reinit_work+0x18/0x110
  blk_mq_queue_reinit_dead+0x1c/0x20
  cpuhp_invoke_callback+0x1f2/0x810
  ? __flow_cache_shrink+0x160/0x160
  cpuhp_down_callbacks+0x42/0x80
  _cpu_down+0xb2/0xe0
  freeze_secondary_cpus+0xb6/0x390
  suspend_devices_and_enter+0x3b3/0xa40
  ? rcu_read_lock_sched_held+0x79/0x80
  pm_suspend+0x129/0x490
  state_store+0x82/0xf0
  kobj_attr_store+0xf/0x20
  sysfs_kf_write+0x45/0x60
  kernfs_fop_write+0x135/0x1c0
  __vfs_write+0x37/0x160
  ? rcu_read_lock_sched_held+0x79/0x80
  ? rcu_sync_lockdep_assert+0x2f/0x60
  ? __sb_start_write+0xd9/0x1c0
  ? vfs_write+0x1ad/0x1d0
  vfs_write+0xcd/0x1d0
  SyS_write+0x58/0xc0
  ? rcu_read_lock_sched_held+0x79/0x80
  do_syscall_64+0x8f/0x710
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL64_slow_path+0x25/0x25

The cpu hotplug path will hold cpu_hotplug.lock and then reinit all exiting
queues for blk mq w/ all_q_mutex, however, blk_mq_init_allocated_queue() will
contend these two locks in the inversion order. This is due to commit eabe06595d
(blk/mq: Cure cpu hotplug lock inversion), it fixes a cpu hotplug lock inversion
issue because of hotplug rework, however the hotplug rework is still work-in-progress
and lives in a -tip branch and mainline cannot yet trigger that splat. The commit
breaks the linus's tree in the merge window, so this patch reverts the lock order
and avoids to splat linus's tree.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-07 19:50:11 -06:00
Omar Sandoval 9c1051aacd blk-mq: untangle debugfs and sysfs
Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 08:24:13 -06:00
Peter Zijlstra eabe06595d block/mq: Cure cpu hotplug lock inversion
By poking at /debug/sched_features I triggered the following splat:

 [] ======================================================
 [] WARNING: possible circular locking dependency detected
 [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
 [] ------------------------------------------------------
 [] bash/2109 is trying to acquire lock:
 []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
 []
 [] but task is already holding lock:
 []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
 []
 [] which lock already depends on the new lock.
 []
 []
 [] the existing dependency chain (in reverse order) is:
 []
 [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
 []        lock_acquire+0x100/0x210
 []        down_write+0x28/0x60
 []        start_creating+0x5e/0xf0
 []        debugfs_create_dir+0x13/0x110
 []        blk_mq_debugfs_register+0x21/0x70
 []        blk_mq_register_dev+0x64/0xd0
 []        blk_register_queue+0x6a/0x170
 []        device_add_disk+0x22d/0x440
 []        loop_add+0x1f3/0x280
 []        loop_init+0x104/0x142
 []        do_one_initcall+0x43/0x180
 []        kernel_init_freeable+0x1de/0x266
 []        kernel_init+0xe/0x100
 []        ret_from_fork+0x31/0x40
 []
 [] -> #1 (all_q_mutex){+.+.+.}:
 []        lock_acquire+0x100/0x210
 []        __mutex_lock+0x6c/0x960
 []        mutex_lock_nested+0x1b/0x20
 []        blk_mq_init_allocated_queue+0x37c/0x4e0
 []        blk_mq_init_queue+0x3a/0x60
 []        loop_add+0xe5/0x280
 []        loop_init+0x104/0x142
 []        do_one_initcall+0x43/0x180
 []        kernel_init_freeable+0x1de/0x266
 []        kernel_init+0xe/0x100
 []        ret_from_fork+0x31/0x40

 []  *** DEADLOCK ***
 []
 [] 3 locks held by bash/2109:
 []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
 []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
 []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
 []
 [] stack backtrace:
 [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
 [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
 [] Call Trace:

 []  lock_acquire+0x100/0x210
 []  get_online_cpus+0x2a/0x90
 []  static_key_slow_dec+0x1b/0x50
 []  static_key_disable+0x20/0x30
 []  sched_feat_write+0x131/0x170
 []  full_proxy_write+0x97/0xd0
 []  __vfs_write+0x28/0x120
 []  vfs_write+0xb5/0x1a0
 []  SyS_write+0x49/0xa0
 []  entry_SYSCALL_64_fastpath+0x23/0xc2

This is because of the cpu hotplug lock rework. Break the chain at #1
by reversing the lock acquisition order. This way i_mutex_key#4 no
longer depends on cpu_hotplug_lock and things are good.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 07:57:13 -06:00
Jens Axboe 2719aa217e blk-mq: don't use sync workqueue flushing from drivers
A previous commit introduced the sync flush, which we need from
internal callers like blk_mq_quiesce_queue(). However, we also
call the stop helpers from drivers, particularly from ->queue_rq()
when we have to stop processing for a bit. We can't block from
those locations, and we don't have to guarantee that we're
fully flushed.

Fixes: 9f99373790 ("blk-mq: unify hctx delayed_run_work and run_work")
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-03 11:44:43 -06:00
Ming Lei 7a148c2fcf block: don't call blk_mq_quiesce_queue() after queue is frozen
After queue is frozen, no request in this queue can be in use at all, so
there can't be any .queue_rq() running on this queue.  It isn't
necessary to call blk_mq_quiesce_queue() any more, so remove it in both
elevator_switch_mq() and blk_mq_update_nr_requests().

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Fixed up the description a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-02 11:33:08 -06:00
Christoph Hellwig d6296d39e9 blk-mq: update ->init_request and ->exit_request prototypes
Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq.  Except for a superflous check in
mtip32xx it was unused anyway.

Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-02 07:52:08 -06:00
Linus Torvalds 694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Jens Axboe 21c6e939a9 blk-mq: unify hctx delay_work and run_work
The only difference between ->run_work and ->delay_work, is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.

If we combine the handlers, then we need to handle two things:

1) If a delayed/stopped run is scheduled, then we should not run
   the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
   the queue. Normally a run of a queue with the stopped bit set
   would be a no-op.

Case 1 is handled by modifying a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to queue a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.

Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-28 08:11:43 -06:00
Jens Axboe 9f99373790 blk-mq: unify hctx delayed_run_work and run_work
They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-28 08:10:15 -06:00
Bart Van Assche abc25a6930 blk-mq: Fix preempt count imbalance
Avoid that the following kernel bug gets triggered:

BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
CPU: 10 PID: 8019 Comm: find Tainted: G        W I     4.11.0-rc4-dbg+ #2
Call Trace:
 dump_stack+0x68/0x93
 ___might_sleep+0x16e/0x230
 __might_sleep+0x4a/0x80
 __ext4_get_inode_loc+0x1e0/0x4e0
 ext4_iget+0x70/0xbc0
 ext4_iget_normal+0x2f/0x40
 ext4_lookup+0xb6/0x1f0
 lookup_slow+0x104/0x1e0
 walk_component+0x19a/0x330
 path_lookupat+0x4b/0x100
 filename_lookup+0x9a/0x110
 user_path_at_empty+0x36/0x40
 vfs_statx+0x67/0xc0
 SYSC_newfstatat+0x20/0x40
 SyS_newfstatat+0xe/0x10
 entry_SYSCALL_64_fastpath+0x18/0xad

This happens since the big if/else in blk_mq_make_request() doesn't
have final else section that also drops the ctx. Add that.

Fixes: b00c53e8f4 ("blk-mq: fix schedule-while-atomic with scheduler attached")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>

Added a bit more to the commit log.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-21 12:00:40 -06:00
Jens Axboe 99c749a4c4 blk-stat: kill blk_stat_rq_ddir()
No point in providing and exporting this helper. There's just
one (real) user of it, just use rq_data_dir().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-21 07:56:23 -06:00
Jens Axboe 5feeacdd4a blk-mq: add might_sleep check to blk_mq_get_driver_tag()
If the caller passes in wait=true, it has to be able to block
for a driver tag. We just had a bug where flush insertion
would block on tag allocation, while we had preempt disabled.
Ensure that we catch cases like that earlier next time.

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 17:23:13 -06:00
Stephen Bates 0206319fdf blk-mq: Fix poll_stat for new size-based bucketing.
Fixes an issue where the size of the poll_stat array in request_queue
does not match the size expected by the new size based bucketing for
IO completion polling.

Fixes: 720b8ccc45 ("blk-mq: Add a polling specific stats function")
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 17:10:48 -06:00
Jens Axboe b00c53e8f4 blk-mq: fix schedule-while-atomic with scheduler attached
We must have dropped the ctx before we call
blk_mq_sched_insert_request() with can_block=true, otherwise we risk
that a flush request can block on insertion if we are currently out of
tags.

[   47.667190] BUG: scheduling while atomic: jbd2/sda2-8/2089/0x00000002
[   47.674493] Modules linked in: x86_pkg_temp_thermal btrfs xor zlib_deflate raid6_pq sr_mod cdre
[   47.690572] Preemption disabled at:
[   47.690584] [<ffffffff81326c7c>] blk_mq_sched_get_request+0x6c/0x280
[   47.701764] CPU: 1 PID: 2089 Comm: jbd2/sda2-8 Not tainted 4.11.0-rc7+ #271
[   47.709630] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[   47.718081] Call Trace:
[   47.720903]  dump_stack+0x4f/0x73
[   47.724694]  ? blk_mq_sched_get_request+0x6c/0x280
[   47.730137]  __schedule_bug+0x6c/0xc0
[   47.734314]  __schedule+0x559/0x780
[   47.738302]  schedule+0x3b/0x90
[   47.741899]  io_schedule+0x11/0x40
[   47.745788]  blk_mq_get_tag+0x167/0x2a0
[   47.750162]  ? remove_wait_queue+0x70/0x70
[   47.754901]  blk_mq_get_driver_tag+0x92/0xf0
[   47.759758]  blk_mq_sched_insert_request+0x134/0x170
[   47.765398]  ? blk_account_io_start+0xd0/0x270
[   47.770679]  blk_mq_make_request+0x1b2/0x850
[   47.775766]  generic_make_request+0xf7/0x2d0
[   47.780860]  submit_bio+0x5f/0x120
[   47.784979]  ? submit_bio+0x5f/0x120
[   47.789631]  submit_bh_wbc.isra.46+0x10d/0x130
[   47.794902]  submit_bh+0xb/0x10
[   47.798719]  journal_submit_commit_record+0x190/0x210
[   47.804686]  ? _raw_spin_unlock+0x13/0x30
[   47.809480]  jbd2_journal_commit_transaction+0x180a/0x1d00
[   47.815925]  kjournald2+0xb6/0x250
[   47.820022]  ? kjournald2+0xb6/0x250
[   47.824328]  ? remove_wait_queue+0x70/0x70
[   47.829223]  kthread+0x10e/0x140
[   47.833147]  ? commit_timeout+0x10/0x10
[   47.837742]  ? kthread_create_on_node+0x40/0x40
[   47.843122]  ret_from_fork+0x29/0x40

Fixes: a4d907b6a3 ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 16:42:02 -06:00
Stephen Bates 720b8ccc45 blk-mq: Add a polling specific stats function
Rather than bucketing IO statisics based on direction only we also
bucket based on the IO size. This leads to improved polling
performance. Update the bucket callback function and use it in the
polling latency estimation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 15:29:40 -06:00
Jens Axboe 3a07bb1d76 blk-mq: fix potential oops with polling and blk-mq scheduler
If we have a scheduler attached, blk_mq_tag_to_rq() on the
scheduled tags will return NULL if a request is no longer
in flight. This is different than using the normal tags,
where it will always return the fixed request. Check for
this condition for polling, in case we happen to enter
polling for a completed request.

The request address remains valid, so this check and return
should be perfectly safe.

Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Tested-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 14:53:28 -06:00
Christoph Hellwig caf7df1227 block: remove the errors field from struct request
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig 453f83418d blk-mq: simplify __blk_mq_complete_request
Merge blk_mq_ipi_complete_request and blk_mq_stat_add into their only
caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig 08e0029aa2 blk-mq: remove the error argument to blk_mq_complete_request
Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Bart Van Assche da8d7f079b block: Export blk_init_request_from_bio()
Export this function such that it becomes available to block
drivers.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19 17:38:30 -06:00
Omar Sandoval c05f8525f6 blk-mq-sched: make completed_request() callback more useful
Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14 14:06:57 -06:00
Omar Sandoval 5b72727299 blk-mq: export helpers
blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-14 14:06:55 -06:00
Bart Van Assche 710c785f80 blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:49 -06:00
Bart Van Assche 705cda97ee blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:47 -06:00
Omar Sandoval d945a365a0 blk-mq: use true instead of 1 for blk_mq_queue_data.last
Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:45 -06:00
Omar Sandoval 807b10417b blk-mq: make driver tag failure path easier to follow
Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:43 -06:00
Jens Axboe 65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Bart Van Assche 6d8c6c0f97 blk-mq: Restart a single queue if tag sets are shared
To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:40:09 -06:00
Bart Van Assche 7587a5ae7e blk-mq: Introduce blk_mq_delay_run_hw_queue()
Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:27:06 -06:00
Omar Sandoval ebe8bddb6e blk-mq: remap queues when adding/removing hardware queues
blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a01142 changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a01142 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:49 -06:00
Omar Sandoval 54d5329d42 blk-mq-sched: fix crash in switch error path
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:48 -06:00
Omar Sandoval 93252632e8 blk-mq-sched: set up scheduler tags when bringing up new queues
If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:46 -06:00
Omar Sandoval 81380ca107 blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:26 -06:00
Bart Van Assche f2fbc9dd78 blk-mq: Remove blk_mq_queue_data.list
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 09:40:15 -06:00
Jens Axboe bf4907c05e blk-mq: fix schedule-under-preempt for blocking drivers
Commit a4d907b6a3 unified the single and multi queue request handlers,
but in the process, it also screwed up the locking balance and calls
blk_mq_try_issue_directly() with the ctx preempt lock held. This is a
problem for drivers that have set BLK_MQ_F_BLOCKING, since now they
can't reliably sleep.

While in there, protect against similar issues in the future, by adding
a might_sleep() trigger in the BLOCKING path for direct issue or queue
run.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Josef Bacik <josef@toxicpanda.com>
Fixes: a4d907b6a3 ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 12:30:39 -06:00
Minchan Kim ac77a0c463 block: do not put mq context in blk_mq_alloc_request_hctx
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Unless, we will see preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 08:13:23 -06:00
Minchan Kim 3e06eb3dac block: do not put mq context in blk_mq_alloc_request_hctx
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Unless, we will see preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30 08:13:05 -06:00
Jens Axboe 3e8a7069b9 blk-mq: include errors in did_work calculation
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return.  If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Tested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 13:21:13 -06:00
Josef Bacik b58e176914 block-mq: don't re-queue if we get a queue error
When try to issue a request directly and we fail we will requeue the
request, but call blk_mq_end_request() as well.  This leads to the
completed request being on a queuelist and getting ended twice, which
causes list corruption in schedulers and other shenanigans.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 13:18:18 -06:00
Jens Axboe 48b99c9d65 blk-mq: fix a typo and a spelling mistake
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 11:10:34 -06:00
Omar Sandoval 0c9539a431 blk-mq: fix leak of q->stats
blk_alloc_queue_node() already allocates q->stats, so
blk_mq_init_allocated_queue() is overwriting it with a new allocation.

Fixes: a83b576c9c ("block: fix stacked driver stats init and free")
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 08:09:08 -06:00
Ming Lei 1671d522cd block: rename blk_mq_freeze_queue_start()
As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().

So rename it and we can use this function in both
paths.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 08:03:42 -06:00
Ming Lei d9d149a396 blk-mq: comment on races related with timeout handler
This patch adds comment on two races related with
timeout handler:

- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout

Both the races themselves and current solution aren't
explicit enough, so add comments on them.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 08:03:42 -06:00
Ming Lei a4ef8e566f blk-mq: don't complete un-started request in timeout handler
When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[  581.789754] ------------[ cut here ]------------
[  581.789758] kernel BUG at block/blk-mq.c:374!
[  581.789760] invalid opcode: 0000 [#1] SMP
[  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  581.789804] Workqueue: kblockd blk_mq_timeout_work
[  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[  581.789820] Call Trace:
[  581.789825]  blk_mq_check_expired+0x76/0x80
[  581.789828]  bt_iter+0x45/0x50
[  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789840]  ? __switch_to+0x140/0x450
[  581.789841]  blk_mq_timeout_work+0x88/0x170
[  581.789845]  process_one_work+0x165/0x410
[  581.789847]  worker_thread+0x137/0x4c0
[  581.789851]  kthread+0x101/0x140
[  581.789853]  ? rescuer_thread+0x3b0/0x3b0
[  581.789855]  ? kthread_park+0x90/0x90
[  581.789860]  ret_from_fork+0x2c/0x40
[  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451]  ? mempool_free+0x2b/0x80
[ 6984.857455]  ? bio_free+0x4e/0x60
[ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
[ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
[ 6984.857473]  process_one_work+0x165/0x410
[ 6984.857475]  worker_thread+0x137/0x4c0
[ 6984.857478]  kthread+0x101/0x140
[ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481]  ? kthread_park+0x90/0x90
[ 6984.857489]  ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 08:03:38 -06:00
Shaohua Li 88eeca495b block: track request size in blk_issue_stat
Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
Jens Axboe 93efe9817e blk-mq: include errors in did_work calculation
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return.  If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.

Tested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-24 15:42:47 -06:00
Christoph Hellwig a4d907b6a3 blk-mq: streamline blk_mq_make_request
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:17:03 -06:00
Christoph Hellwig 2299722c4b blk-mq: split the plug and sync cases in blk_mq_make_request
Now that we have a nice direct issue heper this helps simplifying
the code a bit, and also gets rid of the old_rq variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:17:01 -06:00
Christoph Hellwig 5eb6126e1c blk-mq: improve blk_mq_try_issue_directly
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:59 -06:00
Christoph Hellwig 254d259da0 blk-mq: merge mq and sq make_request instances
They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:58 -06:00
Christoph Hellwig 7642747d67 blk-mq: remove BLK_MQ_F_DEFER_ISSUE
This flag was never used since it was introduced.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 20:16:56 -06:00
Ming Lei 95a4960370 blk-mq: don't complete un-started request in timeout handler
When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[  581.789754] ------------[ cut here ]------------
[  581.789758] kernel BUG at block/blk-mq.c:374!
[  581.789760] invalid opcode: 0000 [#1] SMP
[  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  581.789804] Workqueue: kblockd blk_mq_timeout_work
[  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[  581.789820] Call Trace:
[  581.789825]  blk_mq_check_expired+0x76/0x80
[  581.789828]  bt_iter+0x45/0x50
[  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789840]  ? __switch_to+0x140/0x450
[  581.789841]  blk_mq_timeout_work+0x88/0x170
[  581.789845]  process_one_work+0x165/0x410
[  581.789847]  worker_thread+0x137/0x4c0
[  581.789851]  kthread+0x101/0x140
[  581.789853]  ? rescuer_thread+0x3b0/0x3b0
[  581.789855]  ? kthread_park+0x90/0x90
[  581.789860]  ret_from_fork+0x2c/0x40
[  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451]  ? mempool_free+0x2b/0x80
[ 6984.857455]  ? bio_free+0x4e/0x60
[ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
[ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
[ 6984.857473]  process_one_work+0x165/0x410
[ 6984.857475]  worker_thread+0x137/0x4c0
[ 6984.857478]  kthread+0x101/0x140
[ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481]  ? kthread_park+0x90/0x90
[ 6984.857489]  ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-22 08:03:35 -06:00
Omar Sandoval 34dbad5d26 blk-stat: convert to callback-based statistics reporting
Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
   statistics. Since the blk-stats statistics are reset after every
   window and the wbt windows don't line up with the blk-stats windows,
   wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
   on how the window lines up, we may miss some I/Os. It's also
   unnecessary overhead to get the statistics on every I/O; the hybrid
   polling heuristic would be just as happy with the statistics from the
   previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.

The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 10:03:11 -06:00
Omar Sandoval fa2e39cb9e blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE}
The stats buckets will become generic soon, so make the existing users
use the common READ and WRITE definitions instead of one internal to
blk-stat.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 10:03:08 -06:00
Omar Sandoval 0315b15908 block: remove extra calls to wbt_exit()
We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-21 10:03:08 -06:00
Jens Axboe 9c62110454 blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
If we have scheduling enabled, we jump directly to insert-and-run.
That's fine, but we run the queue async and we don't pass in information
on whether we can block from this context or not. Fixup both these
cases.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-14 11:51:59 -06:00
Ming Lei 01388df376 blk-mq: free hctx->cpumask in release handler of hctx's kobject
It is obviously that hctx->cpumask is per hctx, and both
share same lifetime, so this patch moves freeing of hctx->cpumask
into release handler of hctx's kobject.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 09:56:12 -07:00
Ming Lei 6c8b232efe blk-mq: make lifetime consistent between hctx and its kobject
This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
and trys to keep lifetime consistent between hctx and hctx's kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and kobject's refcounter drops to zero just
when the hctx is freed.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 09:56:12 -07:00
Ming Lei 7ea5fe31c1 blk-mq: make lifetime consitent between q/ctx and its kobject
Currently from kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, sw queue's lifetime is
same with its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for the two kinds of
kobject in __blk_mq_unregister_dev(), instead we do that
in release handler of request queue.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 09:56:12 -07:00
Ming Lei 737f98cfe7 blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()
Both q->mq_kobj and sw queues' kobjects should have been initialized
once, instead of doing that each add_disk context.

Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
because percpu allocator fills zero to allocated variable.

This patch fixes one issue[1] reported from Omar.

[1] kernel wearning when doing unbind/bind on one scsi-mq device

[   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
[   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
[   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[   19.350920] Workqueue: events_unbound async_run_entry_fn
[   19.350920] Call Trace:
[   19.350920]  dump_stack+0x63/0x83
[   19.350920]  kobject_init+0x77/0x90
[   19.350920]  blk_mq_register_dev+0x40/0x130
[   19.350920]  blk_register_queue+0xb6/0x190
[   19.350920]  device_add_disk+0x1ec/0x4b0
[   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
[   19.350920]  async_run_entry_fn+0x48/0x150
[   19.350920]  process_one_work+0x1d0/0x480
[   19.350920]  worker_thread+0x48/0x4e0
[   19.350920]  kthread+0x101/0x140
[   19.350920]  ? process_one_work+0x480/0x480
[   19.350920]  ? kthread_create_on_node+0x60/0x60
[   19.350920]  ret_from_fork+0x2c/0x40

Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-08 09:56:12 -07:00
Linus Torvalds e0d072250a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A collection of fixes for this merge window, either fixes for existing
  issues, or parts that were waiting for acks to come in. This pull
  request contains:

   - Allocation of nvme queues on the right node from Shaohua.

     This was ready long before the merge window, but waiting on an ack
     from Bjorn on the PCI bit. Now that we have that, the three patches
     can go in.

   - Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
     request allocations. This caused an oops. One part from Sagi, one
     part from Omar.

   - A loop partition scan deadlock fix from Omar, fixing a regression
     in this merge window.

   - A three-patch series from Keith, closing up a hole on clearing out
     requests on shutdown/resume.

   - A stable fix for nbd from Josef, fixing a leak of sockets.

   - Two fixes for a regression in this window from Jan, fixing a
     problem with one of his earlier patches dealing with queue vs bdi
     life times.

   - A fix for a regression with virtio-blk, causing an IO stall if
     scheduling is used. From me.

   - A fix for an io context lock ordering problem. From me"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: Move bdi_unregister() to del_gendisk()
  blk-mq: ensure that bd->last is always set correctly
  block: don't call ioc_exit_icq() with the queue lock held for blk-mq
  block: Initialize bd_bdi on inode initialization
  loop: fix LO_FLAGS_PARTSCAN hang
  nvme: Complete all stuck requests
  blk-mq: Provide freeze queue timeout
  blk-mq: Export blk_mq_freeze_queue_wait
  nbd: stop leaking sockets
  blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
  blk-mq: kill blk_mq_set_alloc_data()
  blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
  blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
  nvme: allocate nvme_queue in correct node
  PCI: add an API to get node from vector
  blk-mq: allocate blk_mq_tags and requests in correct node
2017-03-03 10:53:35 -08:00
Jens Axboe 113285b473 blk-mq: ensure that bd->last is always set correctly
When drivers are called with a request in blk-mq, blk-mq flags the
state such that the driver knows if this is the last request in
this call chain or not. The driver can then use that information
to defer kicking off IO until bd->last is true. However, with blk-mq
and scheduling, we need to allocate a driver tag for a request before
it can be issued. If we fail to allocate such a tag, we could end up
in the situation where the last request issued did not have
bd->last == true set. This can then cause a driver hang.

This fixes a hang with virtio-blk, which uses bd->last as a hint
on whether to kick the queue or not.

Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 14:30:51 -07:00
Keith Busch f91328c40a blk-mq: Provide freeze queue timeout
A driver may wish to take corrective action if queued requests do not
complete within a set time.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Keith Busch 6bae363ee3 blk-mq: Export blk_mq_freeze_queue_wait
Drivers can start a freeze, so this provides a way to wait for frozen.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Omar Sandoval 562bef4259 blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
No functional difference, it just makes a little more sense to update
the tag map where we actually allocate the tag.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
2017-03-02 08:56:04 -07:00
Omar Sandoval 6d2809d51a blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
its blk_mq_alloc_request() counterpart. It also crashes because it
doesn't update the tags->rqs map.

Fix it by making it allocate a scheduler request.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
2017-03-02 08:56:04 -07:00
Sagi Grimberg 415b806de5 blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Modified by me to also check at driver tag allocation time if the
original request was reserved, so we can be sure to allocate a
properly reserved tag at that point in time, too.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Shaohua Li 59f082e464 blk-mq: allocate blk_mq_tags and requests in correct node
blk_mq_tags/requests of specific hardware queue are mostly used in
specific cpus, which might not be in the same numa node as disk. For
example, a nvme card is in node 0. half hardware queue will be used by
node 0, the other node 1.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Ingo Molnar 174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Ingo Molnar 105ab3d8ce sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Omar Sandoval da55f2cc78 blk-mq: use sbq wait queues instead of restart for driver tags
Commit 50e1dab86a ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.

One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-23 11:55:46 -07:00
Jens Axboe 818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Jens Axboe 0c2a6fe4dc blk-mq: don't special case flush inserts for blk-mq-sched
The current request insertion machinery works just fine for
directly inserting flushes, so no need to special case
this anymore.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-02-17 12:35:47 -07:00
Jens Axboe 2aa0f21d54 blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
Currently we're almost there, but if we dispatch nothing, then we
still return success.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-02-17 12:35:47 -07:00
Josef Bacik f6f94300cd block: set make_request_fn manually in blk_mq_update_nr_hw_queues
Calling blk_queue_make_request resets a bunch of settings on the
request_queue, but all we really want is to update the make_request_fn,
so do this directly so we don't lose things like the logical and
physical block sizes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-10 11:15:01 -07:00
Christoph Hellwig 1e739730c5 block: optionally merge discontiguous discard bios into a single request
Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code.  I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08 13:43:08 -07:00
Christoph Hellwig 34fe7c0540 block: enumify ELEVATOR_*_MERGE
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08 13:43:06 -07:00
Omar Sandoval 18fbda91c6 block: use same block debugfs directory for blk-mq and blktrace
When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 10:20:16 -07:00
Jens Axboe 12d70958a2 blk-mq: don't fail allocating driver tag for stopped hw queue
We rely on blk_mq_get_driver_tag() not failing if 'wait' is true,
but it currently fails in that case if the queue happens to be
stopped at the time of the call.

We don't need to check for stopped here, it's just assigning
the tag. If the queue is stopped, we'll handle it when
attempting to run the queue.

This fixes a stall/crash on flush intensive workloads, where
we proceed to process a flush that doesn't have a valid tag
assigned.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 23:32:50 -07:00
Christoph Hellwig 82ed4db499 block: split scsi_request out of struct request
And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data.  To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 15:08:35 -07:00
Jens Axboe bd6737f1ae blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()
Instead of letting the caller check this and handle the details
of inserting a flush request, put the logic in the scheduler
insertion function. This fixes direct flush insertion outside
of the usual make_request_fn calls, like from dm via
blk_insert_cloned_request().

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:03:14 -07:00
Christoph Hellwig f73f44eb00 block: add a op_is_flush helper
This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:01:45 -07:00
Jens Axboe c13660a08c blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()
When we invoke dispatch_requests(), the scheduler empties everything
into the passed in list. This isn't always a good thing, since it
means that we remove items that we could have potentially merged
with.

Change the function to dispatch single requests at the time. If
we do that, we can backoff exactly at the point where the device
can't consume more IO, and leave the rest with the scheduler for
better merging and future dispatch decision making.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:35 -07:00
Jens Axboe 50e1dab86a blk-mq-sched: fix starvation for multiple hardware queues and shared tags
If we have both multiple hardware queues and shared tag map between
devices, we need to ensure that we propagate the hardware queue
restart bit higher up. This is because we can get into a situation
where we don't have any IO pending on a hardware queue, yet we fail
getting a tag to start new IO. If that happens, it's not enough to
mark the hardware queue as needing a restart, we need to bubble
that up to the higher level queue as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:34 -07:00
Jens Axboe 99cf1dc580 blk-mq: release driver tag on a requeue event
We don't want to hold on to this resource when we have a scheduler
attached.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:32 -07:00
Jens Axboe 3c782d67c1 blk-mq: fix potential race in queue restart and driver tag allocation
Once we mark the queue as needing a restart, re-check if we can
get a driver tag. This fixes a theoretical issue where the needed
IO completes _after_ blk_mq_get_driver_tag() fails, but before we
manage to set the restart bit.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:31 -07:00
Jens Axboe 0abad77412 blk-mq: improve scheduler queue sync/async running
We'll use the same criteria for whether we need to run the queue sync
or async when we have a scheduler, as we do without one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:30 -07:00
Omar Sandoval 07e4fead45 blk-mq: create debugfs directory tree
In preparation for putting blk-mq debugging information in debugfs,
create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

Also add the scaffolding for the actual files that will go in here,
either under the hardware queue or software queue directories.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 08:17:44 -07:00
Jens Axboe 5a797e00dc blk-mq: don't lose flags passed in to blk_mq_alloc_request()
If we come in from blk_mq_alloc_requst() with NOWAIT set in flags,
we must ensure that we don't later overwrite that in
blk_mq_sched_get_request(). Initialize alloc_data->flags before
passing it in.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26 12:22:11 -07:00
Jens Axboe 200e86b337 blk-mq: only apply active queue tag throttling for driver tags
If we have a scheduler attached, we have two sets of tags. We don't
want to apply our active queue throttling for the scheduler side
of tags, that only applies to driver tags since that's the resource
we need to dispatch an IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-25 08:11:38 -07:00
Jens Axboe 70f36b6001 blk-mq: allow resize of scheduler requests
Add support for growing the tags associated with a hardware queue, for
the scheduler tags. Currently we only support resizing within the
limits of the original depth, change that so we can grow it as well by
allocating and replacing the existing scheduler tag set.

This is similar to how we could increase the software queue depth with
the legacy IO stack and schedulers.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-20 09:05:53 -07:00
Jens Axboe 7e79dadce2 blk-mq: stop hardware queue in blk_mq_delay_queue()
The run handler we register for the delayed work requires that the
queue be stopped, yet we leave that up to the caller. Let's move
it into blk_mq_delay_queue() itself, so that the API is sane.

This fixes a stall with SCSI, where it calls blk_mq_delay_queue()
without having stopped the queue. Hence the queue is never run.

Reported-by: Hannes Reinecke <hare@suse.com>
Fixes: 70f4db639c ("blk-mq: add blk_mq_delay_queue")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-19 08:01:55 -07:00
Jens Axboe d348499138 blk-mq-sched: allow setting of default IO scheduler
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:31 -07:00
Jens Axboe bd166ef183 blk-mq-sched: add framework for MQ capable IO schedulers
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:20 -07:00
Jens Axboe 2af8cbe305 blk-mq: split tag ->rqs[] into two
This is in preparation for having two sets of tags available. For
that we need a static index, and a dynamically assignable one.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:15 -07:00
Jens Axboe fd2d332677 blk-mq: add support for carrying internal tag information in blk_qc_t
No functional change in this patch, just in preparation for having
two types of tags available to the block layer for a single request.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:09 -07:00
Jens Axboe cc71a6f438 blk-mq: abstract out helpers for allocating/freeing tag maps
Prep patch for adding an extra tag map for scheduler requests.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:04 -07:00
Jens Axboe 4941115bef blk-mq-tag: cleanup the normal/reserved tag allocation
This is in preparation for having another tag set available. Cleanup
the parameters, and allow passing in of tags for blk_mq_put_tag().

Signed-off-by: Jens Axboe <axboe@fb.com>
[hch: even more cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:59 -07:00
Jens Axboe 2c3ad66790 blk-mq: export some helpers we need to the scheduling framework
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:53 -07:00
Jens Axboe 16a3c2a70c blk-mq: un-export blk_mq_free_hctx_request()
It's only used in blk-mq, kill it from the main exported header
and kill the symbol export as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:03:48 -07:00
Jens Axboe f8a5b12247 blk-mq: make mq_ops a const pointer
We never change it, make that clear.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
2017-01-11 20:47:44 -07:00
Thomas Gleixner 8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Gabriel Krisman Bertazi d1b1cea1e5 blk-mq: Fix failed allocation path when mapping queues
In blk_mq_map_swqueue, there is a memory optimization that frees the
tags of a queue that has gone unmapped.  Later, if that hctx is remapped
after another topology change, the tags need to be reallocated.

If this allocation fails, a simple WARN_ON triggers, but the block layer
ends up with an active hctx without any corresponding set of tags.
Then, any income IO to that hctx can trigger an Oops.

I can reproduce it consistently by running IO, flipping CPUs on and off
and eventually injecting a memory allocation failure in that path.

In the fix below, if the system experiences a failed allocation of any
hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
should always keep it's tags.  There is a minor performance hit, since
our mapping just got worse after the error path, but this is
the simplest solution to handle this error path.  The performance hit
will disappear after another successful remap.

I considered dropping the memory optimization all together, but it
seemed a bad trade-off to handle this very specific error case.

This should apply cleanly on top of Jens' for-next branch.

The Oops is the one below:

SP (3fff935ce4d0) is in userspace
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
    pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
    lr: c000000000575328: __bt_get+0x48/0xd0
    sp: c000000fe99eb390
   msr: 900000010280b033
   dar: 28
 dsisr: 40000000
  current = 0xc000000fe9966800
  paca    = 0xc000000007e80300   softe: 0        irq_happened: 0x01
    pid   = 11035, comm = aio-stress
Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
1:mon> s
[c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
[c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
[c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
[c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
[c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
[c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
[c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
[c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
[c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
[c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
[c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
[c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
[c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
[c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
[c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
[c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
[c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
[c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
[c000000fe99ebe30] c000000000009404 system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-14 13:57:47 -07:00
Gabriel Krisman Bertazi 36e1f3d107 blk-mq: Avoid memory reclaim when remapping queues
While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.

I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system.  The trace below was collected after the machine stalled,
waiting for the hotplug event completion.

The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO.  With this patch, We couldn't hit the
issue anymore.

This should apply on top of Jens's for-next branch cleanly.

Changes since v1:
  - Use GFP_NOIO instead of GFP_NOWAIT.

 Call Trace:
[c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
[c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
[c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
[c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
[c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
[c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
[c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
[c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
[c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
[c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
[c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
[c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
[c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
[c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
[c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
[c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
[c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
[c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
[c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
[c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
[c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
[c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
[c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
[c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
[c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
[c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
[c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
[c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
[c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
[c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
[c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
[c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
[c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
[c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
[c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
[c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
[c000000f0160be30] [c000000000009204] system_call+0x38/0xec

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-14 08:15:25 -07:00
Jens Axboe f04c3df3ef blk-mq: abstract out blk_mq_dispatch_rq_list() helper
Takes a list of requests, and dispatches it. Moves any residual
requests to the dispatch list.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Jens Axboe ae911c5e79 blk-mq: add blk_mq_start_stopped_hw_queue()
We have a variant for all hardware queues, but not one for a single
hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Jens Axboe 6e85eaf307 blk-mq: blk_account_io_start() takes a bool
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2016-12-05 12:14:28 -07:00
Gabriel Krisman Bertazi 415d3dab96 blk-mq: Drop explicit timeout sync in hotplug
After commit 287922eb0b ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 08:01:08 -07:00
Jens Axboe 64f1c21e86 blk-mq: make the polling code adaptive
The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1	Never enter hybrid sleep mode
 0	Use half of the completion mean for the sleep delay
>0	Use this specific value as the sleep delay

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:57 -07:00
Jens Axboe 06426adf07 blk-mq: implement hybrid poll mode for sync O_DIRECT
This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:51 -07:00
Ming Lei 0a6219a95f block: deal with stale req count of plug list
In both legacy and mq path, req count of plug list is computed
before allocating request, so the number can be stale when falling
back to slept allocation, also the new introduced wbt can sleep
too.

This patch deals with the case by checking if plug list becomes
empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds'
which is introduced by Shaohua's patches of dispatching big request.

Fixes: 600271d900002(blk-mq: immediately dispatch big size request)
Fixes: 50d24c34403c6(block: immediately dispatch big size request)
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:09:51 -07:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Jens Axboe 066a4a73ce blk-mq: blk_mq_try_issue_directly() should lookup hardware queue
A previous commit changed this to pass in the hardware queue, but
it was using the wrong hardware queue. Hence a request that was
allocated on one hardware queue ended up being issued on another
one, and that caused IO timeouts and oopses on some drivers. Since
the request holds hardware queue private resources, like a tag,
we can't just issue it on a different hardware queue.

Fixes: 2253efc850 ("blk-mq: Move more code into blk_mq_direct_issue_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 12:24:46 -07:00
Jens Axboe 87760e5eef block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:40 -07:00
Jens Axboe cf43e6be86 block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:26 -07:00
Gabriel Krisman Bertazi c02ebfdddb blk-mq: Always schedule hctx->next_cpu
Commit 0e87e58bf6 ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

Fixes: 0e87e58bf6 ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-06 14:14:41 -07:00
Shaohua Li 600271d900 blk-mq: immediately dispatch big size request
This is corresponding part for blk-mq. Disk with multiple hardware
queues doesn't need this as we only hold 1 request at most.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-03 22:00:38 -06:00
Bart Van Assche 2b053aca76 blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 6a83e74d21 blk-mq: Introduce blk_mq_quiesce_queue()
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).
The algorithm used by blk_mq_quiesce_queue() is as follows:
* Hold either an RCU read lock or an SRCU read lock around
  .queue_rq() calls. The former is used if .queue_rq() does not
  block and the latter if .queue_rq() may block.
* blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
  followed by synchronize_srcu() or synchronize_rcu(). The latter
  call waits for .queue_rq() invocations that started before
  blk_mq_quiesce_queue() was called.
* The blk_mq_hctx_stopped() calls that control whether or not
  .queue_rq() will be called are called with the (S)RCU read lock
  held. This is necessary to avoid race conditions against
  blk_mq_quiesce_queue().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 9b7dd572cc blk-mq: Remove blk_mq_cancel_requeue_work()
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 52d7f1b5c2 blk-mq: Avoid that requeueing starts stopped queues
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
  if __dm_suspend() or __dm_resume() is called and not if the
  requeue list is processed.
* In the SCSI core queue stopping and starting should only be
  performed by the scsi_internal_device_block() and
  scsi_internal_device_unblock() functions but not by any other
  function. Although the blk_mq_stop_hw_queue() call in
  scsi_queue_rq() may help to reduce CPU load if a LLD queue is
  full, figuring out whether or not a queue should be restarted
  when requeueing a command would require to introduce additional
  locking in scsi_mq_requeue_cmd() to avoid a race with
  scsi_internal_device_block(). Avoid this complexity by removing
  the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
  blk_mq_start_stopped_hw_queues() explicitly should start stopped
  queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
  xen-blkfront driver in its blkif_recover() function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 2253efc850 blk-mq: Move more code into blk_mq_direct_issue_request()
Move the "hctx stopped" test and the insert request calls into
blk_mq_direct_issue_request(). Rename that function into
blk_mq_try_issue_directly() to reflect its new semantics. Pass
the hctx pointer to that function instead of looking it up a
second time. These changes avoid that code has to be duplicated
in the next patch.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche fd00144301 blk-mq: Introduce blk_mq_queue_stopped()
The function blk_queue_stopped() allows to test whether or not a
traditional request queue has been stopped. Introduce a helper
function that allows block drivers to query easily whether or not
one or more hardware contexts of a blk-mq queue have been stopped.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 5d1b25c1ec blk-mq: Introduce blk_mq_hctx_stopped()
Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
a helper function that performs this test.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche bc27c01b5c blk-mq: Do not invoke .queue_rq() for a stopped queue
The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Christoph Hellwig ef295ecf09 block: better op and flags encoding
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields.  This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits.  Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:48:16 -06:00
Christoph Hellwig e806402130 block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.

This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests.  It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28 08:45:17 -06:00
Jens Axboe 2552e3f878 blk-mq: get rid of confusing blk_map_ctx structure
We can just use struct blk_mq_alloc_data - it has a few more
members, but we allocate it further down the stack anyway. So
this cleans up the code, and reduces the stack overhead a bit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-27 19:13:58 -06:00
Jens Axboe 7dd2fb6877 blk-mq: update hardware and software queues for sleeping alloc
If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Fixes: 63581af3f3 ("blk-mq: remove non-blocking pass in blk_mq_map_request")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-27 19:13:56 -06:00
Linus Torvalds 24532f7681 Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-block
Pull blk-mq CPU hotplug update from Jens Axboe:
 "This is the conversion of blk-mq to the new hotplug state machine"

* 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
  blk-mq: fixup "Convert to new hotplug state machine"
  blk-mq: Convert to new hotplug state machine
  blk-mq/cpu-notif: Convert to new hotplug state machine
2016-10-09 17:32:20 -07:00
Linus Torvalds 12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Linus Torvalds 513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Christoph Hellwig c8712c6a67 blk-mq: skip unmapped queues in blk_mq_alloc_request_hctx
This provides the caller a feedback that a given hctx is not mapped and thus
no command can be sent on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-23 10:25:48 -06:00
Sebastian Andrzej Siewior 97a32864e6 blk-mq: fixup "Convert to new hotplug state machine"
The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing
a copy. Since we might never had an online callback we could end up with
a ZERO mask which in turn leads to crash as test robot demonstarted.

Fixes: 65d5291eee ("blk-mq: Convert to new hotplug state machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-23 09:49:32 -06:00
Jens Axboe 1b792f2f92 blk-mq: add flag for drivers wanting blocking ->queue_rq()
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that we always
calls it from a safe context.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Josef Bacik <jbacik@fb.com>
2016-09-22 14:28:38 -06:00
Christoph Hellwig 63581af3f3 blk-mq: remove non-blocking pass in blk_mq_map_request
bt_get already does a non-blocking pass as well as running the queue
when scheduling internally, no need to duplicate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-22 14:27:39 -06:00
Jens Axboe 841bac2c87 blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
Two cases:

1) blk_mq_alloc_request() needlessly re-runs the queue, after
   calling into the tag allocation without NOWAIT set. We don't
   need to do that.

2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
   the async flag set to false.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-09-22 09:39:53 -06:00
Sebastian Andrzej Siewior 65d5291eee blk-mq: Convert to new hotplug state machine
Install the callbacks via the state machine so we can phase out the cpu
hotplug notifiers mess.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-22 08:05:19 -06:00
Thomas Gleixner 9467f85960 blk-mq/cpu-notif: Convert to new hotplug state machine
Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-block@vger.kernel.org
Cc: rt@linutronix.de
Cc: Christoph Hellwing <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-22 08:05:17 -06:00
Omar Sandoval 40aabb6746 sbitmap: push per-cpu last_tag into sbitmap_queue
Allocating your own per-cpu allocation hint separately makes for an
awkward API. Instead, allocate the per-cpu hint as part of the struct
sbitmap_queue. There's no point for a struct sbitmap_queue without the
cache, but you can still use a bare struct sbitmap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-17 08:39:10 -06:00
Omar Sandoval 88459642cb blk-mq: abstract tag allocation out into sbitmap library
This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.

The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.

This should be a complete noop functionality-wise.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-17 08:38:44 -06:00
Jens Axboe 703fd1c0f1 blk-mq: account higher order dispatch
We currently account a '0' dispatch, and anything above that still falls
below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
we don't account it.

Change the last bucket to be inclusive of anything above the range we
track, and have the sysfs file reflect that by including a '+' in the
output:

$ cat /sys/block/nvme0n1/mq/0/dispatched
        0	1006
        1	20229
        2	1
        4	0
        8	0
       16	0
       32+	0

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2016-09-16 14:03:04 -06:00
Jens Axboe 9151bcb4fb blk-mq: kill unused blk_mq_create_mq_map()
Fixes 1b157939f9 ("blk-mq: get rid of the cpumask in struct blk_mq_tags")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:45:45 -06:00
Christoph Hellwig 1b157939f9 blk-mq: get rid of the cpumask in struct blk_mq_tags
Unused now that NVMe sets up irq affinity before calling into blk-mq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig da695ba236 blk-mq: allow the driver to pass in a queue mapping
This allows drivers specify their own queue mapping by overriding the
setup-time function that builds the mq_map.  This can be used for
example to build the map based on the MSI-X vector mapping provided
by the core interrupt layer for PCI devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig 7d7e0f90b7 blk-mq: remove ->map_queue
All drivers use the default, so provide an inline version of it.  If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.

This provides better code generation, and better debugability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig bdd17e75cd blk-mq: only allocate a single mq_map per tag_set
The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple.  Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Christoph Hellwig 4e68a01142 blk-mq: don't redistribute hardware queues on a CPU hotplug event
Currently blk-mq will totally remap hardware context when a CPU hotplug
even happened, which causes major havoc for drivers, as they are never
told about this remapping.  E.g. any carefully sorted out CPU affinity
will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is.  If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simplify retain the queue mapping for
a soft offlined CPU that will work when it comes back online, and will
map any newly onlined CPU to queue 0 until the driver initiates
a rebuild of the queue map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Mike Snitzer 2849450ad3 blk-mq: introduce blk_mq_delay_kick_requeue_list()
blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time.  To do this the request_queue's
'requeue_work' member was changed to a delayed_work.

blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests while it doesn't make sense to immediately requeue them
(e.g. when all paths in a DM multipath have failed).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-14 11:48:34 -06:00
Jens Axboe 88c7b2b751 blk-mq: prefetch request in blk_mq_tag_to_rq()
When drivers or the core calls this function, they usually
dereference the request shortly there after. Prefetch the first
cache line.

Profiling IO workloads shows that this is the most common cache
miss on the block side of things.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-29 08:13:21 -06:00
Jens Axboe 27489a3c82 blk-mq: turn hctx->run_work into a regular work struct
We don't need the larger delayed work struct, since we always run it
immediately.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-29 08:13:21 -06:00
Jens Axboe 0e87e58bf6 blk-mq: improve warning for running a queue on the wrong CPU
__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online.  If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-24 15:38:01 -06:00
Jens Axboe e57690fe00 blk-mq: don't overwrite rq->mq_ctx
We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.

This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software <-> hardware queue mappings.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-24 15:34:35 -06:00
Jens Axboe 1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Gabriel Krisman Bertazi 71f79fb317 blk-mq: Allow timeouts to run while queue is freezing
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Christoph Hellwig 0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
Ming Lin 1f5bd336b9 blk-mq: add blk_mq_alloc_request_hctx
For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.

Based on an earlier patch from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:07 -06:00
Omar Sandoval 52b9c330c6 blk-mq: actually hook up defer list when running requests
If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
over the rest of the loop body. However, dptr is assigned later in the
loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
want it for.

NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
in-tree driver, but if the code's going to be there, it might as well
work.

Fixes: 74c450521d ("blk-mq: add a 'list' parameter to ->queue_rq()")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-09 09:55:15 -06:00
Mike Christie 28a8f0d317 block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie d9d8c5c489 block: convert is_sync helpers to use REQ_OPs.
This patch converts the is_sync helpers to use separate variables
for the operation and flags.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie cc6e3b1092 block: prepare mq request creation to use REQ_OPs
This patch modifies the blk mq request creation code to use
separate variables for the operation and flags, because in the
the next patches the struct request users will be converted like
was done for bios.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Omar Sandoval 87c279e613 blk-mq: really fix plug list flushing for nomerge queues
Commit 0809e3ac62 ("block: fix plug list flushing for nomerge queues")
updated blk_mq_make_request() to set request_count even when
blk_queue_nomerges() returns true. However, blk_mq_make_request() only
does limited plugging and doesn't use request_count;
blk_sq_make_request() is the one that should have been fixed. Do that
and get rid of the unnecessary work in the mq version.

Fixes: 0809e3ac62 ("block: fix plug list flushing for nomerge queues")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-02 11:47:32 -06:00
Ming Lin c7de572630 blk-mq: clear q->mq_ops if init fail
blk_mq_init_queue() calls blk_mq_init_allocated_queue(), but q->mq_ops
was not cleared when blk_mq_init_allocated_queue() fails.
Then blk_cleanup_queue() calls blk_mq_free_queue() which will crash because:
- q->all_q_node is not added to all_q_list yet
- q->tag_set is NULL
- hctx was not setup yet or already freed

Fixed it by clearing q->mq_ops on error path.

Signed-off-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-26 08:51:43 -06:00
Bartlomiej Zolnierkiewicz b3a834b159 blk-mq: fix undefined behaviour in order_to_size()
When this_order variable in blk_mq_init_rq_map() becomes zero
the code incorrectly decrements the variable and passes the result
to order_to_size() helper causing undefined behaviour:

 UBSAN: Undefined behaviour in block/blk-mq.c:1459:27
 shift exponent 4294967295 is too large for 32-bit type 'unsigned int'
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00072-g33656a1 #22

Fix the code by checking this_order variable for not having the zero
value first.

Reported-by: Meelis Roos <mroos@linux.ee>
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-16 09:54:47 -06:00
Michael Callahan a21f2a3ec6 block: Minor blk_account_io_start usage cleanup
blk_account_io_start does not need to be wrapped with blk_do_io_stat
ais it already checks for that condition.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-03 09:26:58 -06:00
Thomas Gleixner 897bb0c7f1 blk-mq: Use proper cpumask iterator
queue_for_each_ctx() iterates over per_cpu variables under the assumption that
the possible cpu mask cannot have holes. That's wrong as all cpumasks can have
holes. In case there are holes the iteration ends up accessing uninitialized
memory and crashing as a result.

Replace the macro by a proper for_each_possible_cpu() loop and drop the unused
macro blk_ctx_sum() which references queue_for_each_ctx().

Reported-by: Xiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-20 09:34:02 -06:00
Linus Torvalds 35d88d97be Merge branch 'for-4.6/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "Here are the core block changes for this merge window.  Not a lot of
  exciting stuff going on in this round, most of the changes have been
  on the driver side of things.  That pull request is coming next.  This
  pull request contains:

   - A set of fixes for chained bio handling from Christoph.

   - A tag bounds check for blk-mq from Hannes, ensuring that we don't
     do something stupid if a device reports an invalid tag value.

   - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

   - A set of blk-mq fixes from Keith, adding support for dynamic
     hardware queues, and fixing init of max_dev_sectors for stacking
     devices.

   - A fix for the dynamic hw context from Ming.

   - Enabling of cgroup writeback support on a block device, from
     Shaohua"

* 'for-4.6/core' of git://git.kernel.dk/linux-block:
  blk-mq: add bounds check on tag-to-rq conversion
  block: bio_remaining_done() isn't unlikely
  block: cleanup bio_endio
  block: factor out chained bio completion
  block: don't unecessarily clobber bi_error for chained bios
  block-dev: enable writeback cgroup support
  blk-mq: Fix NULL pointer updating nr_requests
  blk-mq: mark request queue as mq asap
  block: Initialize max_dev_sectors to 0
  blk-mq: dynamic h/w context count
  cfq-iosched: Allow parent cgroup to preempt its child
  cfq-iosched: Allow sync noidle workloads to preempt each other
  cfq-iosched: Reorder checks in cfq_should_preempt()
  cfq-iosched: Don't group_idle if cfqq has big thinktime
2016-03-18 16:43:11 -07:00
Hannes Reinecke 4ee86babe0 blk-mq: add bounds check on tag-to-rq conversion
We need to check for a valid index before accessing the array
element to avoid accessing invalid memory regions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Modified by Jens to drop the unlikely(), and make the fall through
path be having a valid tag.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-15 12:03:28 -07:00
Keith Busch e9137d4b93 blk-mq: Fix NULL pointer updating nr_requests
A h/w context's tags are freed if it was not assigned a CPU. Check if
the context has tags before updating the depth.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03 14:46:04 -07:00
Ming Lei 6684167216 blk-mq: mark request queue as mq asap
Currently q->mq_ops is used widely to decide if the queue
is mq or not, so we should set the 'flag' asap so that both
block core and drivers can get the correct mq info.

For example, commit 868f2f0b720(blk-mq: dynamic h/w context count)
moves the hctx's initialization before setting q->mq_ops in
blk_mq_init_allocated_queue(), then cause blk_alloc_flush_queue()
to think the queue is non-mq and don't allocate command size
for the per-hctx flush rq.

This patches should fix the problem reported by Sasha.

Cc: Keith Busch <keith.busch@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Fixes: 868f2f0b72 ("blk-mq: dynamic h/w context count")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-14 15:35:14 -07:00
Keith Busch a59e0f5795 blk-mq: End unstarted requests on dying queue
Go directly to ending a request if it wasn't started. Previously, completing a
request may invoke a driver callback for a request it didn't initialize.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-11 13:14:00 -07:00
Keith Busch 868f2f0b72 blk-mq: dynamic h/w context count
The hardware's provided queue count may change at runtime with resource
provisioning. This patch allows a block driver to alter the number of
h/w queues available when its resource count changes.

The main part is a new blk-mq API to request a new number of h/w queues
for a given live tag set. The new API freezes all queues using that set,
then adjusts the allocated count prior to remapping these to CPUs.

The bulk of the rest just shifts where h/w contexts and all their
artifacts are allocated and freed.

The number of max h/w contexts is capped to the number of possible cpus
since there is no use for more than that. As such, all pre-allocated
memory for pointers need to account for the max possible rather than
the initial number of queues.

A side effect of this is that the blk-mq will proceed successfully as
long as it can allocate at least one h/w context. Previously it would
fail request queue initialization if less than the requested number
was allocated.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-02-09 12:42:17 -07:00
Linus Torvalds 3e1e21c7bf Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-block
Pull NVMe updates from Jens Axboe:
 "Last branch for this series is the nvme changes.  It's in a separate
  branch to avoid splitting too much between core and NVMe changes,
  since NVMe is still helping drive some blk-mq changes.  That said, not
  a huge amount of core changes in here.  The grunt of the work is the
  continued split of the code"

* 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits)
  uapi: update install list after nvme.h rename
  NVMe: Export NVMe attributes to sysfs group
  NVMe: Shutdown controller only for power-off
  NVMe: IO queue deletion re-write
  NVMe: Remove queue freezing on resets
  NVMe: Use a retryable error code on reset
  NVMe: Fix admin queue ring wrap
  nvme: make SG_IO support optional
  nvme: fixes for NVME_IOCTL_IO_CMD on the char device
  nvme: synchronize access to ctrl->namespaces
  nvme: Move nvme_freeze/unfreeze_queues to nvme core
  PCI/AER: include header file
  NVMe: Export namespace attributes to sysfs
  NVMe: Add pci error handlers
  block: remove REQ_NO_TIMEOUT flag
  nvme: merge iod and cmd_info
  nvme: meta_sg doesn't have to be an array
  nvme: properly free resources for cancelled command
  nvme: simplify completion handling
  nvme: special case AEN requests
  ...
2016-01-21 19:58:02 -08:00
Christoph Hellwig bbc758ec04 block: remove REQ_NO_TIMEOUT flag
This was added for the 'magic' AEN requests in the NVMe driver that never
return.  We now handle them purely inside the driver and don't need this
core hack any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 09:38:34 -07:00
Christoph Hellwig 287922eb0b block: defer timeouts to a workqueue
Timer context is not very useful for drivers to perform any meaningful abort
action from.  So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals.  But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Bush:

"This patch removes synchronizing the timeout work so that the timer can
 start a freeze on its own queue. The timer enters the queue, so timer
 context can only start a freeze, but not wait for frozen."

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-22 09:38:16 -07:00
Raghavendra K T bffed45716 blk-mq: Avoid memoryless numa node encoded in hctx numa_node
In architecture like powerpc, we can have cpus without any local memory
attached to it (a.k.a memoryless nodes). In such cases cpu to node mapping
can result in memory allocation hints for block hctx->numa_node populated
with node values which does not have real memory.

Instead use local_memory_node(), which is guaranteed to have memory.
local_memory_node is a noop in other architectures that does not support
memoryless nodes.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-03 09:56:27 -07:00
Raghavendra K T e0e827b9fc blk-mq: Reuse hardware context cpumask for tags
hctx->cpumask is already populated and let the tag cpumask follow that
instead of going through a new for loop.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-03 09:56:25 -07:00
Christoph Hellwig 6f3b0e8bcf blk-mq: add a flags parameter to blk_mq_alloc_request
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-12-01 10:53:59 -07:00
Jens Axboe b094f89ca4 blk-mq: fix calling unplug callbacks with preempt disabled
Liu reported that running certain parts of xfstests threw the
following error:

BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
 #0:  ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
 #2:  (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G           OE   4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
 ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
 0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
 ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
 [<ffffffff8130191b>] dump_stack+0x4f/0x74
 [<ffffffff8108ed95>] ___might_sleep+0x185/0x240
 [<ffffffff8108eea2>] __might_sleep+0x52/0x90
 [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
 [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
 [<ffffffff8109a6d1>] ? local_clock+0x21/0x40
 [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
 [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
 [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
 [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
 [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
 [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
 [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
 [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
 [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
 [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
 [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
 [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
 [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
 [<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
 [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
 [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
 [<ffffffff812d0ca0>] submit_bio+0x70/0x140
 [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
 [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
 [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
 [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
 [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
 [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
 [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
 [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
 [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
 [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
 [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
 [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
 [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
 [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
 [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
 [<ffffffff81184df3>] do_writepages+0x23/0x40
 [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
 [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
 [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
 [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
 [<ffffffff811e6805>] ? trylock_super+0x25/0x60
 [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
 [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
 [<ffffffff812130b5>] wb_writeback+0x2b5/0x500
 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
 [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
 [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
 [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
 [<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
 [<ffffffff81213842>] wb_workfn+0x92/0x330
 [<ffffffff8107f133>] process_one_work+0x223/0x730
 [<ffffffff8107f083>] ? process_one_work+0x173/0x730
 [<ffffffff8108035f>] ? worker_thread+0x18f/0x430
 [<ffffffff810802ed>] worker_thread+0x11d/0x430
 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
 [<ffffffff810858df>] kthread+0xef/0x110
 [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff816673bf>] ret_from_fork+0x3f/0x70
 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70

The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.

Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.

This only affects blk-mq driven devices, and only those that
register a single queue.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-20 20:29:45 -07:00
Jens Axboe 1fa8cc52f4 blk-mq: mark __blk_mq_complete_request() static
It's no longer used outside of blk-mq core.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-11 09:36:56 -07:00
Linus Torvalds 3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Jens Axboe 7b371636fb blk-mq: return tag/queue combo in the make_request_fn handlers
Return a cookie, blk_qc_t, from the blk-mq make request functions, that
allows a later caller to uniquely identify a specific IO. The cookie
doesn't mean anything to the caller, but the caller can use it to later
pass back to the block layer. The block layer can then identify the
hardware queue and request from that cookie.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:47 -07:00
Jens Axboe dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Mel Gorman 71baba4b92 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep.  Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep.  The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
them prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Mel Gorman d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Linus Torvalds 527d1529e3 Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
2015-11-04 20:51:48 -08:00
Linus Torvalds d9734e0d1c Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This is the core block pull request for 4.4.  I've got a few more
  topic branches this time around, some of them will layer on top of the
  core+drivers changes and will come in a separate round.  So not a huge
  chunk of changes in this round.

  This pull request contains:

   - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

   - Unused prototype removal in blk-mq from Christoph.

   - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
     xchg()'s, from Davidlohr.

   - A plug flush fix from Jeff.

   - Also from Jeff, a fix that means we don't have to update shared tag
     sets at init time unless we do a state change.  This cuts down boot
     times on thousands of devices a lot with scsi/blk-mq.

   - blk-mq waitqueue barrier fix from Kosuke.

   - Various fixes from Ming:

        - Fixes for segment merging and splitting, and checks, for
          the old core and blk-mq.

        - Potential blk-mq speedup by marking ctx pending at the end
          of a plug insertion batch in blk-mq.

        - direct-io no page dirty on kernel direct reads.

   - A WRITE_SYNC fix for mpage from Roman"

* 'for-4.4/core' of git://git.kernel.dk/linux-block:
  blk-mq: avoid excessive boot delays with large lun counts
  blktrace: re-write setting q->blk_trace
  blk-mq: mark ctx as pending at batch in flush plug path
  blk-mq: fix for trace_block_plug()
  block: check bio_mergeable() early before merging
  blk-mq: check bio_mergeable() early before merging
  block: avoid to merge splitted bio
  block: setup bi_phys_segments after splitting
  block: fix plug list flushing for nomerge queues
  blk-mq: remove unused blk_mq_clone_flush_request prototype
  blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
  fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
  fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
  block: kmemleak: Track the page allocations for struct request
2015-11-04 20:28:10 -08:00
Jeff Moyer 2404e607a9 blk-mq: avoid excessive boot delays with large lun counts
Hi,

Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled.  He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one.  We don't need to do this.  After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, and
updating the newly added queue's hctxs if the flag is set.

This problem was introduced by commit 0d2602ca30 (blk-mq: improve
support for shared tags maps).

Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-03 08:42:19 -07:00
Ming Lei cfd0c552a8 blk-mq: mark ctx as pending at batch in flush plug path
Most of times, flush plug should be the hottest I/O path,
so mark ctx as pending after all requests in the list are
inserted.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:58 -06:00
Ming Lei 676d06077f blk-mq: fix for trace_block_plug()
The trace point is for tracing plug event of each request
queue instead of each task, so we should check the request
count in the plug list from current queue instead of
current task.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:56 -06:00
Ming Lei e18378a60e blk-mq: check bio_mergeable() early before merging
It isn't necessary to try to merge the bio which is marked
as NOMERGE.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:53 -06:00
Jeff Moyer 0809e3ac62 block: fix plug list flushing for nomerge queues
Request queues with merging disabled will not flush the plug list after
BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
on blk_attempt_plug_merge to compute the request_count.  Fix this by
computing the number of queued requests even for nomerge queues.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 15:00:48 -06:00
Dan Williams 3ef28e83ab block: generic request_queue reference counting
Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:43:41 -06:00
Junichi Nomura f42d79ab67 blk-mq: fix use-after-free in blk_mq_free_tag_set()
tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-15 08:45:58 -06:00
Jens Axboe fd48ca3849 Linux 4.3-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWEUxnAAoJEHm+PkMAQRiGYCYH/3gtGkFdvSLi+E1PfI8Qk3ZA
 XuYA4Mj09JBVSmaICeueMTDVrdiq0OE0zPib26GWlF/za13kNU8KgMR3+6XCuLSX
 DiCmh6mwDItoNoSIIUERLqrFHABXz8rZ3gb3uu2+kNN74Cl0piNm1YpFclEEWjMr
 9Wk5fkq+ontnDVUQOvWUxPiUXOJTvdLXBWTRDw1yTdE3RMNwRI2d/hme6Hq++WYV
 tRalZZKQaoB33js9WRVAoLVunvtna+i+/y7VGLj8QyS0+d6ec81Hey2r1/fR/oG4
 bs4ul6vtqeb3IR/PjUqxF59pSrCLEO+qrp9KrTlJNYgr1m1QyjRxWUdy/XhyaWo=
 =gIhN
 -----END PGP SIGNATURE-----

Merge tag 'v4.3-rc4' into for-4.4/core

Linux 4.3-rc4

Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone
in since for-4.4/core was based.
2015-10-09 10:08:39 -06:00
Christoph Hellwig 0bf6cd5b95 blk-mq: factor out a helper to iterate all tags for a request_queue
And replace the blk_mq_tag_busy_iter with it - the driver use has been
replaced with a new helper a while ago, and internal to the block we
only need the new version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-01 10:10:57 +02:00
Christoph Hellwig f4829a9b7a blk-mq: fix racy updates of rq->errors
blk_mq_complete_request may be a no-op if the request has already
been completed by others means (e.g. a timeout or cancellation), but
currently drivers have to set rq->errors before calling
blk_mq_complete_request, which might leave us with the wrong error value.

Add an error parameter to blk_mq_complete_request so that we can
defer setting rq->errors until we known we won the race to complete the
request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-01 10:10:55 +02:00
Akinobu Mita 60de074ba1 blk-mq: fix deadlock when reading cpu_list
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
entries by blk_mq_sysfs_unregister().  Removing sysfs entry needs to
be blocked until the active reference of the kernfs_node to be zero.

On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
/sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
blk_mq_hw_sysfs_cpus_show().

If these happen at the same time, a deadlock can happen.  Because one
can wait for the active reference to be zero with holding all_q_mutex,
and the other tries to acquire all_q_mutex with holding the active
reference.

The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
is to avoid reading an imcomplete hctx->cpumask.  Since reading sysfs
entry for blk-mq needs to acquire q->sysfs_lock, we can avoid deadlock
and reading an imcomplete hctx->cpumask by protecting q->sysfs_lock
while hctx->cpumask is being updated.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:51 -06:00
Akinobu Mita 5778322e67 blk-mq: avoid inserting requests before establishing new mapping
Notifier callbacks for CPU_ONLINE action can be run on the other CPU
than the CPU which was just onlined.  So it is possible for the
process running on the just onlined CPU to insert request and run
hw queue before establishing new mapping which is done by
blk_mq_queue_reinit_notify().

This can cause a problem when the CPU has just been onlined first time
since the request queue was initialized.  At this time ctx->index_hw
for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
zero before blk_mq_queue_reinit_notify() is called by notifier
callbacks for CPU_ONLINE action.

For example, there is a single hw queue (hctx) and two CPU queues
(ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
a request is inserted into ctx1->rq_list and set bit0 in pending
bitmap as ctx1->index_hw is still zero.

And then while running hw queue, flush_busy_ctxs() finds bit0 is set
in pending bitmap and tries to retrieve requests in
hctx->ctxs[0]->rq_list.  But htx->ctxs[0] is a pointer to ctx0, so the
request in ctx1->rq_list is ignored.

Fix it by ensuring that new mapping is established before onlined cpu
starts running.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:50 -06:00
Akinobu Mita 0e62636820 blk-mq: fix q->mq_usage_counter access race
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
q->mq_usage_counter while freezing all request queues in all_q_list.
On the other hand, q->mq_usage_counter is deinitialized in
blk_mq_free_queue() before deleting the queue from all_q_list.

So if CPU hotplug event occurs in the window, percpu_ref_kill() is
called with q->mq_usage_counter which has already been marked dead,
and it triggers warning.  Fix it by deleting the queue from all_q_list
earlier than destroying q->mq_usage_counter.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:48 -06:00
Akinobu Mita a723bab3d7 blk-mq: Fix use after of free q->mq_map
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
q->mq_map by blk_mq_update_queue_map() for all request queues in
all_q_list.  On the other hand, q->mq_map is released before deleting
the queue from all_q_list.

So if CPU hotplug event occurs in the window, invalid memory access
can happen.  Fix it by releasing q->mq_map in blk_mq_release() to make
it happen latter than removal from all_q_list.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:46 -06:00
Akinobu Mita 4593fdbe7a blk-mq: fix sysfs registration/unregistration race
There is a race between cpu hotplug handling and adding/deleting
gendisk for blk-mq, where both are trying to register and unregister
the same sysfs entries.

null_add_dev
    --> blk_mq_init_queue
        --> blk_mq_init_allocated_queue
            --> add to 'all_q_list' (*)
    --> add_disk
        --> blk_register_queue
            --> blk_mq_register_disk (++)

null_del_dev
    --> del_gendisk
        --> blk_unregister_queue
            --> blk_mq_unregister_disk (--)
    --> blk_cleanup_queue
        --> blk_mq_free_queue
            --> del from 'all_q_list' (*)

blk_mq_queue_reinit
    --> blk_mq_sysfs_unregister (-)
    --> blk_mq_sysfs_register (+)

While the request queue is added to 'all_q_list' (*),
blk_mq_queue_reinit() can be called for the queue anytime by CPU
hotplug callback.  But blk_mq_sysfs_unregister (-) and
blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--)
is finished.  Because '/sys/block/*/mq/' is not exists.

There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
be used to track these sysfs stuff, but it is only fixing this issue
partially.

In order to fix it completely, we just need per-queue flag instead of
per-hctx flag with appropriate locking.  So this introduces
q->mq_sysfs_init_done which is properly protected with all_q_mutex.

Also, we need to ensure that blk_mq_map_swqueue() is called with
all_q_mutex is held.  Since hctx->nr_ctx is reset temporarily and
updated in blk_mq_map_swqueue(), so we should avoid
blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
in CPU hotplug handling or adding/deleting gendisk .

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:45 -06:00
Akinobu Mita 1356aae083 blk-mq: avoid setting hctx->tags->cpumask before allocation
When unmapped hw queue is remapped after CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is setup in
blk_mq_map_swqueue(), otherwise it causes null pointer dereference.

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-29 11:32:43 -06:00
Catalin Marinas f75782e4e0 block: kmemleak: Track the page allocations for struct request
The pages allocated for struct request contain pointers to other slab
allocations (via ops->init_request). Since kmemleak does not track/scan
page allocations, the slab objects will be reported as leaks (false
positives). This patch adds kmemleak callbacks to allow tracking of such
pages.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche<bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-09-23 11:00:57 -06:00
Ming Lei 0048b4837a blk-mq: fix race between timeout and freeing request
Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  439.700653] Dumping ftrace buffer:^M
[  439.700653]    (ftrace buffer empty)^M
[  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
[  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[  439.730500] Stack:^M
[  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[  439.755663] Call Trace:^M
[  439.755663]  <IRQ> ^M
[  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
[  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
[  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
[  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
[  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
[  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
[  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
[  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
[  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
[  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
[  439.755663]  <EOI> ^M
[  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
[  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
[  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
[  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
[  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
[  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
[  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
[  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
[  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
[  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
[  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
[  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.790911]  RSP <ffff880819203da0>^M
[  439.790911] CR2: 0000000000000158^M
[  439.790911] ---[ end trace d40af58949325661 ]---^M

Cc: <stable@vger.kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-15 09:45:21 -06:00
Kent Overstreet 54efd50bfd block: make generic_make_request handle arbitrarily sized bios
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:33 -06:00
Christoph Hellwig 4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Ming Lei e56f698bd0 blk-mq: set default timeout as 30 seconds
It is reasonable to set default timeout of request as 30 seconds instead of
30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64
based systems may choose 100 HZ.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Fixes: c76cbbcf40 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()"
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-16 08:39:11 -06:00
Linus Torvalds bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00
Ming Lei c3b4afca70 blk-mq: free hctx->ctxs in queue's release handler
Now blk_cleanup_queue() can be called before calling
del_gendisk()[1], inside which hctx->ctxs is touched
from blk_mq_unregister_hctx(), but the variable has
been freed by blk_cleanup_queue() at that time.

So this patch moves freeing of hctx->ctxs into queue's
release handler for fixing the oops reported by Stefan.

[1], 6cd18e711d (block: destroy bdi before blockdev is
unregistered)

Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-09 15:32:38 -06:00
Keith Busch f26cdc8536 blk-mq: Shared tag enhancements
Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-01 14:35:56 -06:00
Christoph Hellwig 4ecd4fef3a block: use an atomic_t for mq_freeze_depth
lockdep gets unhappy about the not disabling irqs when using the queue_lock
around it.  Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:12:59 -06:00
Shaohua Li 5b3f341f09 blk-mq: make plug work for mutiple disks and queues
Last patch makes plug work for multiple queue case. However it only
works for single disk case, because it assumes only one request in the
plug list. If a task is accessing multiple disks, eg MD/DM, the
assumption is wrong. Let blk_attempt_plug_merge() record request from
the same queue.

V2: use NULL parameter in !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge to make it less (hopefully) confusion.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:23 -06:00
Shaohua Li f984df1f0f blk-mq: do limited block plug for multiple queue case
plug is still helpful for workload with IO merge, but it can be harmful
otherwise especially with multiple hardware queues, as there is
(supposed) no lock contention in this case and plug can introduce
latency. For multiple queues, we do limited plug, eg plug only if there
is request merge. If a request doesn't have merge with following
request, the requet will be dispatched immediately.

V2: check blk_queue_nomerges() as suggested by Jeff.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:21 -06:00
Shaohua Li 239ad215f0 blk-mq: avoid re-initialize request which is failed in direct dispatch
If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we already assigned bio to a request in
blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
blk_mq_bio_to_request again.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:19 -06:00
Jeff Moyer e6c4438ba7 blk-mq: fix plugging in blk_sq_make_request
The following appears in blk_sq_make_request:

	/*
	 * If we have multiple hardware queues, just go directly to
	 * one of those for sync IO.
	 */

We clearly don't have multiple hardware queues, here!  This comment was
introduced with this commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
      for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
      we want to go straight to hardware for SYNC IO.

The old code had this:

        use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

	use_plug = !is_flush_fua && !is_sync;

which is not equivalent.  For the single queue case, that second half of
the && expression is always true.  So, what I think was actually inteded
follows (and this more closely matches what is done in blk_queue_bio).

V2: delete the 'likely', which should not be a big deal

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:17 -06:00
Shaohua Li 9ba52e5812 blk-mq: don't lose requests if a stopped queue restarts
Normally if driver is busy to dispatch a request the logic is like below:
block layer:					driver:
	__blk_mq_run_hw_queue
a.						blk_mq_stop_hw_queue
b.	rq add to ctx->dispatch

later:
1.						blk_mq_start_hw_queue
2.	__blk_mq_run_hw_queue

But it's possible step 1-2 runs between a and b. And since rq isn't in
ctx->dispatch yet, step 2 will not run rq. The rq might get lost if
there are no subsequent requests kick in.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-04 14:32:48 -06:00
Ming Lei 2a34c0872a blk-mq: fix CPU hotplug handling
hctx->tags has to be set as NULL in case that it is to be unmapped
no matter if set->tags[hctx->queue_num] is NULL or not in blk_mq_map_swqueue()
because shared tags can be freed already from another request queue.

The same situation has to be considered during handling CPU online too.
Unmapped hw queue can be remapped after CPU topo is changed, so we need
to allocate tags for the hw queue in blk_mq_map_swqueue(). Then tags
allocation for hw queue can be removed in hctx cpu online notifier, and it
is reasonable to do that after mapping is updated.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23 10:27:38 -06:00
Ming Lei f054b56c95 blk-mq: fix race between timeout and CPU hotplug
Firstly during CPU hotplug, even queue is freezed, timeout
handler still may come and access hctx->tags, which may cause
use after free, so this patch deactivates timeout handler
inside CPU hotplug notifier.

Secondly, tags can be shared by more than one queues, so we
have to check if the hctx has been unmapped, otherwise
still use-after-free on tags can be triggered.

Cc: <stable@vger.kernel.org>
Reported-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-23 10:27:35 -06:00
Jens Axboe 569fd0ce96 blk-mq: fix iteration of busy bitmap
Commit 889fa31f00 was a bit too eager in reducing the loop count,
so we ended up missing queues in some configurations. Ensure that
our division rounds up, so that's not the case.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 889fa31f00 ("blk-mq: reduce unnecessary software queue looping")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-17 08:31:12 -06:00
Linus Torvalds d82312c808 Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-block
Pull block layer core bits from Jens Axboe:
 "This is the core pull request for 4.1.  Not a lot of stuff in here for
  this round, mostly little fixes or optimizations.  This pull request
  contains:

   - An optimization that speeds up queue runs on blk-mq, especially for
     the case where there's a large difference between nr_cpu_ids and
     the actual mapped software queues on a hardware queue.  From Chong
     Yuan.

   - Honor node local allocations for requests on legacy devices.  From
     David Rientjes.

   - Cleanup of blk_mq_rq_to_pdu() from me.

   - exit_aio() fixup from me, greatly speeding up exiting multiple IO
     contexts off exit_group().  For my particular test case, fio exit
     took ~6 seconds.  A typical case of both exposing RCU grace periods
     to user space, and serializing exit of them.

   - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only
     wait if __GFP_WAIT is set.  From Keith Busch.

   - blk-mq exports and two added helpers from Mike Snitzer, which will
     be used by the dm-mq code.

   - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang"

* 'for-4.1/core' of git://git.kernel.dk/linux-block:
  blk-mq: reduce unnecessary software queue looping
  aio: fix serial draining in exit_aio()
  blk-mq: cleanup blk_mq_rq_to_pdu()
  blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
  block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
  block: allocate request memory local to request queue
  blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
  blk-mq: export blk_mq_run_hw_queues
  blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
2015-04-16 21:49:16 -04:00
Chong Yuan 889fa31f00 blk-mq: reduce unnecessary software queue looping
In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
ctxs assigned to one hctx, they will all loop hctx->ctx_map.map_size
times. Here hctx->ctx_map.map_size is a const ALIGN(nr_cpu_ids, 8) / 8.
Especially, flush_busy_ctxs() is in hot code path. And it's unnecessary.
Change ->map_size to contain the actually mapped software queues, so we
only loop for as many iterations as we have to.

And remove cpumask setting and nr_ctx count in blk_mq_init_cpu_queues()
since they are all re-done in blk_mq_map_swqueue().
blk_mq_map_swqueue().

Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>

Updated by me for formatting and commenting.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-15 11:39:29 -06:00
Linus Torvalds ac2111753c blk-mq: initialize 'struct request' and associated data to zero
Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
pointer in scsi_init_cmd_errh() with the blk-mq code.

The sense_buffer pointer should have been initialized by the call to
scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
some non-repeatable memory corruptor.

This patch makes sure we initialize the whole struct request allocation
(and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
using __GFP_ZERO in the allocation.  The old code initialized a couple
of individual fields, leaving the rest undefined (although many of them
are then initialized in later phases, like blk_mq_rq_ctx_init() etc.

It's not entirely clear why this matters, but it's the rigth thing to do
regardless, and with 4.0 imminent this is the defensive "let's just make
sure everything is initialized properly" patch.

Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-11 13:42:16 -07:00
Wei Fang c76cbbcf40 blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()
Don't assign ->rq_timeout twice.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30 09:07:00 -06:00
Xiaoguang Wang f9018ac930 block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set()
At the beginning of blk_mq_alloc_tag_set(), we have already checked whether
'set->nr_hw_queues' is zero, so here remove this redundant check.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-30 09:04:27 -06:00
Keith Busch bfd343aa17 blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set
Return -EBUSY if we're unable to enter a queue immediately when
allocating a blk-mq request without __GFP_WAIT.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-13 08:30:55 -06:00
Mike Snitzer b94ec29640 blk-mq: export blk_mq_run_hw_queues
Rename blk_mq_run_queues to blk_mq_run_hw_queues, add async argument,
and export it.

DM's suspend support must be able to run the queue without starting
stopped hw queues.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-13 08:28:33 -06:00
Mike Snitzer b62c21b71f blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
Add a variant of blk_mq_init_queue that allows a previously allocated
queue to be initialized.  blk_mq_init_allocated_queue models
blk_init_allocated_queue -- which was also created for DM's use.

DM's approach to device creation requires a placeholder request_queue be
allocated for use with alloc_dev() but the decision about what type of
request_queue will be ultimately created is deferred until all component
devices referenced in the DM table are processed to determine the table
type (request-based, blk-mq request-based, or bio-based).

Also, because of DM's late finalization of the request_queue type
the call to blk_mq_register_disk() doesn't happen during alloc_dev().
Must export blk_mq_register_disk() so that DM can backfill the 'mq' dir
once the blk-mq queue is fully allocated.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-13 08:26:53 -06:00
Mike Snitzer 9a30b096b5 blk-mq: fix use of incorrect goto label in blk_mq_init_queue error path
If percpu_ref_init() fails the allocated q and hctxs must get cleaned
up; using 'err_map' doesn't allow that to happen.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-13 08:22:32 -06:00