1
0
Fork 0
Commit Graph

706556 Commits (a6a252e6491443c1c18eab7e254daee63d4a7a04)

Author SHA1 Message Date
Ming Lei a6a252e649 blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ
In case of IO scheduler we always pre-allocate one driver tag before
calling blk_insert_flush(), and flush request will be marked as
RQF_FLUSH_SEQ once it is in flush machinery.

So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle
the request, otherwise the flush request is dispatched to ->dispatch
list directly.

This is a preparation patch for not preallocating a driver tag for flush
requests, and for not treating flush requests as a special case. This is
similar to what the legacy path does.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:50 -06:00
Ming Lei 598906f814 blk-flush: use blk_mq_request_bypass_insert()
In the following patch, we will use RQF_FLUSH_SEQ to decide:

1) if the flag isn't set, the flush rq need to be inserted via
blk_insert_flush()

2) otherwise, the flush rq need to be dispatched directly since
it is in flush machinery now.

So we use blk_mq_request_bypass_insert() for requests of bypassing
flush machinery, just like the legacy path did.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Ming Lei b0850297c7 block: pass 'run_queue' to blk_mq_request_bypass_insert
Block flush need this function without running the queue, so add a
parameter controlling whether we run it or not.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Ming Lei 9c71c83c85 blk-flush: don't run queue for requests bypassing flush
blk_insert_flush() should only insert request since run queue always
follows it.

In case of bypassing flush, we don't need to run queue because every
blk_insert_flush() follows one run queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
Jianchao Wang 6d6f167ce7 blk-mq: put the driver tag of nxt rq before first one is requeued
When freeing the driver tag of the next rq with an I/O scheduler
configured, we get the first entry of the list. However, this can
race with requeue of a request, and we end up getting the wrong request
from the head of the list. Free the driver tag of next rq before the
failed one is requeued in the failure branch of queue_rq callback.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:38:40 -06:00
weiping zhang e840107322 blkcg: add sanity check for blkcg policy operations
blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs,
otherwise policy would register fail.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:31:15 -06:00
Ming Lei 88022d7201 blk-mq: don't handle failure in .get_budget
It is enough to just check if we can get the budget via .get_budget().
And we don't need to deal with device state change in .get_budget().

For SCSI, one issue to be fixed is that we have to call
scsi_mq_uninit_cmd() to free allocated ressources if SCSI device fails
to handle the request. And it isn't enough to simply call
blk_mq_end_request() to do that if this request is marked as
RQF_DONTPREP.

Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 12:31:08 -06:00
Ming Lei 826a70a08b SCSI: don't get target/host busy_count in scsi_mq_get_budget()
It is very expensive to atomic_inc/atomic_dec the host wide counter of
host->busy_count, and it should have been avoided via blk-mq's mechanism
of getting driver tag, which uses the more efficient way of sbitmap queue.

Also we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget()
and don't run queue if the counter becomes zero, so IO hang may be caused
if all requests are completed just before the current SCSI device
is added to shost->starved_list.

Fixes: 0df21c86bdbf(scsi: implement .get_budget and .put_budget for blk-mq)
Reported-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 08:19:25 -06:00
Christoph Hellwig e4f36b249b block: fix peeking requests during PM
We need to look for an active PM request until the next softbarrier
instead of looking for the first non-PM request.  Otherwise any cause
of request reordering might starve the PM request(s).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-04 08:17:06 -06:00
Bart Van Assche 21e768b442 blk-mq: Make blk_mq_get_request() error path less confusing
blk_mq_get_tag() can modify data->ctx. This means that in the
error path of blk_mq_get_request() data->ctx should be passed to
blk_mq_put_ctx() instead of local_ctx. Note: since blk_mq_put_ctx()
ignores its argument, this patch does not change any functionality.

References: commit 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:33:56 -06:00
weiping zhang c2e82a2348 blk-mq: fix nr_requests wrong value when modify it from sysfs
if blk-mq use "none" io scheduler, nr_request get a wrong value when
input a number > tag_set->queue_depth. blk_mq_tag_update_depth will get
the smaller one min(nr, set->queue_depth), and then q->nr_request get a
wrong value.

Reproduce:

echo none > /sys/block/nvme0n1/queue/scheduler
echo 1000000 > /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/nr_requests
1000000

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 12:21:06 -06:00
Jens Axboe a116895fc7 cdrom: hide CONFIG_CDROM menu selection
We don't need to expose this. The point is that drivers select
the uniform CDROM layer, if they need it, the user should not
have to make a conscious decision on whether to include this
separately or not.

Fixes: 2a750166a5 ("block: Rework drivers/cdrom/Makefile")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 11:02:15 -06:00
Christoph Hellwig ea435e1b93 block: add a poll_fn callback to struct request_queue
That we we can also poll non blk-mq queues.  Mostly needed for
the NVMe multipath code, but could also be useful elsewhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig 8ddcd65325 block: introduce GENHD_FL_HIDDEN
With this flag a driver can create a gendisk that can be used for I/O
submission inside the kernel, but which is not registered as user
facing block device.  This will be useful for the NVMe multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig 517bf3c306 block: don't look at the struct device dev_t in disk_devt
The hidden gendisks introduced in the next patch need to keep the dev
field in their struct device empty so that udev won't try to create
block device nodes for them.  To support that rewrite disk_devt to
look at the major and first_minor fields in the gendisk itself instead
of looking into the struct device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig ef71de8b15 block: add a blk_steal_bios helper
This helpers allows to bounce steal the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig f421e1d9ad block: provide a direct_make_request helper
This helper allows reinserting a bio into a new queue without much
overhead, but requires all queue limits to be the same for the upper
and lower queues, and it does not provide any recursion preventions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig 96222bcc73 block: add REQ_DRV bit
Set aside a bit in the request/bio flags for driver use.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Christoph Hellwig 8977f56384 block: move REQ_NOWAIT
This flag should be before the operation-specific REQ_NOUNMAP bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-03 10:31:48 -06:00
Jens Axboe 3e2cb3ad47 Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-4.15/block
Pull NVMe changes from Christoph:

"Below are the currently queue nvme updates for Linux 4.15.  There are
a few more things that could make it for this merge window, but I'd
like to get things into linux-next, especially for the unlikely case
that Linus decided to cut -rc8.

Highlights:
 - support for SGLs in the PCIe driver (Chaitanya Kulkarni)
 - disable I/O schedulers for the admin queue (Israel Rukshin)
 - various Fibre Channel fixes and enhancements (James Smart)
 - various refactoring for better code sharing between transports
   (Sagi Grimberg and me)

as well as lots of little bits from various contributors."
2017-11-03 10:28:51 -06:00
Minwoo Im a806c6c81e nvme: comment typo fixed in clearing AER
a tiny typo fixed in a comment.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-03 10:02:20 +03:00
Arnd Bergmann 474f5da235 skd: use ktime_get_real_seconds()
Like many storage drivers, skd uses an unsigned 32-bit number for
interchanging the current time with the firmware. This will overflow in
y2106 and is otherwise safe.

However, the get_seconds() function is generally considered deprecated
since the behavior is different between 32-bit and 64-bit architectures,
and using it may indicate a bigger problem.

To annotate that we've thought about this, let's add a comment here
and migrate to the ktime_get_real_seconds() function that consistently
returns a 64-bit number.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-02 08:27:21 -06:00
Arnd Bergmann c091fbe9a2 block: fix CDROM dependency on BLK_DEV
After the cdrom cleanup, I get randconfig warnings for some configurations:

warning: (BLK_DEV_IDECD && BLK_DEV_SR) selects CDROM which has unmet direct dependencies (BLK_DEV)

This adds an explicit BLK_DEV dependency for both drivers. The other
drivers that select 'CDROM' already have this and don't need a change.

Fixes: 2a750166a5 ("block: Rework drivers/cdrom/Makefile")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-02 08:26:55 -06:00
Keith Busch 3639efef8f nvme: Remove unused headers
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:43 +01:00
James Smart a96d4bd867 nvmet: fix fatal_err_work deadlock
Below is a stack trace for an issue that was reported.

What's happening is that the nvmet layer had it's controller kato
timeout fire, which causes it to schedule its fatal error handler
via the fatal_err_work element. The error handler is invoked, which
calls the transport delete_ctrl() entry point, and as the transport
tears down the controller, nvmet_sq_destroy ends up doing the final
put on the ctlr causing it to enter its free routine. The ctlr free
routine does a cancel_work_sync() on fatal_err_work element, which
then does a flush_work and wait_for_completion. But, as the wait is
in the context of the work element being flushed, its in a catch-22
and the thread hangs.

[  326.903131] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  326.909832] nvmet: ctrl 1 fatal error occurred!
[  327.643100] lpfc 0000:04:00.0: 0:6313 NVMET Defer ctx release xri
x114 flg x2
[  494.582064] INFO: task kworker/0:2:243 blocked for more than 120
seconds.
[  494.589638]       Not tainted 4.14.0-rc1.James+ #1
[  494.594986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  494.603718] kworker/0:2     D    0   243      2 0x80000000
[  494.609839] Workqueue: events nvmet_fatal_error_handler [nvmet]
[  494.616447] Call Trace:
[  494.619177]  __schedule+0x28d/0x890
[  494.623070]  schedule+0x36/0x80
[  494.626571]  schedule_timeout+0x1dd/0x300
[  494.631044]  ? dequeue_task_fair+0x592/0x840
[  494.635810]  ? pick_next_task_fair+0x23b/0x5c0
[  494.640756]  wait_for_completion+0x121/0x180
[  494.645521]  ? wake_up_q+0x80/0x80
[  494.649315]  flush_work+0x11d/0x1a0
[  494.653206]  ? wake_up_worker+0x30/0x30
[  494.657484]  __cancel_work_timer+0x10b/0x190
[  494.662249]  cancel_work_sync+0x10/0x20
[  494.666525]  nvmet_ctrl_put+0xa3/0x100 [nvmet]
[  494.671482]  nvmet_sq_:q+0x64/0xd0 [nvmet]
[  494.676540]  nvmet_fc_delete_target_queue+0x202/0x220 [nvmet_fc]
[  494.683245]  nvmet_fc_delete_target_assoc+0x6d/0xc0 [nvmet_fc]
[  494.689743]  nvmet_fc_delete_ctrl+0x137/0x1a0 [nvmet_fc]
[  494.695673]  nvmet_fatal_error_handler+0x30/0x40 [nvmet]
[  494.701589]  process_one_work+0x149/0x360
[  494.706064]  worker_thread+0x4d/0x3c0
[  494.710148]  kthread+0x109/0x140
[  494.713751]  ? rescuer_thread+0x380/0x380
[  494.718214]  ? kthread_park+0x60/0x60

Correct by having the fc transport convert to a different workq context
for the actual controller teardown which may call the cancel_work_sync.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:42 +01:00
James Smart 2b632970da nvme-fc: add dev_loss_tmo timeout and remoteport resume support
When a remoteport is unregistered (connectivity lost), the following
actions are taken:

 - the remoteport is marked DELETED
 - the time when dev_loss_tmo would expire is set in the remoteport
 - all controllers on the remoteport are reset.

After a controller resets, it will stall in a RECONNECTING state waiting
for one of the following:

 - the controller will continue to attempt reconnect per max_retries and
   reconnect_delay.  As no remoteport connectivity, the reconnect attempt
   will immediately fail.  If max reconnects has not been reached, a new
   reconnect_delay timer will be schedule.  If the current time plus
   another reconnect_delay exceeds when dev_loss_tmo expires on the remote
   port, then the reconnect_delay will be shortend to schedule no later
   than when dev_loss_tmo expires.  If max reconnect attempts are reached
   (e.g. ctrl_loss_tmo reached) or dev_loss_tmo ix exceeded without
   connectivity, the controller is deleted.
 - the remoteport is re-registered prior to dev_loss_tmo expiring.
   The resume of the remoteport will immediately attempt to reconnect
   each of its suspended controllers.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
[hch: updated to use nvme_delete_ctrl]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:41 +01:00
James Smart 3cec7f9de4 nvme: allow controller RESETTING to RECONNECTING transition
Transport will typically transition from LIVE to RESETTING when initially
performing a reset or recovering from an error.  Adding this transition
allows a transport to transition to RECONNECTING when it checks/waits for
connectivity then creates new transport connections and reinits the
controller.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:40 +01:00
James Smart 96e2480105 nvme-fc: check connectivity before initiating reconnects
Check remoteport connectivity before initiating reconnects

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:39 +01:00
James Smart ac7fe82b6f nvme-fc: add a dev_loss_tmo field to the remoteport
Add a dev_loss_tmo value, paralleling the SCSI FC transport, for device
connectivity loss.

The transport initializes the value in the nvme_fc_register_remoteport()
call. If the value is not set, a default of 60s is set.

Add a new routine to the api, nvme_fc_set_remoteport_devloss() routine,
which allows the lldd to dynamically update the value on an existing
remoteport.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:36 +01:00
James Smart 44c6ec77e1 nvme-fc: change ctlr state assignments during reset/reconnect
Clean up some of the controller state checks and add the
RESETTING->RECONNECTING state transition.

Specifically:
- the movement of the RESETTING state change and schedule of reset_work
  to core doesn't work wiht nvme_fc_error_recovery setting state to
  RECONNECTING before attempting to reset.  Remove the state change as
  the reset request does it.
- In the rare cases where an error occurs right as we're transitioning
  to LIVE, defer the controller start actions.
- In error handling on teardown of associations while performing initial
  controller creation - avoid quiesce calls on the admin_q.  They are
  unneeded.
- Add the RESETTING->RECONNECTING transition in the reset handler.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:23 +01:00
Sagi Grimberg 4054637c9b nvme: flush reset_work before safely continuing with delete operation
Prevent racing controller reset and delete flows. reset_work must not
ever self-requeue so flushing it suffices.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:17 +01:00
Sagi Grimberg 12fa1304ba nvme-rdma: reuse nvme_delete_ctrl when reconnect attempts expire
instead of just queueing delete work

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:07 +01:00
Christoph Hellwig 6cd53d14aa nvme: consolidate common code from ->reset_work
No change in behavior except that the FC code cancels two work items a
little later now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:06 +01:00
Christoph Hellwig e9bc25874c nvme-rdma: remove nvme_rdma_remove_ctrl
It is only used in two places, and some of the work done by it will
be taken into common code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:05 +01:00
Christoph Hellwig c5017e8570 nvme: move controller deletion to common code
Move the ->delete_work and the associated helpers to common code instead
of duplicating them in every driver.  This also adds the missing reference
get/put for the loop driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:04 +01:00
Christoph Hellwig 29c0964873 nvme-fc: merge __nvme_fc_schedule_delete_work into __nvme_fc_del_ctrl
No need to have two functions doing the same thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:04 +01:00
James Smart 71c691fd06 nvme-fc: avoid workqueue flush stalls
There's no need to wait for the full nvme_wq, which is now shared,
to flush. flush only the delete_work item.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sgi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:03 +01:00
Bart Van Assche 2a750166a5 block: Rework drivers/cdrom/Makefile
Instead of referring from inside drivers/cdrom/Makefile to all the
drivers that use this driver, let these drivers select the cdrom
driver. This change makes the cdrom build code follow the approach
that is used for most other drivers, namely refer from the higher
layers to the lower layer instead of from the lower layer to the
higher layers.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:33:57 -06:00
Ming Lei 1f460b63d4 blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
SCSI restarts its queue in scsi_end_request() automatically, so we don't
need to handle this case in blk-mq.

Especailly any request won't be dequeued in this case, we needn't to
worry about IO hang caused by restart vs. dispatch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:34 -06:00
Ming Lei 358a3a6bcc blk-mq: don't handle TAG_SHARED in restart
Now restart is used in the following cases, and TAG_SHARED is for
SCSI only.

1) .get_budget() returns BLK_STS_RESOURCE
- if resource in target/host level isn't satisfied, this SCSI device
will be added in shost->starved_list, and the whole queue will be rerun
(via SCSI's built-in RESTART) in scsi_end_request() after any request
initiated from this host/targe is completed. Forget to mention, host level
resource can't be an issue for blk-mq at all.

- the same is true if resource in the queue level isn't satisfied.

- if there isn't outstanding request on this queue, then SCSI's RESTART
can't work(blk-mq's can't work too), and the queue will be run after
SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
RESTART when this request is finished

2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there isn't onprogressing request on this queue, the queue
will be run after SCSI_QUEUE_DELAY

- otherwise, SCSI's RESTART covers the rerun.

3) blk_mq_get_driver_tag() failed
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
allocation.

In one word, SCSI's built-in RESTART is enough to cover the queue
rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.

In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
running I/O on these 8 luns concurrently.

Aslo Roman Pen reported the current RESTART is very expensive especialy
when there are lots of LUNs attached in one host, such as in his
test, RESTART causes half of IOPS be cut.

Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:33 -06:00
Ming Lei 0df21c86bd scsi: implement .get_budget and .put_budget for blk-mq
We need to tell blk-mq to reserve resources before queuing one request,
so implement these two callbacks. Then blk-mq can avoid to dequeue
request too early, and IO merging can be improved a lot.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei aeec77629a scsi: allow passing in null rq to scsi_prep_state_check()
In the following patch, we will implement scsi_get_budget()
which need to call scsi_prep_state_check() when rq isn't
dequeued yet.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei b347689ffb blk-mq-sched: improve dispatching from sw queue
SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. However, there is also a queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver.
Unfortunately it is easy to cause queue busy because of the small
.cmd_per_lun.  Once these requests are flushed out, they have to stay in
hctx->dispatch, and no bio merge can happen on these requests, and
sequential IO performance is harmed.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
from a sw queue, so that we can dispatch them in scheduler's way. We can
then avoid dequeueing too many requests from sw queue, since we don't
flush ->dispatch completely.

This patch improves dispatching from sw queue by using the .get_budget
and .put_budget callbacks.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei de14829740 blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

Currently blk-mq always dequeues the request first, then calls
.queue_rq() to dispatch the request to lld. One obvious issue with this
approach is that I/O merging may not be successful, because when the
per-request-queue depth can't be respected, .queue_rq() has to return
BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
list. This means it never gets a chance to be merged with other IO.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
If the budget for queueing I/O can't be satisfied, we don't need to
dequeue request at all. Hence the request can be left in the IO
scheduler queue, for more merging opportunities.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei 63ba8e31c3 block: kyber: check if there are requests in ctx in kyber_has_work()
There may be request in sw queue, and not fetched to domain queue
yet, so check it in kyber_has_work().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei 7930d0a00f sbitmap: introduce __sbitmap_for_each_set()
For blk-mq, we need to be able to iterate software queues starting
from any queue in a round robin fashion, so introduce this helper.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei caf8eb0d60 blk-mq-sched: move actual dispatching into one helper
So that it becomes easy to support to dispatch from sw queue in the
following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Ming Lei 5e3d02bbaf blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch
When the hw queue is busy, we shouldn't take requests from the scheduler
queue any more, otherwise it is difficult to do IO merge.

This patch fixes the awful IO performance on some SCSI devices(lpfc,
qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
hw queue is busy.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-01 08:20:02 -06:00
Scott Bauer 0e7d3a8d4e MAINTAINERS: Remove Rafael from Opal maintainers.
He is no longer working on storage.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-31 14:39:20 -06:00
Liang Chen 330a4db89d bcache: explicitly destroy mutex while exiting
mutex_destroy does nothing most of time, but it's better to call
it to make the code future proof and it also has some meaning
for like mutex debug.

As Coly pointed out in a previous review, bcache_exit() may not be
able to handle all the references properly if userspace registers
cache and backing devices right before bch_debug_init runs and
bch_debug_init failes later. So not exposing userspace interface
until everything is ready to avoid that issue.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-30 15:57:54 -06:00