1
0
Fork 0
Commit Graph

3523 Commits (085c17ff6bacccb45201098e71f95675e08f0e17)

Author SHA1 Message Date
weiping zhang 37dcd6570f block, bfq: fix error handle in bfq_init
if elv_register fail, bfq_pool should be free.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 15:35:54 -06:00
Christoph Hellwig 74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig 807d4af2f6 block: add a __disk_get_part helper
This helper allows looking up a partion under RCU protection without
grabbing a reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:52 -06:00
Christoph Hellwig de65b01232 block: reject attempts to allocate more than DISK_MAX_PARTS partitions
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:50 -06:00
Bart Van Assche 130d733a61 block: Warn if blk_queue_rq_timed_out() is called for a blk-mq queue
The timeout handler set by blk_queue_rq_timed_out() is only used
in single queue mode. Calling this function for blk-mq drivers is
wrong. Hence issue a warning if this function is called by a blk-mq
driver.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:30 -06:00
Bart Van Assche 4ddd56b003 block: Relax a check in blk_start_queue()
Calling blk_start_queue() from interrupt context with the queue
lock held and without disabling IRQs, as the skd driver does, is
safe. This patch avoids that loading the skd driver triggers the
following warning:

WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
RIP: 0010:blk_start_queue+0x84/0xa0
Call Trace:
 skd_unquiesce_dev+0x12a/0x1d0 [skd]
 skd_complete_internal+0x1e7/0x5a0 [skd]
 skd_complete_other+0xc2/0xd0 [skd]
 skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
 skd_isr+0x14f/0x180 [skd]
 irq_forced_thread_fn+0x2a/0x70
 irq_thread+0x144/0x1a0
 kthread+0x125/0x140
 ret_from_fork+0x2a/0x40

Fixes: commit a038e25364 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:45:29 -06:00
Bart Van Assche 6d2cf6f2b4 genhd: Annotate all part and part_tbl pointer dereferences
Annotate gendisk.part_tbl and disk_part_tbl.part dereferences with
rcu_dereference_protected(). This patch does not change the behavior
of the modified code but ensures that sparse does not complain about
disk->part_tbl manipulations nor about part_tbl->part accesses.
Additionally, improve documentation of the locking requirements of
the modified functions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Bart Van Assche f846593391 blk-mq-debugfs: Declare a local symbol static
This was detected by sparse.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Bart Van Assche d352ae205d blk-mq: Make blk_mq_reinit_tagset() calls easier to read
Since blk_mq_ops.reinit_request is only called from inside
blk_mq_reinit_tagset(), make this function pointer an argument of
blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
This patch does not change any functionality but makes
blk_mq_reinit_tagset() calls easier to read and to analyze.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: James Smart <james.smart@broadcom.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Bart Van Assche 37f02e5fb3 block: Unexport blk_queue_end_tag()
This function is only used inside the block layer core. Hence
unexport it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Bart Van Assche 4d6062193b block: Fix two comments that refer to .queue_rq() return values
Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
.queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
value. Hence refer to the former in comments about .queue_rq()
return values.

Fixes: commit 39a70c76b8 ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:36:58 -06:00
Christoph Hellwig c005390374 blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL
While pci_irq_get_affinity should never fail for SMP kernel that
implement the affinity mapping, it will always return NULL in the
UP case, so provide a fallback mapping of all queues to CPU 0 in
that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-18 08:08:14 -06:00
Keith Busch 3280d66a63 blk-mq: Fix queue usage on failed request allocation
blk_mq_get_request() does not release the callers queue usage counter
when allocation fails. The caller still needs to account for its own
queue usage when it is unable to allocate a request.

Fixes: 1ad43c0078 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

Reported-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-15 07:22:34 -06:00
Greg Kroah-Hartman d985524680 Merge 4.13-rc5 into char-misc-next
We want the firmware, and other changes, in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-14 13:29:31 -07:00
Ritesh Harjani b3193bc0dc cfq: Give a chance for arming slice idle timer in case of group_idle
In below scenario blkio cgroup does not work as per their assigned
weights :-
1. When the underlying device is nonrotational with a single HW queue
with depth of >= CFQ_HW_QUEUE_MIN
2. When the use case is forming two blkio cgroups cg1(weight 1000) &
cg2(wight 100) and two processes(file1 and file2) doing sync IO in
their respective blkio cgroups.

For above usecase result of fio (without this patch):-
file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
  write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
<...>
file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
  write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
<...>
// both the process BW is equal even though they belong to diff.
cgroups with weight of 1000(cg1) and 100(cg2)

In above case (for non rotational NCQ devices),
as soon as the request from cg1 is completed and even
though it is provided with higher set_slice=10, because of CFQ
algorithm when the driver tries to fetch the request, CFQ expires
this group without providing any idle time nor weight priority
and schedules another cfq group (in this case cg2).
And thus both cfq groups(cg1 & cg2) keep alternating to get the
disk time and hence loses the cgroup weight based scheduling.

Below patch gives a chance to cfq algorithm (cfq_arm_slice_timer)
to arm the slice timer in case group_idle is enabled.
In case if group_idle is also not required (including for nonrotational
NCQ drives), we need to explicitly set group_idle = 0 from sysfs for
such cases.

With this patch result of fio(for above usecase) :-
file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
  write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
<..>
file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
  write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
<..>
// In this processes BW is as per their respective cgroups weight.

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11 09:00:26 -06:00
Paolo Valente edaf94285b block, bfq: boost throughput with flash-based non-queueing devices
When a queue associated with a process remains empty, there are cases
where throughput gets boosted if the device is idled to await the
arrival of a new I/O request for that queue. Currently, BFQ assumes
that one of these cases is when the device has no internal queueing
(regardless of the properties of the I/O being served). Unfortunately,
this condition has proved to be too general. So, this commit refines it
as "the device has no internal queueing and is rotational".

This refinement provides a significant throughput boost with random
I/O, on flash-based storage without internal queueing. For example, on
a HiKey board, throughput increases by up to 125%, growing, e.g., from
6.9MB/s to 15.6MB/s with two or three random readers in parallel.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11 08:58:03 -06:00
Paolo Valente d5be3fefc9 block,bfq: refactor device-idling logic
The logic that decides whether to idle the device is scattered across
three functions. Almost all of the logic is in the function
bfq_bfqq_may_idle, but (1) part of the decision is made in
bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
switch off idling regardless of the output of bfq_bfqq_may_idle. In
addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
their decisions as a function of parameters that are used, for similar
purposes, also in bfq_bfqq_may_idle. This commit addresses these
issues by moving all the logic into bfq_bfqq_may_idle.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11 08:58:02 -06:00
Jens Axboe e743eb1ecd block: remove unused syncfull/asyncfull queue flags
We haven't used these in years, but somehow the definitions still
remained. Kill them, and renumber the QUEUE_FLAG_ space. We had
a hole in the beginning of the space, too.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-10 08:25:38 -06:00
Bart Van Assche d4acf3650c block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time
The blk_mq_delay_kick_requeue_list() function is used by the device
mapper and only by the device mapper to rerun the queue and requeue
list after a delay. This function is called once per request that
gets requeued. Modify this function such that the queue is run once
per path change event instead of once per request that is requeued.

Fixes: commit 2849450ad3 ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 20:24:38 -06:00
Christoph Hellwig f86e28c4dc bio-integrity: only verify integrity on the lowest stacked driver
This gets us back to the behavior in 4.12 and earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 20:24:36 -06:00
Milan Broz c775d2098d bio-integrity: Fix regression if profile verify_fn is NULL
In dm-integrity target we register integrity profile that have
both generate_fn and verify_fn callbacks set to NULL.

This is used if dm-integrity is stacked under a dm-crypt device
for authenticated encryption (integrity payload contains authentication
tag and IV seed).

In this case the verification is done through own crypto API
processing inside dm-crypt; integrity profile is only holder
of these data. (And memory is owned by dm-crypt as well.)

After the commit (and previous changes)
  Commit 7c20f11680
  Author: Christoph Hellwig <hch@lst.de>
  Date:   Mon Jul 3 16:58:43 2017 -0600

    bio-integrity: stop abusing bi_end_io

we get this crash:

: BUG: unable to handle kernel NULL pointer dereference at   (null)
: IP:   (null)
: *pde = 00000000
...
:
: Workqueue: kintegrityd bio_integrity_verify_fn
: task: f48ae180 task.stack: f4b5c000
: EIP:   (null)
: EFLAGS: 00210286 CPU: 0
: EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
: ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
:  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
: CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
: Call Trace:
:  ? bio_integrity_process+0xe3/0x1e0
:  bio_integrity_verify_fn+0xea/0x150
:  process_one_work+0x1c7/0x5c0
:  worker_thread+0x39/0x380
:  kthread+0xd6/0x110
:  ? process_one_work+0x5c0/0x5c0
:  ? kthread_worker_fn+0x100/0x100
:  ? kthread_worker_fn+0x100/0x100
:  ret_from_fork+0x19/0x24
: Code:  Bad EIP value.
: EIP:   (null) SS:ESP: 0068:f4b5dea4
: CR2: 0000000000000000

Patch just skip the whole verify workqueue if verify_fn is set to NULL.

Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
Signed-off-by: Milan Broz <gmazyland@gmail.com>
[hch: trivial whitespace fix]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 20:24:26 -06:00
Jens Axboe b8d62b3a9c blk-mq: enable checking two part inflight counts at the same time
Modify blk_mq_in_flight() to count both a partition and root at
the same time. Then we only have to call it once, instead of
potentially looping the tags twice.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:33 -06:00
Jens Axboe f299b7c7a9 blk-mq: provide internal in-flight variant
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:28 -06:00
Jens Axboe 0609e0efc5 block: make part_in_flight() take an array of two ints
Instead of returning the count that matches the partition, pass
in an array of two ints. Index 0 will be filled with the inflight
count for the partition in question, and index 1 will filled
with the root inflight count, if the partition passed in is not the
root.

This is in preparation for being able to calculate both in one
go.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:20 -06:00
Jens Axboe d62e26b3ff block: pass in queue to inflight accounting
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:16 -06:00
Jens Axboe 7f5562d5ec blk-mq-tag: check for NULL rq when iterating tags
Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since there's a window of time where the bit is set, but we haven't
dynamically assigned the tags->rqs[] array position yet.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 13:09:08 -06:00
Christoph Hellwig 9346beb9d0 bio-integrity: move the bio integrity profile check earlier in bio_integrity_prep
This makes the code more obvious, and moves the most likely branch first
in the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-09 10:00:29 -06:00
Sagi Grimberg 24c5dc6610 block: Add rdma affinity based queue mapping helper
Like pci and virtio, we add a rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.

Reviewed-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-08 14:58:03 -04:00
Jan Kara 3d289d6882 block: Add comment to submit_bio_wait()
submit_bio_wait() does not consume bio reference. Add comment about
that.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-02 08:25:04 -06:00
Ming Lei 1ad43c0078 blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
When blk_mq_get_request() failed, preempt counter isn't
released, and blk_mq_make_request() doesn't release the counter
too.

This patch fixes the issue, and makes sure that preempt counter
is only held if rq is allocated successfully. The same policy is
applied on .q_usage_counter too.

Signed-off-by: Ming Lei <minlei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-02 08:23:57 -06:00
Jens Axboe b7a71e66d4 blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled
We recently had a bug in the IPR SCSI driver, where it would end up
making the SCSI mid layer run the mq hardware queue with interrupts
disabled. This isn't legal, since the software queue locking relies
on never being grabbed from interrupt context. Additionally, drivers
that set BLK_MQ_F_BLOCKING may schedule from this context.

Add a WARN_ON_ONCE() to catch bad users up front.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-01 09:28:24 -06:00
Paolo Valente 46d556e6aa block, bfq: consider also in_service_entity to state whether an entity is active
Groups of BFQ queues are represented by generic entities in BFQ. When
a queue belonging to a parent entity is deactivated, the parent entity
may need to be deactivated too, in case the deactivated queue was the
only active queue for the parent entity. This deactivation may need to
be propagated upwards if the entity belongs, in its turn, to a further
higher-level entity, and so on. In particular, the upward propagation
of deactivation stops at the first parent entity that remains active
even if one of its child entities has been deactivated.

To decide whether the last non-deactivation condition holds for a
parent entity, BFQ checks whether the field next_in_service is still
not NULL for the parent entity, after the deactivation of one of its
child entity. If it is not NULL, then there are certainly other active
entities in the parent entity, and deactivations can stop.

Unfortunately, this check misses a corner case: if in_service_entity
is not NULL, then next_in_service may happen to be NULL, although the
parent entity is evidently active. This happens if: 1) the entity
pointed by in_service_entity is the only active entity in the parent
entity, and 2) according to the definition of next_in_service, the
in_service_entity cannot be considered as next_in_service. See the
comments on the definition of next_in_service for details on this
second point.

Hitting the above corner case causes crashes.

To address this issue, this commit:
1) Extends the above check on only next_in_service to controlling both
next_in_service and in_service_entity (if any of them is not NULL,
then no further deactivation is performed)
2) Improves the (important) comments on how next_in_service is defined
and updated; in particular it fixes a few rather obscure paragraphs

Reported-by: Eric Wheeler <bfq-sched@lists.ewheeler.net>
Reported-by: Rick Yiu <rick_yiu@htc.com>
Reported-by: Tom X Nguyen <tom81094@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Eric Wheeler <bfq-sched@lists.ewheeler.net>
Tested-by: Rick Yiu <rick_yiu@htc.com>
Tested-by: Laurentiu Nicola <lnicola@dend.ro>
Tested-by: Tom X Nguyen <tom81094@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 15:32:49 -06:00
Paolo Valente 6ab1d8da97 block, bfq: reset in_service_entity if it becomes idle
BFQ implements hierarchical scheduling by representing each group of
queues with a generic parent entity. For each parent entity, BFQ
maintains an in_service_entity pointer: if one of the child entities
happens to be in service, in_service_entity points to it.  The
resetting of these pointers happens only on queue expirations: when
the in-service queue is expired, i.e., stops to be the queue in
service, BFQ resets all in_service_entity pointers along the
parent-entity path from this queue to the root entity.

Functions handling the scheduling of entities assume, naturally, that
in-service entities are active, i.e., have pending I/O requests (or,
as a special case, even if they have no pending requests, they are
expected to receive a new request very soon, with the scheduler idling
the storage device while waiting for such an event). Unfortunately,
the above resetting scheme of the in_service_entity pointers may cause
this assumption to be violated.  For example, the in-service queue may
happen to remain without requests because of a request merge. In this
case the queue does become idle, and all related data structures are
updated accordingly. But in_service_entity still points to the queue
in the parent entity. This inconsistency may even propagate to
higher-level parent entities, if they happen to become idle as well,
as a consequence of the leaf queue becoming idle. For this queue and
parent entities, scheduling functions have an undefined behaviour,
and, as reported, may easily lead to kernel crashes or hangs.

This commit addresses this issue by simply resetting the
in_service_entity field also when it is detected to point to an entity
becoming idle (regardless of why the entity becomes idle).

Reported-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 15:32:47 -06:00
Jens Axboe 18e9781d44 blk-mq: blk_mq_requeue_work() doesn't need to save IRQ flags
We know we're in process context, so don't bother using the
IRQ safe versions of the spin lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li 35fe6d7632 block: use standard blktrace API to output cgroup info for debug notes
Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
way. Now we have standard blktrace API for this, so convert them to use
it.

Note, this changes the behavior a little bit. cgroup info isn't output
by default, we only do this with 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default too, we only do this with
'blk_cgname' option enabled. Also cgroup info is output in different
position of the note string. I think these behavior changes aren't a big
issue (actually we make trace data shorter which is good), since the
blktrace note is solely for debugging.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li 007cc56b7e block: always attach cgroup info into bio
blkcg_bio_issue_check() already gets blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no point we don't attach the cgroup info into bio at
blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup
info.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Christoph Hellwig 76451d79bd blk-mq: map queues to all present CPUs
We already do this for PCI mappings, and the higher level code now
expects that CPU on/offlining doesn't have an affect on the queue
mappings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-24 10:01:31 -06:00
Christoph Hellwig 765e40b675 block: disable runtime-pm for blk-mq
The blk-mq code lacks support for looking at the rpm_status field, tracking
active requests and the RQF_PM flag.

Due to the default switch to blk-mq for scsi people start to run into
suspend / resume issue due to this fact, so make sure we disable the runtime
PM functionality until it is properly implemented.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-24 08:46:40 -06:00
Greg Kroah-Hartman 24a81a2c25 Merge 4.13-rc2 into char-misc-next
We want the char/misc driver fixes in here as well to handle future
changes.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-23 19:58:30 -07:00
Logan Gunthorpe 133d55cdb2 block: order /proc/devices by major number
Presently, the order of the block devices listed in /proc/devices is not
entirely sequential. If a block device has a major number greater than
BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
limit (chosen to be 512). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:42:20 +02:00
Hou Tao 3f7cb4f413 bfq: dispatch request to prevent queue stalling after the request completion
There are mq devices (eg., virtio-blk, nbd and loopback) which don't
invoke blk_mq_run_hw_queues() after the completion of a request.
If bfq is enabled on these devices and the slice_idle attribute or
strict_guarantees attribute is set as zero, it is possible that
after a request completion the remaining requests of busy bfq queue
will stalled in the bfq schedule until a new request arrives.

To fix the scheduler latency problem, we need to check whether or not
all issued requests have completed and dispatch more requests to driver
if there is no request in driver.

The problem can be reproduced by running the following script
on a virtio-blk device with nr_hw_queues as 1:

#!/bin/sh

dev=vdb
# mount point for dev
mp=/tmp/mnt
cd $mp

job=strict.job
cat <<EOF > $job
[global]
direct=1
bs=4k
size=256M
rw=write
ioengine=libaio
iodepth=128
runtime=5
time_based

[1]
filename=1.data

[2]
new_group
filename=2.data
EOF

echo bfq > /sys/block/$dev/queue/scheduler
echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
fio $job

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-12 08:32:04 -06:00
Hou Tao 38c9140740 bfq: fix typos in comments about B-WF2Q+ algorithm
The start time of eligible entity should be less than or equal to
the current virtual time, and the entity in idle tree has a finish
time being greater than the current virtual time.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-12 08:32:02 -06:00
Linus Torvalds 130568d5ea Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
 "This is a followup for block changes, that didn't make the initial
  pull request. It's a bit of a mixed bag, this contains:

   - A followup pull request from Sagi for NVMe. Outside of fixups for
     NVMe, it also includes a series for ensuring that we properly
     quiesce hardware queues when browsing live tags.

   - Set of integrity fixes from Dmitry (mostly), fixing various issues
     for folks using DIF/DIX.

   - Fix for a bug introduced in cciss, with the req init changes. From
     Christoph.

   - Fix for a bug in BFQ, from Paolo.

   - Two followup fixes for lightnvm/pblk from Javier.

   - Depth fix from Ming for blk-mq-sched.

   - Also from Ming, performance fix for mtip32xx that was introduced
     with the dynamic initialization of commands"

* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
  block: call bio_uninit in bio_endio
  nvmet: avoid unneeded assignment of submit_bio return value
  nvme-pci: add module parameter for io queue depth
  nvme-pci: compile warnings in nvme_alloc_host_mem()
  nvmet_fc: Accept variable pad lengths on Create Association LS
  nvme_fc/nvmet_fc: revise Create Association descriptor length
  lightnvm: pblk: remove unnecessary checks
  lightnvm: pblk: control I/O flow also on tear down
  cciss: initialize struct scsi_req
  null_blk: fix error flow for shared tags during module_init
  block: Fix __blkdev_issue_zeroout loop
  nvme-rdma: unconditionally recycle the request mr
  nvme: split nvme_uninit_ctrl into stop and uninit
  virtio_blk: quiesce/unquiesce live IO when entering PM states
  mtip32xx: quiesce request queues to make sure no submissions are inflight
  nbd: quiesce request queues to make sure no submissions are inflight
  nvme: kick requeue list when requeueing a request instead of when starting the queues
  nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
  nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
  ...
2017-07-11 15:36:52 -07:00
Shaohua Li b222dd2fdd block: call bio_uninit in bio_endio
bio_free isn't a good place to free cgroup info. There are a
lot of cases bio is allocated in special way (for example, in stack) and
never gets called by bio_put hence bio_free, we are leaking memory. This
patch moves the free to bio endio, which should be called anyway. The
bio_uninit call in bio_free is kept, in case the bio never gets called
bio endio.

This assumes ->bi_end_io() doesn't access cgroup info, which seems true
in my audit.

This along with Christoph's integrity patch should fix the memory leak
issue.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-10 12:43:33 -06:00
Linus Torvalds c856863988 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc compat stuff updates from Al Viro:
 "This part is basically untangling various compat stuff. Compat
  syscalls moved to their native counterparts, getting rid of quite a
  bit of double-copying and/or set_fs() uses. A lot of field-by-field
  copyin/copyout killed off.

   - kernel/compat.c is much closer to containing just the
     copyin/copyout of compat structs. Not all compat syscalls are gone
     from it yet, but it's getting there.

   - ipc/compat_mq.c killed off completely.

   - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
     drivers/block/floppy.c where they belong. Yes, there are several
     drivers that implement some of the same ioctls. Some are m68k and
     one is 32bit-only pmac. drivers/block/floppy.c is the only one in
     that bunch that can be built on biarch"

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mqueue: move compat syscalls to native ones
  usbdevfs: get rid of field-by-field copyin
  compat_hdio_ioctl: get rid of set_fs()
  take floppy compat ioctls to sodding floppy.c
  ipmi: get rid of field-by-field __get_user()
  ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
  rt_sigtimedwait(): move compat to native
  select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
  put_compat_rusage(): switch to copy_to_user()
  sigpending(): move compat to native
  getrlimit()/setrlimit(): move compat to native
  times(2): move compat to native
  compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
  fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
  do_sigaltstack(): lift copying to/from userland into callers
  take compat_sys_old_getrlimit() to native syscall
  trim __ARCH_WANT_SYS_OLD_GETRLIMIT
2017-07-06 20:57:13 -07:00
Damien Le Moal 615d22a51c block: Fix __blkdev_issue_zeroout loop
The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
with a maximum number of bvec (pages) equal to

min(nr_sects, (sector_t)BIO_MAX_PAGES)

This works since the requested number of bvecs will always be limited
to the absolute maximum number supported (BIO_MAX_PAGES), but this is
ineficient as too many bvec entries may be requested due to the
different units being used in the min() operation (number of sectors vs
number of pages).
To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
correctly calculate the number of bvecs for zeroout BIOs as the issuing
loop progresses. The calculation is done using consistent units and
makes sure that the number of pages return is at least 1 (for cases
where the number of sectors is less that the number of sectors in
a page).

Also remove a trailing space after the bit shift in the internal loop
min() call.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-06 09:43:20 -06:00
kbuild test robot ea4d12dabf bio-integrity: fix boolreturn.cocci warnings
block/bio-integrity.c:318:10-11: WARNING: return of 0/1 in function 'bio_integrity_prep' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

Fixes: e23947bd76 ("bio-integrity: fold bio_integrity_enabled to bio_integrity_prep")
CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-04 16:11:53 -06:00
Linus Torvalds 03ffbcdd78 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq department delivers:

   - Expand the generic infrastructure handling the irq migration on CPU
     hotplug and convert X86 over to it. (Thomas Gleixner)

     Aside of consolidating code this is a preparatory change for:

   - Finalizing the affinity management for multi-queue devices. The
     main change here is to shut down interrupts which are affine to a
     outgoing CPU and reenabling them when the CPU comes online again.
     That avoids moving interrupts pointlessly around and breaking and
     reestablishing affinities for no value. (Christoph Hellwig)

     Note: This contains also the BLOCK-MQ and NVME changes which depend
     on the rework of the irq core infrastructure. Jens acked them and
     agreed that they should go with the irq changes.

   - Consolidation of irq domain code (Marc Zyngier)

   - State tracking consolidation in the core code (Jeffy Chen)

   - Add debug infrastructure for hierarchical irq domains (Thomas
     Gleixner)

   - Infrastructure enhancement for managing generic interrupt chips via
     devmem (Bartosz Golaszewski)

   - Constification work all over the place (Tobias Klauser)

   - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

   - The usual set of fixes, updates and enhancements all over the
     place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
  irqchip/or1k-pic: Fix interrupt acknowledgement
  irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
  irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
  nvme: Allocate queues for all possible CPUs
  blk-mq: Create hctx for each present CPU
  blk-mq: Include all present CPUs in the default queue mapping
  genirq: Avoid unnecessary low level irq function calls
  genirq: Set irq masked state when initializing irq_desc
  genirq/timings: Add infrastructure for estimating the next interrupt arrival time
  genirq/timings: Add infrastructure to track the interrupt timings
  genirq/debugfs: Remove pointless NULL pointer check
  irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
  irqchip/gic-v3-its: Add ACPI NUMA node mapping
  irqchip/gic-v3-its-platform-msi: Make of_device_ids const
  irqchip/gic-v3-its: Make of_device_ids const
  irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
  irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
  dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
  genirq/irqdomain: Remove auto-recursive hierarchy support
  irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
  ...
2017-07-03 16:50:31 -07:00
Christoph Hellwig 7c20f11680 bio-integrity: stop abusing bi_end_io
And instead call directly into the integrity code from bio_end_io.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 17:00:59 -06:00
Dmitry Monakhov 63573e359d bio-integrity: Restore original iterator on verify stage
Currently ->verify_fn not woks at all because at the moment it is called
bio->bi_iter.bi_size == 0, so we do not iterate integrity bvecs at all.

In order to perform verification we need to know original data vector,
with new bvec rewind API this is trivial.

testcase: 3c6509eaa8

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
[hch: adopted for new status values]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:29 -06:00
Dmitry Monakhov 128b6f9fdd t10-pi: Move opencoded contants to common header
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:25 -06:00
Dmitry Monakhov e23947bd76 bio-integrity: fold bio_integrity_enabled to bio_integrity_prep
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.

In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
	return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:24 -06:00
Dmitry Monakhov fbd08e7673 bio-integrity: fix interface for bio_integrity_trim
bio_integrity_trim inherent it's interface from bio_trim and accept
offset and size, but this API is error prone because data offset
must always be insync with bio's data offset. That is why we have
integrity update hook in bio_advance()

So only meaningful values are: offset == 0, sectors == bio_sectors(bio)
Let's just remove them completely.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:22 -06:00
Dmitry Monakhov 309a62fa3a bio-integrity: bio_integrity_advance must update integrity seed
SCSI drivers do care about bip_seed so we must update it accordingly.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:21 -06:00
Dmitry Monakhov 376a78abf5 bio-integrity: bio_trim should truncate integrity vector accordingly
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:56:19 -06:00
Ming Lei 32825c45ff blk-mq-sched: fix performance regression of mq-deadline
When mq-deadline is taken, IOPS of sequential read and
seqential write is observed more than 20% drop on sata(scsi-mq)
devices, compared with using 'none' scheduler.

The reason is that the default nr_requests for scheduler is
too big for small queuedepth devices, and latency is increased
much.

Since the principle of taking 256 requests for mq scheduler
is based on 128 queue depth, this patch changes into
double size of min(hw queue_depth, 128).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:54:09 -06:00
Paolo Valente 431b17f9d5 block, bfq: don't change ioprio class for a bfq_queue on a service tree
On each deactivation or re-scheduling (after being served) of a
bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
to perform pending updates of ioprio, weight and ioprio class for the
bfq_queue. BFQ also invokes this function on I/O-request dispatches,
to raise or lower weights more quickly when needed, thereby improving
latency. However, the entity representing the bfq_queue may be on the
active (sub)tree of a service tree when this happens, and, although
with a very low probability, the bfq_queue may happen to also have a
pending change of its ioprio class. If both conditions hold when
__bfq_entity_update_weight_prio() is invoked, then the entity moves to
a sort of hybrid state: the new service tree for the entity, as
returned by bfq_entity_service_tree(), differs from service tree on
which the entity still is. The functions that handle activations and
deactivations of entities do not cope with such a hybrid state (and
would need to become more complex to cope).

This commit addresses this issue by just making
__bfq_entity_update_weight_prio() not perform also a possible pending
change of ioprio class, when invoked on an I/O-request dispatch for a
bfq_queue. Such a change is thus postponed to when
__bfq_entity_update_weight_prio() is invoked on deactivation or
re-scheduling of the bfq_queue.

Reported-by: Marco Piazza <mpiazza@gmail.com>
Reported-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03 16:50:00 -06:00
Linus Torvalds 9bd42183b9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
     debug checks earlier into the bootup. This turns silent and
     sporadically deadly bugs into nice, deterministic splats. Fix some
     of the splats that triggered. (Thomas Gleixner)

   - A round of restructuring and refactoring of the load-balancing and
     topology code (Peter Zijlstra)

   - Another round of consolidating ~20 of incremental scheduler code
     history: this time in terms of wait-queue nomenclature. (I didn't
     get much feedback on these renaming patches, and we can still
     easily change any names I might have misplaced, so if anyone hates
     a new name, please holler and I'll fix it.) (Ingo Molnar)

   - sched/numa improvements, fixes and updates (Rik van Riel)

   - Another round of x86/tsc scheduler clock code improvements, in hope
     of making it more robust (Peter Zijlstra)

   - Improve NOHZ behavior (Frederic Weisbecker)

   - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
     Bristot de Oliveira)

   - Simplify and optimize the topology setup code (Lauro Ramos
     Venancio)

   - Debloat and decouple scheduler code some more (Nicolas Pitre)

   - Simplify code by making better use of llist primitives (Byungchul
     Park)

   - ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  sched/cputime: Refactor the cputime_adjust() code
  sched/debug: Expose the number of RT/DL tasks that can migrate
  sched/numa: Hide numa_wake_affine() from UP build
  sched/fair: Remove effective_load()
  sched/numa: Implement NUMA node level wake_affine()
  sched/fair: Simplify wake_affine() for the single socket case
  sched/numa: Override part of migrate_degrades_locality() when idle balancing
  sched/rt: Move RT related code from sched/core.c to sched/rt.c
  sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
  sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
  sched/fair: Spare idle load balancing on nohz_full CPUs
  nohz: Move idle balancer registration to the idle path
  sched/loadavg: Generalize "_idle" naming to "_nohz"
  sched/core: Drop the unused try_get_task_struct() helper function
  sched/fair: WARN() and refuse to set buddy when !se->on_rq
  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
  sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
  ...
2017-07-03 13:08:04 -07:00
Linus Torvalds c6b1e36c8f Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
Pull core block/IO updates from Jens Axboe:
 "This is the main pull request for the block layer for 4.13. Not a huge
  round in terms of features, but there's a lot of churn related to some
  core cleanups.

  Note this depends on the UUID tree pull request, that Christoph
  already sent out.

  This pull request contains:

   - A series from Christoph, unifying the error/stats codes in the
     block layer. We now use blk_status_t everywhere, instead of using
     different schemes for different places.

   - Also from Christoph, some cleanups around request allocation and IO
     scheduler interactions in blk-mq.

   - And yet another series from Christoph, cleaning up how we handle
     and do bounce buffering in the block layer.

   - A blk-mq debugfs series from Bart, further improving on the support
     we have for exporting internal information to aid debugging IO
     hangs or stalls.

   - Also from Bart, a series that cleans up the request initialization
     differences across types of devices.

   - A series from Goldwyn Rodrigues, allowing the block layer to return
     failure if we will block and the user asked for non-blocking.

   - Patch from Hannes for supporting setting loop devices block size to
     that of the underlying device.

   - Two series of patches from Javier, fixing various issues with
     lightnvm, particular around pblk.

   - A series from me, adding support for write hints. This comes with
     NVMe support as well, so applications can help guide data placement
     on flash to improve performance, latencies, and write
     amplification.

   - A series from Ming, improving and hardening blk-mq support for
     stopping/starting and quiescing hardware queues.

   - Two pull requests for NVMe updates. Nothing major on the feature
     side, but lots of cleanups and bug fixes. From the usual crew.

   - A series from Neil Brown, greatly improving the bio rescue set
     support. Most notably, this kills the bio rescue work queues, if we
     don't really need them.

   - Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
  lightnvm: pblk: set line bitmap check under debug
  lightnvm: pblk: verify that cache read is still valid
  lightnvm: pblk: add initialization check
  lightnvm: pblk: remove target using async. I/Os
  lightnvm: pblk: use vmalloc for GC data buffer
  lightnvm: pblk: use right metadata buffer for recovery
  lightnvm: pblk: schedule if data is not ready
  lightnvm: pblk: remove unused return variable
  lightnvm: pblk: fix double-free on pblk init
  lightnvm: pblk: fix bad le64 assignations
  nvme: Makefile: remove dead build rule
  blk-mq: map all HWQ also in hyperthreaded system
  nvmet-rdma: register ib_client to not deadlock in device removal
  nvme_fc: fix error recovery on link down.
  nvmet_fc: fix crashes on bad opcodes
  nvme_fc: Fix crash when nvme controller connection fails.
  nvme_fc: replace ioabort msleep loop with completion
  nvme_fc: fix double calls to nvme_cleanup_cmd()
  nvme-fabrics: verify that a controller returns the correct NQN
  nvme: simplify nvme_dev_attrs_are_visible
  ...
2017-07-03 10:34:51 -07:00
Linus Torvalds 81e3e04489 UUID/GUID updates:
- introduce the new uuid_t/guid_t types that are going to replace
    the somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)
  - consolidated generic uuid/guid helper functions lifted from XFS
    and libnvdimm (Amir and me)
  - conversions to the new types and helpers (Amir, Andy and me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc
 MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM
 2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe
 R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08
 uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa
 MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA
 ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw
 QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA
 xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt
 cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX
 OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ
 Sxg=
 =J/4P
 -----END PGP SIGNATURE-----

Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid

Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Al Viro 30138384da compat_hdio_ioctl: get rid of set_fs()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 18:17:52 -04:00
Al Viro 229b53c9bf take floppy compat ioctls to sodding floppy.c
all other drivers recognizing those ioctls are very much *not*
biarch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-29 18:17:39 -04:00
Max Gurtovoy fe631457ff blk-mq: map all HWQ also in hyperthreaded system
This patch performs sequential mapping between CPUs and queues.
In case the system has more CPUs than HWQs then there are still
CPUs to map to HWQs. In hyperthreaded system, map the unmapped CPUs
and their siblings to the same HWQ.
This actually fixes a bug that found unmapped HWQs in a system with
2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
running NVMEoF (opens upto maximum of 64 HWQs).

Performance results running fio (72 jobs, 128 iodepth)
using null_blk (w/w.o patch):

bs      IOPS(read submit_queues=72)   IOPS(write submit_queues=72)   IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
-----  ----------------------------  ------------------------------ ---------------------------- -----------------------------
512    4890.4K/4723.5K                 4524.7K/4324.2K                   4280.2K/4264.3K               3902.4K/3909.5K
1k     4910.1K/4715.2K                 4535.8K/4309.6K                   4296.7K/4269.1K               3906.8K/3914.9K
2k     4906.3K/4739.7K                 4526.7K/4330.6K                   4301.1K/4262.4K               3890.8K/3900.1K
4k     4918.6K/4730.7K                 4556.1K/4343.6K                   4297.6K/4264.5K               3886.9K/3893.9K
8k     4906.4K/4748.9K                 4550.9K/4346.7K                   4283.2K/4268.8K               3863.4K/3858.2K
16k    4903.8K/4782.6K                 4501.5K/4233.9K                   4292.3K/4282.3K               3773.1K/3773.5K
32k    4885.8K/4782.4K                 4365.9K/4184.2K                   4307.5K/4289.4K               3780.3K/3687.3K
64k    4822.5K/4762.7K                 2752.8K/2675.1K                   4308.8K/4312.3K               2651.5K/2655.7K
128k   2388.5K/2313.8K                 1391.9K/1375.7K                   2142.8K/2152.2K               1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-29 08:40:11 -06:00
Jens Axboe 9ae3b3f52c block: provide bio_uninit() free freeing integrity/task associations
Wen reports significant memory leaks with DIF and O_DIRECT:

"With nvme devive + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim:      6752128 kB
SUnreclaim:      6874880 kB
SUnreclaim:      7238080 kB
....
SUnreclaim:     22307264 kB
SUnreclaim:     22485888 kB
SUnreclaim:     22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.

Fixes: 542ff7bf18 ("block: new direct I/O implementation")
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 15:30:13 -06:00
Christoph Hellwig 4b855ad371 blk-mq: Create hctx for each present CPU
Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations.  Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-28 23:00:07 +02:00
Christoph Hellwig 5f042e7cbd blk-mq: Include all present CPUs in the default queue mapping
This way we get a nice distribution independent of the current cpu
online / offline state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-28 23:00:06 +02:00
Paolo Valente 13c931bd9a block, bfq: update wr_busy_queues if needed on a queue split
This commit fixes a bug triggered by a non-trivial sequence of
events. These events are briefly described in the next two
paragraphs. The impatiens, or those who are familiar with queue
merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non rotational, then random requests must be also small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to non
shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update
was missing. This commit adds the missing update.

Reported-by: Luca Miccio <lucmiccio@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:30:47 -06:00
Christoph Hellwig 8fc450443e block: don't set bounce limit in blk_init_queue
Instead move it to the callers.  Those that either don't use bio_data() or
page_address() or are specific to architectures that do not support highmem
are skipped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 0bf6595ec8 block: don't set bounce limit in blk_init_allocated_queue
And just move it into scsi_transport_sas which needs it due to low-level
drivers directly derferencing bio_data, and into blk_init_queue_node,
which will need a further push into the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 46685d1a95 blk-mq: don't bounce by default
For historical reasons we default to bouncing highmem pages for all block
queues.  But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 0b0bcacc3b block: don't bother with bounce limits for make_request drivers
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 1c4bc3ab9a block: remove the queue_bounce_pfn helper
Only used inside the bounce code, and opencoding it makes it more obvious
what is going on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig 3bce016a4c block: move bounce declarations to block/blk.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:45 -06:00
Christoph Hellwig caa4b02476 blk-map: call blk_queue_bounce from blk_rq_append_bio
This makes moves the knowledge about bouncing out of the callers into the
block core (just like we do for the normal I/O path), and allows to unexport
blk_queue_bounce.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:13:21 -06:00
Jens Axboe f793dfd3f3 blk-mq: expose write hints through debugfs
Useful to verify that things are working the way they should.
Reading the file will return number of kb written with each
write hint. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->write_hints[] if they handle a given
write hint.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:31 -06:00
Jens Axboe cb6934f8ea block: add support for write hints in a bio
No functional changes in this patch, we just use up some holes
in the bio and request structures to define a write hint that
we psas down the stack.

Ensure that we don't merge requests that have different life time
hints assigned to them, and that we inherit the write hint when
cloning a bio.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:27 -06:00
Ingo Molnar 1bc3cd4dfa Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-24 08:57:20 +02:00
Jens Axboe f95a0d6a95 Merge commit '8e8320c9315c' into for-4.13/block
Pull in the fix for shared tags, as it conflicts with the pending
changes in for-4.13/block. We already pulled in v4.12-rc5 to solve
other conflicts or get fixes that went into 4.12, so not a lot
of changes in this merge.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22 21:55:24 -06:00
weiping a9590fe148 blk-mq: remove double set queue_num
hwctx's queue_num has been set prior call blk_mq_init_hctx, so no need
set it again.

Signed-off-by: weiping <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-22 09:18:25 -06:00
Bart Van Assche 852ec80983 blk-mq: Make it safe to quiesce and unquiesce from an interrupt handler
Since blk_mq_quiesce_queue_nowait() can be called from interrupt
context, make this safe. Since this function is not in the hot
path, uninline it.

Fixes: commit f4560ffe8c ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 12:01:15 -06:00
Bart Van Assche 34bd9c1c4f block: Fix off-by-one errors in blk_status_to_errno() and print_req_error()
This was detected by the smatch static analyzer.

Fixes: commit 2a842acab1 ("block: introduce new block status code type")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 12:01:14 -06:00
Bart Van Assche e0fc443a86 block: Declare local symbols static
Avoid that building with W=1 causes the compiler to complain that
a declaration for bounce_bio_set and bounce_bio_split is missing.

References: commit a8821f3f32 ("block: Improvements to bounce-buffer handling")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 12:01:13 -06:00
Bart Van Assche e29387ebd8 block: Add fallthrough markers to switch statements
This patch suppresses gcc 7 warnings about falling through in switch
statements when building with W=1. From the gcc documentation: The
-Wimplicit-fallthrough=3 warning is enabled by -Wextra. See also
https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Warning-Options.html.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 11:46:07 -06:00
Jens Axboe 8e8320c931 blk-mq: fix performance regression with shared tags
If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.

Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues, if we don't have to.

Max reports that this restores performance. Before this patch, 4K
IOPS was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.

Fixes: 6d8c6c0f97 ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-21 10:17:49 -06:00
Bart Van Assche 5435c023b9 blk-mq: Warn when attempting to run a hardware queue that is not mapped
A queue must be frozen while the mapped state of a hardware queue
is changed. Additionally, any change of the mapped state is
followed by a call to blk_mq_map_swqueue() (see also
blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
Since blk_mq_map_swqueue() does not map any unmapped hardware
queue onto any software queue, no attempt will be made to run
an unmapped hardware queue. Hence issue a warning upon attempts
to run an unmapped hardware queue.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche edf8ff5588 block: Constify disk_type
The variable 'disk_type' is never modified so constify it.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 7b6078146c blk-mq: Document locking assumptions
Document the locking assumptions in functions that modify
blk_mq_ctx.rq_list to make it easier for humans to verify
this code.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 332ebbf7f9 block: Document what queue type each function is intended for
Some functions in block/blk-core.c must only be used on blk-sq queues
while others are safe to use against any queue type. Document which
functions are intended for blk-sq queues and issue a warning if the
blk-sq API is misused. This does not only help block driver authors
but will also make it easier to remove the blk-sq code once that code
is declared obsolete.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 2fff8a924d block: Check locking assumptions at runtime
Instead of documenting the locking assumptions of most block layer
functions as a comment, use lockdep_assert_held() to verify locking
assumptions at runtime.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche c3a148d20a blk-mq: Initialize .rq_flags in blk_mq_rq_ctx_init()
Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
is called after a value has been assigned to .rq_flags and .rq_flags
is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
Moving the initialization of .rq_flags is fine because all changes
and tests of .rq_flags occur between blk_get_request() and finishing
a request.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche c8d9cf22cf block: Change argument type of scsi_req_init()
Since scsi_req_init() works on a struct scsi_request, change the
argument type into struct scsi_request *.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche ca18d6f769 block: Make most scsi_req_init() calls implicit
Instead of explicitly calling scsi_req_init() after blk_get_request(),
call that function from inside blk_get_request(). Add an
.initialize_rq_fn() callback function to the block drivers that need
it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
because it is too small to keep it as a separate function. Keep the
scsi_req_init() call in ide_prep_sense() because it follows a
blk_rq_init() call.

References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche d280bab305 block: Introduce request_queue.initialize_rq_fn()
Several block drivers need to initialize the driver-private request
data after having called blk_get_request() and before .prep_rq_fn()
is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid that
that initialization code has to be repeated after every
blk_get_request() call by adding new callback functions to struct
request_queue and to struct blk_mq_ops.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche cd6ce1482f block: Make request operation type argument declarations consistent
Instead of declaring the second argument of blk_*_get_request()
as int and passing it to functions that expect an unsigned int,
declare that second argument as unsigned int. Also because of
consistency, rename that second argument from 'rw' into 'op'.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Bart Van Assche 0731967877 blk-mq: Reduce blk_mq_hw_ctx size
Since the srcu structure is rather large (184 bytes on an x86-64
system with kernel debugging disabled), only allocate it if needed.

Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 19:27:14 -06:00
Goldwyn Rodrigues 03a07c92a9 block: return on congested block device
A new bio operation flag REQ_NOWAIT is introduced to identify bio's
orignating from iocb with IOCB_NOWAIT. This flag indicates
to return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
currently are not supported because it may block for housekeeping.
For example, an md can have a part of the device suspended.
For this reason, only request based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flags.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Ingo Molnar 2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Ming Lei 641a9ed60f Revert "blk-mq: don't use sync workqueue flushing from drivers"
This patch reverts commit 2719aa217e0d02(blk-mq: don't use
sync workqueue flushing from drivers) because only
blk_mq_quiesce_queue() need the sync flush, and now
we don't need to stop queue any more, so revert it.

Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue().

Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00
Ming Lei 39a70c76b8 blk-mq: clarify dispatch may not be drained/blocked by stopping queue
BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths,
we can't guarantee that dispatching won't happen after returning
from the APIs of stopping queue.

So clarify the fact and avoid potential misuse.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:24:48 -06:00