remarkable-linux/block
Mike Snitzer fb1840e257 dm: fix excessive dm-mq context switching
[ Upstream commit 6acfe68bac ]

Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
than if an underlying null_blk device were used directly.  One of the
reasons for this drop in performance is that blk_insert_clone_request()
was calling blk_mq_insert_request() with @async=true.  This forced the
use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
which ushered in ping-ponging between process context (fio in this case)
and kblockd's kworker to submit the cloned request.  The ftrace
function_graph tracer showed:

  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...
  kworker-2013  =>   fio-12190
  fio-12190    =>  kworker-2013
  ...

Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
_not_ use kblockd to submit the cloned requests isn't enough to
eliminate the observed context switches.

In addition to this dm-mq specific blk-core fix, there are 2 DM core
fixes to dm-mq that (when paired with the blk-core fix) completely
eliminate the observed context switching:

1)  don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by desire to reduce overhead of dm-mq, punting to kblockd
    just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this).  So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e !

2)  use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request).  Using blk_mq_complete_request() for
    blk-mq requests is important for performance.  It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

dm-mq fix #2 is _much_ more important than #1 for eliminating the
context switches.
Before: cpu          : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
After:  cpu          : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

With these changes multithreaded async read IOPs improved from ~950K
to ~1350K for this dm-mq stacked on null_blk test-case.  The raw read
IOPs of the underlying null_blk device for the same workload is ~1950K.

Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
Cc: stable@vger.kernel.org # 4.1+
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-04-18 08:50:38 -04:00
..
partitions Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block 2015-02-12 14:13:23 -08:00
bio-integrity.c bio integrity: do not assume bio_integrity_pool exists if bioset exists 2015-08-10 12:21:52 -07:00
bio.c bio: return EINTR if copying to user space got interrupted 2016-03-04 10:25:43 -05:00
blk-cgroup.c blkcg: fix gendisk reference leak in blkg_conf_prep() 2015-08-10 12:21:56 -07:00
blk-cgroup.h blkcg: remove blkcg->id 2014-09-08 09:55:37 -06:00
blk-core.c dm: fix excessive dm-mq context switching 2016-04-18 08:50:38 -04:00
blk-exec.c blk-mq: avoid infinite recursion with the FUA flag 2014-09-22 11:55:19 -06:00
blk-flush.c blk-mq: support per-distpatch_queue flush machinery 2014-09-25 15:22:45 -06:00
blk-integrity.c block: Don't merge requests if integrity flags differ 2014-09-27 09:14:57 -06:00
blk-ioc.c block: Substitute rcu_access_pointer() for rcu_dereference_raw() 2014-02-18 12:21:26 -08:00
blk-iopoll.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-03 12:57:53 -07:00
blk-lib.c block: Quiesce zeroout wrapper 2015-02-05 10:14:54 -07:00
blk-map.c blk_rq_map_user(): use import_single_range() 2015-04-11 22:27:13 -04:00
blk-merge.c Fix bug in blk_rq_merge_ok 2015-03-20 08:50:41 -06:00
blk-mq-cpu.c blk-mq: add file comments and update copyright notices 2014-05-28 10:15:41 -06:00
blk-mq-cpumap.c blk-mq: Use all available hardware queues 2014-12-09 09:08:21 -07:00
blk-mq-sysfs.c blk-mq: fix buffer overflow when reading sysfs file of 'pending' 2015-09-29 19:25:56 +02:00
blk-mq-tag.c blkmq: Fix NULL pointer deref when all reserved tags in 2015-03-18 17:06:18 -06:00
blk-mq-tag.h blk-mq: add tag allocation policy 2015-01-23 14:18:00 -07:00
blk-mq.c blk-mq: set default timeout as 30 seconds 2015-08-10 12:21:57 -07:00
blk-mq.h blk-mq: release mq's kobjects in blk_release_queue() 2015-01-29 08:30:51 -08:00
blk-settings.c sd: Fix maximum I/O size for BLOCK_PC requests 2015-09-13 09:07:45 -07:00
blk-softirq.c block: fix regression with block enabled tagging 2014-04-09 21:54:06 -06:00
blk-sysfs.c block: destroy bdi before blockdev is unregistered. 2015-04-27 10:27:20 -06:00
blk-tag.c block: support different tag allocation policy 2015-01-23 14:15:46 -07:00
blk-throttle.c blk-throttle: check stats_cpu before reading it from sysfs 2015-02-20 22:11:58 -08:00
blk-timeout.c blk-mq: Allow requests to never expire 2015-01-08 08:59:01 -07:00
blk.h blk-mq: support per-distpatch_queue flush machinery 2014-09-25 15:22:45 -06:00
bounce.c block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE 2015-04-27 09:24:07 -06:00
bsg-lib.c bsg: Remove unused function bsg_goose_queue() 2012-12-06 14:33:02 +01:00
bsg.c block: Simplify bsg complete all 2015-02-04 09:57:52 -07:00
cfq-iosched.c cfq-iosched: handle failure of cfq group allocation 2015-02-09 10:22:39 -07:00
cmdline-parser.c block: remove unrelated header files and export symbol 2014-01-21 20:18:26 -08:00
compat_ioctl.c block, bdi: an active gendisk always has a request_queue associated with it 2014-09-08 10:00:35 -06:00
deadline-iosched.c block: Stop abusing csd.list for fifo_time 2014-02-24 14:46:32 -08:00
elevator.c elevator: fix double release of elevator module 2015-04-23 10:47:44 -06:00
genhd.c block: fix ext_dev_lock lockdep report 2015-06-11 09:01:40 -06:00
ioctl.c block: Add discard flag to blkdev_issue_zeroout() function 2015-01-21 10:41:46 -07:00
ioprio.c block: Fix computation of merged request priority 2014-10-31 08:30:43 -06:00
Kconfig block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00
Kconfig.iosched blkcg: make CONFIG_BLK_CGROUP bool 2012-03-06 21:27:21 +01:00
Makefile block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00
noop-iosched.c elevator: Fix a race in elevator switching 2013-07-03 13:25:24 +02:00
partition-generic.c block: Fix dev_t minor allocation lifetime 2014-09-03 15:01:02 -06:00
scsi_ioctl.c sg_io(): use import_iovec() 2015-04-11 22:27:13 -04:00
t10-pi.c block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00