remarkable-linux/block
Tahsin Erdogan 72ef799b3f block: do not merge requests without consulting with io scheduler
Before merging a bio into an existing request, io scheduler is called to
get its approval first. However, the requests that come from a plug
flush may get merged by block layer without consulting with io
scheduler.

In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low weight cgroup's request, high weight cgroup
now will depend on low weight cgroup to get scheduled. If high weigt cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.

Following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights but g2's requests are adjacent to g1's
requests so they are subject to merging. Due to these merges, g2 gets
poor disk time allocation.

cat > cfq-merge-repro.sh << "EOF"
#!/bin/bash
set -e

IO_ROOT=/mnt-cgroup/io

mkdir -p $IO_ROOT

if ! mount | grep -qw $IO_ROOT; then
  mount -t cgroup none -oblkio $IO_ROOT
fi

cd $IO_ROOT

for i in g1 g2 g3; do
  if [ -d $i ]; then
    rmdir $i
  fi
done

mkdir g1 && echo 10 > g1/blkio.weight
mkdir g2 && echo 495 > g2/blkio.weight
mkdir g3 && echo 495 > g3/blkio.weight

RUNTIME=10

(echo $BASHPID > g1/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=0k &> /dev/null)&

(echo $BASHPID > g2/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=64k &> /dev/null)&

(echo $BASHPID > g3/cgroup.procs &&
 fio --readonly --name name1 --filename /dev/sdb \
     --rw read --size 64k --bs 64k --time_based \
     --runtime=$RUNTIME --offset=256k &> /dev/null)&

sleep $((RUNTIME+1))

for i in g1 g2 g3; do
  echo ---- $i ----
  cat $i/blkio.time
done

EOF
# ./cfq-merge-repro.sh
---- g1 ----
8:16 162
---- g2 ----
8:16 165
---- g3 ----
8:16 686

After applying the patch:

# ./cfq-merge-repro.sh
---- g1 ----
8:16 90
---- g2 ----
8:16 445
---- g3 ----
8:16 471

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:35:12 -06:00
..
partitions block: atari: Return early for unsupported sector size 2016-07-13 09:31:44 -07:00
badblocks.c block, badblocks: introduce devm_init_badblocks 2016-01-09 08:39:04 -08:00
bio-integrity.c block/bio-integrity.c: Add #include "blk.h" 2016-06-14 09:09:17 -06:00
bio.c block, fs, mm, drivers: use bio set/get op accessors 2016-06-07 13:41:38 -06:00
blk-cgroup.c block/blk-cgroup.c: Declare local symbols static 2016-06-14 09:09:33 -06:00
blk-core.c cfq-iosched: temporarily boost queue priority for idle classes 2016-06-09 16:15:01 -06:00
blk-exec.c block: Fix spelling in a source code comment 2016-07-20 21:28:22 -06:00
blk-flush.c block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH 2016-06-07 13:41:38 -06:00
blk-integrity.c block, libnvdimm, nvme: provide a built-in blk_integrity nop profile 2015-10-21 14:43:45 -06:00
blk-ioc.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-lib.c block discard: use bio set op accessor 2016-06-07 13:41:38 -06:00
blk-map.c block, fs, mm, drivers: use bio set/get op accessors 2016-06-07 13:41:38 -06:00
blk-merge.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
blk-mq-cpu.c
blk-mq-cpumap.c blk-mq: Avoid memoryless numa node encoded in hctx numa_node 2015-12-03 09:56:27 -07:00
blk-mq-sysfs.c blk-mq: Use proper cpumask iterator 2016-03-20 09:34:02 -06:00
blk-mq-tag.c blk-mq: Make blk_mq_all_tag_busy_iter static 2016-04-12 15:07:36 -06:00
blk-mq-tag.h blk-mq: factor out a helper to iterate all tags for a request_queue 2015-10-01 10:10:57 +02:00
blk-mq.c blk-mq: actually hook up defer list when running requests 2016-06-09 09:55:15 -06:00
blk-mq.h blk-mq: dynamic h/w context count 2016-02-09 12:42:17 -07:00
blk-settings.c block: kill off q->flush_flags 2016-04-13 13:33:19 -06:00
blk-softirq.c
blk-sysfs.c block: expose QUEUE_FLAG_DAX in sysfs 2016-07-20 21:01:08 -06:00
blk-tag.c
blk-throttle.c blk-throttle: don't parse cgroup path if trace isn't enabled 2016-05-10 08:41:37 -06:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk.h block: defer timeouts to a workqueue 2015-12-22 09:38:16 -07:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c
bsg.c
cfq-iosched.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
cmdline-parser.c
compat_ioctl.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
deadline-iosched.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
elevator.c block: do not merge requests without consulting with io scheduler 2016-07-20 21:35:12 -06:00
genhd.c Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block 2016-01-19 15:03:34 -08:00
ioctl.c DAX error handling for 4.7 2016-05-26 19:34:26 -07:00
ioprio.c pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode 2015-11-06 17:50:42 -08:00
Kconfig block: disable block device DAX by default 2016-02-27 10:28:52 -08:00
Kconfig.iosched
Makefile Initial roundup of 4.5 merge window patches 2016-01-23 18:45:06 -08:00
noop-iosched.c elevator: use list_{first,prev,next}_entry 2015-11-16 15:21:48 -07:00
partition-generic.c block/partition-generic.c: Remove a set-but-not-used variable 2016-06-14 09:09:15 -06:00
scsi_ioctl.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
t10-pi.c block: Consolidate static integrity profile properties 2015-10-21 14:42:38 -06:00