alistair23-linux/block
Tejun Heo 1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
..
partitions Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block 2015-02-12 14:13:23 -08:00
bio-integrity.c block: blk_flush_integrity() for bio-based drivers 2015-10-21 14:43:44 -06:00
bio.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-cgroup.c cgroup: fix handling of multi-destination migration from subtree_control enabling 2015-12-03 10:18:21 -05:00
blk-core.c block: fix blk-core.c kernel-doc warning 2015-11-11 09:36:57 -07:00
blk-exec.c block: move PM request support to IDE 2015-05-05 13:40:42 -06:00
blk-flush.c blk-mq: fix race between timeout and freeing request 2015-08-15 09:45:21 -06:00
blk-integrity.c block, libnvdimm, nvme: provide a built-in blk_integrity nop profile 2015-10-21 14:43:45 -06:00
blk-ioc.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-iopoll.c
blk-lib.c block: re-add discard_granularity and alignment checks 2015-10-28 09:12:58 +09:00
blk-map.c block: Copy a user iovec if it includes gaps 2015-09-11 09:03:50 -06:00
blk-merge.c block: avoid to merge splitted bio 2015-10-21 15:00:51 -06:00
blk-mq-cpu.c
blk-mq-cpumap.c blk-mq: avoid inserting requests before establishing new mapping 2015-09-29 11:32:50 -06:00
blk-mq-sysfs.c block: add block polling support 2015-11-07 10:40:47 -07:00
blk-mq-tag.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
blk-mq-tag.h blk-mq: factor out a helper to iterate all tags for a request_queue 2015-10-01 10:10:57 +02:00
blk-mq.c blk-mq: mark __blk_mq_complete_request() static 2015-11-11 09:36:56 -07:00
blk-mq.h blk-mq: mark __blk_mq_complete_request() static 2015-11-11 09:36:56 -07:00
blk-settings.c Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block 2015-09-02 13:10:25 -07:00
blk-softirq.c
blk-sysfs.c block: add block polling support 2015-11-07 10:40:47 -07:00
blk-tag.c block: support different tag allocation policy 2015-01-23 14:15:46 -07:00
blk-throttle.c cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() 2015-09-18 11:56:28 -04:00
blk-timeout.c blk-mq: Allow requests to never expire 2015-01-08 08:59:01 -07:00
blk.h Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block 2015-11-04 20:51:48 -08:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c
bsg.c block: Simplify bsg complete all 2015-02-04 09:57:52 -07:00
cfq-iosched.c cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() 2015-09-18 11:56:28 -04:00
cmdline-parser.c
compat_ioctl.c block, bdi: an active gendisk always has a request_queue associated with it 2014-09-08 10:00:35 -06:00
deadline-iosched.c
elevator.c block: check bio_mergeable() early before merging 2015-10-21 15:00:54 -06:00
genhd.c block: Inline blk_integrity in struct gendisk 2015-10-21 14:42:42 -06:00
ioctl.c block: add an API for Persistent Reservations 2015-10-21 14:46:56 -06:00
ioprio.c pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode 2015-11-06 17:50:42 -08:00
Kconfig block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00
Kconfig.iosched
Makefile block: Add T10 Protection Information functions 2014-09-27 09:14:59 -06:00
noop-iosched.c
partition-generic.c block: Inline blk_integrity in struct gendisk 2015-10-21 14:42:42 -06:00
scsi_ioctl.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
t10-pi.c block: Consolidate static integrity profile properties 2015-10-21 14:42:38 -06:00