1
0
Fork 0
alistair23-linux/kernel/cgroup
Daniel Jordan 6eab3f646b cpuset: fix race between hotplug work and later CPU offline
commit 406100f3da upstream.

One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 0000607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb9 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:51:36 +01:00
..
Makefile cgroup: cgroup v2 freezer 2019-04-19 11:26:48 -07:00
cgroup-internal.h cgroup: make TRACE_CGROUP_PATH irq-safe 2019-04-19 11:26:49 -07:00
cgroup-v1.c cgroup1: don't call release_agent when it is "" 2020-04-01 11:01:51 +02:00
cgroup.c cgroup: fix cgroup_sk_alloc() for sk_clone_lock() 2020-07-22 09:32:49 +02:00
cpuset.c cpuset: fix race between hotplug work and later CPU offline 2020-12-30 11:51:36 +01:00
debug.c kernel: cgroup: fix misuse of %x 2019-05-06 08:47:48 -07:00
freezer.c cgroup: freezer: don't change task and cgroups status unnecessarily 2019-12-31 16:45:06 +01:00
legacy_freezer.c cgroup: rename freezer.c into legacy_freezer.c 2019-04-19 11:26:48 -07:00
namespace.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
pids.c cgroup: pids: use atomic64_t for pids->limit 2019-12-17 19:56:15 +01:00
rdma.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 451 2019-06-19 17:09:08 +02:00
rstat.c Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window" 2020-06-07 13:18:46 +02:00