remarkable-linux/kernel
Preeti U Murthy d4573c3e1c sched: Improve load balancing in the presence of idle CPUs
When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.

Further investigation showed that this was a consequence of the
following:

 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
    load balancing [Given the experiment, upto 6 CPUs per core could be
    potentially idle in this domain.]

 2. However the ILB CPU would call load_balance() on itself before
    initiating nohz idle load balancing.

 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
    tasks from its sibling cores to even out load.

 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
    load balancing

As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves.  This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.

Fix this, by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:36:09 +01:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 13:55:36 -08:00
configs
debug debug: prevent entering debug mode on panic/exception. 2015-02-19 12:39:03 -06:00
events perf: Fix context leak in put_event() 2015-03-13 10:02:18 +01:00
gcov kbuild,gcov: simplify kernel/gcov/Makefile more 2015-01-09 17:25:44 +01:00
irq genirq / PM: Add flag for shared NO_SUSPEND interrupt lines 2015-03-04 21:42:19 +01:00
livepatch livepatch: Fix subtle race with coming and going modules 2015-03-17 10:31:54 +01:00
locking locking/rtmutex: Set state back to running on error 2015-03-01 09:45:06 +01:00
power PM / sleep: Re-implement suspend-to-idle handling 2015-02-13 23:49:36 +01:00
printk console: Fix console name size mismatch 2015-03-07 03:39:55 +01:00
rcu Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-21 10:36:06 -08:00
sched sched: Improve load balancing in the presence of idle CPUs 2015-03-27 09:36:09 +01:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-21 11:05:22 -08:00
trace ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled 2015-03-09 10:55:34 -04:00
.gitignore
acct.c new fs_pin killing logics 2015-01-25 23:17:28 -05:00
async.c kernel/async.c: switch to pr_foo() 2014-10-09 22:26:04 -04:00
audit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-30 10:45:47 -08:00
audit.h audit: replace getname()/putname() hacks with reference counters 2015-01-23 00:23:58 -05:00
audit_tree.c fsnotify: unify inode and mount marks handling 2014-12-13 12:42:53 -08:00
audit_watch.c audit: invalid op= values for rules 2014-09-23 16:37:53 -04:00
auditfilter.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2015-02-11 20:07:47 -08:00
auditsc.c Merge branch 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 15:27:47 -08:00
backtracetest.c
bounds.c
capability.c
cgroup.c kernfs: remove KERNFS_STATIC_NAME 2015-02-13 21:21:36 -08:00
cgroup_freezer.c
compat.c all arches, signal: move restart_block to struct task_struct 2015-02-12 18:54:12 -08:00
configs.c
context_tracking.c sched: stop the unbound recursion in preempt_schedule_context() 2014-10-28 10:46:05 +01:00
cpu.c hotplugcpu: Avoid deadlocks by waking active_writer 2015-01-06 11:01:14 -08:00
cpu_pm.c
cpuset.c cpuset: Fix cpuset sched_relax_domain_level 2015-03-02 11:55:04 -05:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c oom, PM: make OOM detection in the freezer path raceless 2015-02-11 17:06:03 -08:00
extable.c ftrace/x86/extable: Add is_ftrace_trampoline() function 2014-11-19 15:25:26 -05:00
fork.c mm: do not use mm->nr_pmds on !MMU configurations 2015-02-12 18:54:10 -08:00
freezer.c freezer: remove obsolete comments in __thaw_task() 2014-10-21 23:44:20 +02:00
futex.c all arches, signal: move restart_block to struct task_struct 2015-02-12 18:54:12 -08:00
futex_compat.c
groups.c userns: Don't allow setgroups until a gid mapping has been setablished 2014-12-09 16:58:40 -06:00
hung_task.c
irq_work.c percpu: Convert remaining __get_cpu_var uses in 3.18-rcX 2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c kernel/kallsyms.c: use __seq_open_private() 2014-10-14 02:18:16 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/mcs: Better differentiate between MCS variants 2015-01-14 15:07:32 +01:00
Kconfig.preempt
kexec.c kexec: simplify conditional 2015-02-17 14:34:51 -08:00
kmod.c usermodehelper: kill the kmod_thread_locker logic 2014-12-10 17:41:17 -08:00
kprobes.c kprobes: makes kprobes/enabled works correctly for optimized kprobes. 2015-02-13 21:21:42 -08:00
ksysfs.c
kthread.c kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations") 2014-10-09 22:25:51 -04:00
latencytop.c
Makefile Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2015-02-11 20:25:11 -08:00
module-internal.h
module.c kasan, module, vmalloc: rework shadow allocation for modules 2015-03-12 18:46:08 -07:00
module_signing.c
notifier.c rcu: Make SRCU optional by using CONFIG_SRCU 2015-01-06 11:04:29 -08:00
nsproxy.c bury struct proc_ns in fs/proc 2014-12-04 14:34:54 -05:00
padata.c padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:38 -08:00
panic.c livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
params.c param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC 2015-01-20 11:38:31 +10:30
pid.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
profile.c profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:38 -08:00
ptrace.c ptrace: remove linux/compat.h inclusion under CONFIG_COMPAT 2015-02-17 14:34:51 -08:00
range.c kernel: avoid overflow in cmp_range 2015-01-17 10:02:23 +13:00
reboot.c kernel: add support for kernel restart handler call chain 2014-09-26 00:00:06 -07:00
relay.c
resource.c resources: Move struct resource_list_entry from ACPI into resource core 2015-02-05 15:09:25 +01:00
seccomp.c seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO 2015-02-17 14:34:55 -08:00
signal.c signal: use current->state helpers 2015-02-17 14:34:51 -08:00
smp.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
smpboot.c smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread() 2015-01-23 11:33:51 +01:00
smpboot.h
softirq.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 15:24:03 -08:00
stacktrace.c stacktrace: introduce snprint_stack_trace for buffer output 2014-12-13 12:42:48 -08:00
stop_machine.c
sys.c kernel/sys.c: fix UNAME26 for 4.0 2015-02-28 09:57:51 -08:00
sys_ni.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
sysctl.c mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? 2015-02-10 14:30:34 -08:00
sysctl_binary.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
system_certificates.S
system_keyring.c
task_work.c
taskstats.c netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
test_kprobes.c
torture.c torture: Address race in module cleanup 2014-09-16 13:41:06 -07:00
tracepoint.c
tsacct.c
uid16.c groups: Consolidate the setgroups permission checks 2014-12-05 17:19:27 -06:00
up.c
user-return-notifier.c
user.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
utsname.c copy address of proc_ns_ops into ns_common 2014-12-04 14:34:47 -05:00
utsname_sysctl.c
watchdog.c kernel/sched/clock.c: add another clock for use with the soft lockup watchdog 2015-02-12 18:54:13 -08:00
workqueue.c workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE 2015-03-05 08:04:13 -05:00
workqueue_internal.h