alistair23-linux

redonkable

Linus Torvalds 3510ca20ec Minor page waitqueue cleanups

Tim Chen and Kan Liang have been battling a customer load that shows extremely long page wakeup lists. The cause seems to be constant NUMA migration of a hot page that is shared across a lot of threads, but the actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so that we at least don't get long uninterruptible cases where we traverse and wake up thousands of processes and get nasty latency spikes. That is likely 4.14 material, but we're still discussing the page waitqueue specific parts of it.

In the meantime, I've tried to look at making the page wait queues less expensive, and failing miserably. If you have thousands of threads waiting for the same page, it will be painful. We'll need to try to figure out the NUMA balancing issue some day, in addition to avoiding the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least fix up some of the braindamage in the current situation. In particular:

 (a) we don't want to continue walking the page wait list if the bit we're waiting for already got set again (which seems to be one of the patterns of the bad load). That makes no progress and just causes pointless cache pollution chasing the pointers.

 (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway.

Also add a comment about the layout of 'struct wait_page_key' - there is an external user of it in the cachefiles code that means that it has to match the layout of 'struct wait_bit_key' in the two first members. It so happens to match, because 'struct page *' and 'unsigned long *' end up having the same values simply because the page flags are the first member in struct page.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2017-08-27 13:55:12 -07:00
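
Points (a) and (b) of the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: struct waiter, struct page_model, enqueue_tail() and wake_waiters() are hypothetical names invented for the example, the queue is a plain array instead of a linked wait list, and a woken locking waiter is assumed to take the lock immediately, which is a simplification of what really happens after a wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct waiter {
    int  id;
    bool exclusive; /* true: wants to take the lock bit; false: only waits for it to clear */
    bool woken;
};

struct page_model {
    bool locked;    /* stands in for the page flag bit being waited on */
};

/*
 * Point (b): every waiter is appended at the tail, whether or not it
 * wants the lock, so wakeups follow arrival order.
 * Returns false if the queue is full.
 */
bool enqueue_tail(struct waiter *queue, size_t *n, size_t cap,
                  int id, bool exclusive)
{
    assert(queue != NULL && n != NULL);
    if (*n >= cap)
        return false;
    queue[*n] = (struct waiter){ .id = id, .exclusive = exclusive, .woken = false };
    (*n)++;
    return true;
}

/*
 * Point (a): walk the queue in order and wake waiters, but stop as soon
 * as the bit is set again, because waking anyone further down the list
 * would make no progress.
 * Assumption: a woken exclusive waiter takes the lock right away.
 * Returns the number of entries visited.
 */
size_t wake_waiters(struct page_model *page, struct waiter *queue, size_t n)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        if (page->locked)
            break;                  /* bit set again: stop the walk */
        visited++;
        queue[i].woken = true;
        if (queue[i].exclusive)
            page->locked = true;    /* modelled lock acquisition */
    }
    return visited;
}
```

With an arrival order of reader, writer, reader, reader on an unlocked page, the walk in this model wakes the first reader and the writer and then stops, leaving the two later readers untouched until the writer releases the page.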
Makefile	Merge branch 'WIP.sched/core' into sched/core	2017-06-20 12:28:21 +02:00
autogroup.c	sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]	2017-02-08 09:01:11 +01:00
autogroup.h	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/autogroup.h>	2017-03-02 08:42:28 +01:00
clock.c	sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable()	2017-05-24 09:10:00 +02:00
completion.c	sched/wait: Rename wait_queue_t => wait_queue_entry_t	2017-06-20 12:18:27 +02:00
core.c	sched/core: Fix some documentation build warnings	2017-07-25 11:17:02 +02:00
cpuacct.c	sched/cputime: Convert kcpustat to nsecs	2017-02-01 09:13:47 +01:00
cpuacct.h	sched/cpuacct: Simplify the cpuacct code	2016-03-21 11:00:28 +01:00
cpudeadline.c	sched/core: Remove the tsk_cpus_allowed() wrapper	2017-03-02 08:42:24 +01:00
cpudeadline.h	sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear()	2016-09-05 13:29:43 +02:00
cpufreq.c	cpufreq / sched: Pass flags to cpufreq_update_util()	2016-08-16 22:14:55 +02:00
cpufreq_schedutil.c	cpufreq: schedutil: Fix sugov_start() versus sugov_update_shared() race	2017-07-12 14:47:48 +02:00
cpupri.c	sched/core: Remove the tsk_cpus_allowed() wrapper	2017-03-02 08:42:24 +01:00
cpupri.h	sched/cpupri: Remove unnecessary definitions in cpupri.h	2014-11-16 10:58:59 +01:00
cputime.c	sched/cputime: Don't use smp_processor_id() in preemptible context	2017-07-14 10:27:15 +02:00
deadline.c	sched/deadline: Fix confusing comments about selection of top pi-waiter	2017-07-14 10:35:16 +02:00
debug.c	sched/debug: Expose the number of RT/DL tasks that can migrate	2017-06-30 09:32:07 +02:00
fair.c	sched/fair: Fix load_balance() affinity redo path	2017-07-05 16:28:48 +02:00
features.h	sched/core: Implement new approach to scale select_idle_cpu()	2017-06-08 10:25:17 +02:00
idle.c	sched/idle: Add deferrable vmstat_updater back	2017-06-08 10:32:09 +02:00
idle_task.c	sched/core: Add wrappers for lockdep_(un)pin_lock()	2017-01-14 11:29:30 +01:00
loadavg.c	sched/loadavg: Generalize "_idle" naming to "_nohz"	2017-06-22 11:30:01 +02:00
rt.c	sched/rt: Move RT related code from sched/core.c to sched/rt.c	2017-06-23 10:46:45 +02:00
sched-pelt.h	sched/fair: Move the PELT constants into a generated header	2017-04-14 10:26:37 +02:00
sched.h	sched/rt: Move RT related code from sched/core.c to sched/rt.c	2017-06-23 10:46:45 +02:00
stats.c	sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00
stats.h	sched/headers: Move cputime functionality from <linux/sched.h> and <linux/cputime.h> into <linux/sched/cputime.h>	2017-03-03 01:45:22 +01:00
stop_task.c	sched/core: Add wrappers for lockdep_(un)pin_lock()	2017-01-14 11:29:30 +01:00
swait.c	sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>	2017-03-02 08:42:32 +01:00
topology.c	sched/topology: Rename sched_group_cpus()	2017-05-15 10:15:34 +02:00
wait.c	Minor page waitqueue cleanups	2017-08-27 13:55:12 -07:00
wait_bit.c	sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming	2017-06-20 12:19:14 +02:00