354 Commits (31da067023dd0e35c5ec5556f0be7a31e5588277)

Author SHA1 Message Date
Paul E. McKenney b1a2d79fe7 rcu: Make NOCB CPUs migrate CBs directly from outgoing CPU
RCU's CPU-hotplug callback-migration code first moves the outgoing
CPU's callbacks to ->orphan_done and ->orphan_pend, and only then
moves them to the NOCB callback list.  This commit avoids the
extra step (and simplifies the code) by moving the callbacks directly
from the outgoing CPU's callback list to the NOCB callback list.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-25 13:04:47 -07:00
Paul E. McKenney 8be6e1b15c rcu: Use timer as backstop for NOCB deferred wakeups
The handling of RCU's no-CBs CPUs has a maintenance headache, namely
that if call_rcu() is invoked with interrupts disabled, the rcuo kthread
wakeup must be deferred to a point where we can be sure that scheduler
locks are not held.  Of course, there are a lot of code paths leading
from an interrupts-disabled invocation of call_rcu(), and missing any
one of these can result in excessive callback-invocation latency, and
potentially even system hangs.

This commit therefore uses a timer to guarantee that the wakeup will
eventually occur.  If one of the deferred-wakeup points kicks in, then
the timer is simply cancelled.

This commit also fixes up an incomplete removal of commits that were
intended to plug remaining exit paths, which should have the added
benefit of reducing the overhead of RCU's context-switch hooks.  In
addition, it simplifies leader-to-follower callback-list handoff by
introducing locking.  The call_rcu()-to-leader handoff continues to
use atomic operations in order to maintain good real-time latency for
common-case use of call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Dan Carpenter fix for mod_timer() usage bug found by smatch. ]
2017-07-25 09:53:09 -07:00
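The timer-as-backstop idea in commit 8be6e1b15c can be illustrated with a small userspace model. This is only a sketch: the structure and function names below are hypothetical and do not appear in the kernel, and a boolean stands in for a real timer. It shows that whichever of the two paths runs first delivers the wakeup, and that a deferred-wakeup point cancels the timer.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from the kernel. */
struct deferred_wakeup {
	bool pending;		/* a wakeup was deferred and is still owed */
	bool timer_armed;	/* the backstop timer is queued */
	int wakeups;		/* number of wakeups actually delivered */
};

/* Called where an immediate wakeup is unsafe, e.g. with interrupts disabled. */
static void defer_wakeup(struct deferred_wakeup *dw)
{
	if (dw->pending)
		return;
	dw->pending = true;
	dw->timer_armed = true;	/* arm the backstop */
}

/* Common delivery path; returns true if a wakeup was delivered. */
static bool do_deferred_wakeup(struct deferred_wakeup *dw)
{
	if (!dw->pending)
		return false;
	dw->pending = false;
	dw->wakeups++;
	return true;
}

/* A regular deferred-wakeup point: deliver, then cancel the timer. */
static void deferred_wakeup_point(struct deferred_wakeup *dw)
{
	if (do_deferred_wakeup(dw))
		dw->timer_armed = false;
}

/* Timer expiry: the backstop delivers the wakeup if nobody else did. */
static void backstop_timer_fired(struct deferred_wakeup *dw)
{
	if (!dw->timer_armed)
		return;
	dw->timer_armed = false;
	do_deferred_wakeup(dw);
}
```

In this model a missed deferred-wakeup point costs only latency (the timer period), not a lost wakeup.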
Paul E. McKenney 44c65ff2e3 rcu: Eliminate NOCBs CPU-state Kconfig options
The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing and
are redundant with the rcu_nocbs= boot parameter.  This commit therefore
removes these three Kconfig options and adjusts the rcutorture scripts
to use the boot parameter instead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:43 -07:00
Paul E. McKenney ae91aa0adb rcu: Remove debugfs tracing
RCU's debugfs tracing used to be the only reasonable low-level debug
information available, but ftrace and event tracing have since surpassed
the RCU debugfs level of usefulness.  This commit therefore removes
RCU's debugfs tracing.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:43 -07:00
Paul E. McKenney c4a09ff752 rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option
The PROVE_RCU_REPEATEDLY Kconfig option was initially added due to
the volume of messages from PROVE_RCU: Doing just one per boot would
have required excessive numbers of boots to locate them all.  However,
PROVE_RCU messages are now relatively rare, so there is no longer any
reason to need more than one such message per boot.  This commit therefore
removes the PROVE_RCU_REPEATEDLY Kconfig option.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
2017-06-08 18:52:41 -07:00
Paul E. McKenney fe5ac724d8 rcu: Remove nohz_full full-system-idle state machine
The NO_HZ_FULL_SYSIDLE full-system-idle capability was added in 2013
by commit 0edd1b1784 ("nohz_full: Add full-system-idle state machine"),
but has not been used.  This commit therefore removes it.

If it turns out to be needed later, this commit can always be reverted.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-08 18:52:39 -07:00
Paul E. McKenney 90040c9e30 rcu: Remove *_SLOW_* Kconfig options
The RCU_TORTURE_TEST_SLOW_PREINIT, RCU_TORTURE_TEST_SLOW_PREINIT_DELAY,
RCU_TORTURE_TEST_SLOW_INIT,
RCU_TORTURE_TEST_SLOW_INIT_DELAY, RCU_TORTURE_TEST_SLOW_CLEANUP,
and RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig options are only
useful for torture testing, and there are the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot parameters
that rcutorture can use instead.  The effect of these parameters is to
artificially slow down grace period initialization and cleanup in order
to make some types of race conditions happen more often.

This commit therefore simplifies Tree RCU a bit by removing the Kconfig
options and adding the corresponding kernel parameters to rcutorture's
.boot files instead.  However, this commit also leaves out the kernel
parameters for TREE02, TREE04, and TREE07 in order to have about the
same number of tests slowed as not slowed.  TREE01, TREE03, TREE05,
and TREE06 are slowed, and the rest are not slowed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:38 -07:00
Paul E. McKenney a68a2bb28b rcu: Move docbook comments out of rcupdate.h
The include/linux/rcupdate.h file is included by more than 200
files, so shrinking it should provide some build-time benefits.
This commit therefore moves several docbook comments from rcupdate.h to
kernel/rcu/update.c, kernel/rcu/tree.c, and kernel/rcu/tree_plugin.h, thus
reducing the number of times that the compiler has to scan these comments.
This likely provides only a small benefit, but every little bit helps.

This commit also fixes a malformed bulleted list noted by the 0day
Test Robot.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 18:52:27 -07:00
Paul E. McKenney 6b5fc3a133 rcu: Add memory barriers for NOCB leader wakeup
Wait/wakeup operations do not guarantee ordering on their own.  Instead,
either locking or memory barriers are required.  This commit therefore
adds memory barriers to wake_nocb_leader() and nocb_leader_wait().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: <stable@vger.kernel.org> # 4.6.x
2017-06-08 18:51:59 -07:00
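The ordering requirement described in commit 6b5fc3a133 can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel code: the names are hypothetical, a plain flag stands in for the wait/wakeup mechanism, and atomic_thread_fence() stands in for the kernel's memory barriers. The point is that the flag alone does not order the payload; the fences on both sides do.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical handoff structure; not a kernel type. */
struct handoff {
	int payload;		/* data published by the waker */
	atomic_int ready;	/* stand-in for the wakeup condition */
};

static void handoff_init(struct handoff *h)
{
	h->payload = 0;
	atomic_init(&h->ready, 0);
}

/* Waker side: publish the payload, then signal. */
static void publish_and_wake(struct handoff *h, int value)
{
	h->payload = value;
	/* Order the payload store before the flag store. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&h->ready, 1, memory_order_relaxed);
}

/* Waiter side: check the flag, then read the payload. */
static bool try_consume(struct handoff *h, int *out)
{
	if (!atomic_load_explicit(&h->ready, memory_order_relaxed))
		return false;
	/* Order the flag load before the payload load. */
	atomic_thread_fence(memory_order_seq_cst);
	*out = h->payload;
	atomic_store_explicit(&h->ready, 0, memory_order_relaxed);
	return true;
}
```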
Paul E. McKenney 511324e462 rcu: Use RCU_NOCB_WAKE rather than RCU_NOGP_WAKE
The RCU_NOGP_WAKE_NOT, RCU_NOGP_WAKE, and RCU_NOGP_WAKE_FORCE flags
are used to mediate wakeups for the no-CBs CPU kthreads.  The "NOGP"
really doesn't make any sense, so this commit does s/NOGP/NOCB/.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:40 -07:00
Paul E. McKenney ea9b0c8a26 rcu: Add lockdep_assert_held() teeth to tree_plugin.h
Comments can be helpful, but assertions carry more force.  This commit
therefore adds lockdep_assert_held() and RCU_LOCKDEP_WARN() calls to
enforce lock-held and interrupt-disabled preconditions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:37 -07:00
Paul E. McKenney 17c7798bea rcu: Update rcu_bootup_announce_oddness()
This commit updates rcu_bootup_announce_oddness() to check additional
Kconfig options and module/boot parameters.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:35 -07:00
Paul E. McKenney 59d80fd835 rcu: Print out rcupdate.c non-default boot-time settings
This commit adds a rcupdate_announce_bootup_oddness() function to
print out non-default values of significant kernel boot parameter
settings to aid in debugging.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:35 -07:00
Paul E. McKenney e28371c891 rcu: Remove obsolete reference to synchronize_kernel()
The synchronize_kernel() primitive was removed in favor of
synchronize_sched() more than a decade ago, and it seems likely that
rather few kernel hackers are familiar with it.  Its continued presence
is therefore providing more confusion than enlightenment.  This commit
therefore removes the reference from the synchronize_sched() header
comment, and adds the corresponding information to the synchronize_rcu()
header comment.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:25 -07:00
Paul E. McKenney 5b72f9643b rcu: Complain if blocking in preemptible RCU read-side critical section
Although preemptible RCU allows its read-side critical sections to be
preempted, general blocking is forbidden.  The reason for this is that
excessive preemption times can be handled by CONFIG_RCU_BOOST=y, but a
voluntarily blocked task doesn't care how high you boost its priority.
Because preemptible RCU is a global mechanism, one ill-behaved reader
hurts everyone.  Hence the prohibition against general blocking in
RCU-preempt read-side critical sections.  Preemption yes, blocking no.

This commit enforces this prohibition.

There is a special exception for the -rt patchset (which they kindly
volunteered to implement):  It is OK to block (as opposed to merely being
preempted) within an RCU-preempt read-side critical section, but only if
the blocking is subject to priority inheritance.  This exception permits
CONFIG_RCU_BOOST=y to get -rt RCU readers out of trouble.

Why doesn't this exception also apply to mainline's rt_mutex?  Because
of the possibility that someone does general blocking while holding
an rt_mutex.  Yes, the priority boosting will affect the rt_mutex,
but it won't help with the task doing general blocking while holding
that rt_mutex.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-06-08 08:25:24 -07:00
Paul E. McKenney 933dfbd7c4 rcu: Open-code the rcu_cblist_n_lazy_cbs() function
Because the rcu_cblist_n_lazy_cbs() just samples the ->len_lazy counter,
and because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_lazy_cbs(p) as p->len_lazy, cutting out
a level of indirection.  This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-02 09:22:48 -07:00
Paul E. McKenney 4b27f20b40 rcu: Open-code the rcu_cblist_n_cbs() function
Because the rcu_cblist_n_cbs() just samples the ->len counter, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_n_cbs(p) as p->len, cutting out a level
of indirection.  This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-02 09:21:59 -07:00
Paul E. McKenney 8ef0f37efb rcu: Open-code the rcu_cblist_empty() function
Because the rcu_cblist_empty() just samples the ->head pointer, and
because the rcu_cblist structure is quite straightforward, it makes
sense to open-code rcu_cblist_empty(p) as !p->head, cutting out a
level of indirection.  This commit makes this change.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-02 08:18:40 -07:00
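The three open-coding commits above (933dfbd7c4, 4b27f20b40, 8ef0f37efb) rely on the simple callback list being small enough to read directly. A minimal sketch of such a list follows, assuming a head pointer, a tail pointer, and the two counters named in the commit messages; the type and function names are hypothetical, and the wrapper is kept only to show what the open-coded form replaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback and list types modeled on the commit messages. */
struct cb_sketch {
	struct cb_sketch *next;
};

struct cblist_sketch {
	struct cb_sketch *head;		/* NULL when the list is empty */
	struct cb_sketch **tail;	/* points at the last ->next, or at ->head */
	long len;			/* total callbacks queued */
	long len_lazy;			/* subset of callbacks marked lazy */
};

static void cblist_sketch_init(struct cblist_sketch *l)
{
	l->head = NULL;
	l->tail = &l->head;
	l->len = 0;
	l->len_lazy = 0;
}

static void cblist_sketch_enqueue(struct cblist_sketch *l,
				  struct cb_sketch *cb, bool lazy)
{
	cb->next = NULL;
	*l->tail = cb;
	l->tail = &cb->next;
	l->len++;
	if (lazy)
		l->len_lazy++;
}

/* The kind of one-line wrapper that the commits remove. */
static bool cblist_sketch_empty(const struct cblist_sketch *l)
{
	return !l->head;
}
```

With the structure this simple, a caller can write !l->head, l->len, or l->len_lazy directly and lose nothing in readability.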
Paul E. McKenney f2094107ac Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' into HEAD
doc.2017.04.12a: Documentation updates
fixes.2017.04.19a: Miscellaneous fixes
srcu.2017.04.21a: Parallelize SRCU callback handling
2017-04-21 06:00:13 -07:00
Nicholas Mc Guire 5455a7f6a8 rcu: Use true/false in assignment to bool
This commit makes the parse_rcu_nocb_poll() function assign true
(rather than the constant 1) to the bool variable rcu_nocb_poll.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:20 -07:00
Paul E. McKenney 15fecf89e4 srcu: Abstract multi-tail callback list handling
RCU has only one multi-tail callback list, which is implemented via
the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
rcu_data structure, and whose operations are open-coded throughout the
Tree RCU implementation.  This has been more or less OK in the past,
but upcoming callback-list optimizations in SRCU could really use
a multi-tail callback list there as well.

This commit therefore abstracts the multi-tail callback list handling
into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
The simple head-and-tail pointer callback list is also abstracted and
applied everywhere except for the NOCB callback-offload lists.  (Yes,
the plan is to apply them there as well, but this commit is already
bigger than would be good.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
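To make the "multi-tail callback list" of commit 15fecf89e4 concrete, here is a simplified userspace sketch: one linked list with an array of tail pointers, each marking the end of a segment. The segment names and all function names are hypothetical, and the advance step is deliberately crude (it marks everything queued so far as done), whereas the real code advances segments according to grace-period numbers.

```c
#include <stddef.h>

/* Hypothetical segment indices; the last segment receives new callbacks. */
enum { SEG_DONE, SEG_WAIT, SEG_NEXT_READY, SEG_NEXT, SEG_COUNT };

struct cb_node {
	struct cb_node *next;
};

struct segcblist_sketch {
	struct cb_node *head;
	struct cb_node **tails[SEG_COUNT];	/* tails[i] ends segment i */
	long len;
};

static void seg_init(struct segcblist_sketch *s)
{
	s->head = NULL;
	for (int i = 0; i < SEG_COUNT; i++)
		s->tails[i] = &s->head;
	s->len = 0;
}

/* New callbacks always join the final segment. */
static void seg_enqueue(struct segcblist_sketch *s, struct cb_node *cb)
{
	cb->next = NULL;
	*s->tails[SEG_NEXT] = cb;
	s->tails[SEG_NEXT] = &cb->next;
	s->len++;
}

/* Simplification: treat every callback queued so far as ready to invoke. */
static void seg_advance_all(struct segcblist_sketch *s)
{
	for (int i = SEG_DONE; i < SEG_NEXT; i++)
		s->tails[i] = s->tails[SEG_NEXT];
}

/* Detach and return the done segment, or NULL if it is empty. */
static struct cb_node *seg_extract_done(struct segcblist_sketch *s)
{
	struct cb_node *done = s->head;

	if (s->tails[SEG_DONE] == &s->head)
		return NULL;
	s->head = *s->tails[SEG_DONE];
	*s->tails[SEG_DONE] = NULL;
	/* Any tail that pointed into the detached part now restarts at head. */
	for (int i = SEG_COUNT - 1; i >= SEG_DONE; i--)
		if (s->tails[i] == s->tails[SEG_DONE])
			s->tails[i] = &s->head;
	for (struct cb_node *p = done; p; p = p->next)
		s->len--;
	return done;
}
```

Keeping all segments in one list means that moving callbacks between segments is a pointer assignment rather than a list splice.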
Paul E. McKenney 9226b10d78 rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions
The rcu_all_qs() and rcu_note_context_switch() functions do a series of checks,
taking various actions to supply RCU with quiescent states, depending
on the outcomes of the various checks.  This is a bit much for scheduling
fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
the rcu_dynticks structure that acts as a global guard for these checks.
Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
check the ->rcu_urgent_qs field, find it false, and simply return.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-04-18 11:38:18 -07:00
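The guard pattern from commit 9226b10d78 can be sketched as follows. The structure and function names are hypothetical, and the slow path is reduced to a counter; only the shape of the fastpath check is meant to match the description above.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; only the guard flag mirrors the commit text. */
struct qs_state_sketch {
	bool urgent_qs;		/* set only when RCU urgently needs a quiescent state */
	int slow_path_runs;	/* how often the expensive checks actually ran */
};

/* Stand-in for the series of checks and quiescent-state reports. */
static void report_qs_slow_path(struct qs_state_sketch *st)
{
	st->slow_path_runs++;
	st->urgent_qs = false;
}

/* Scheduling fastpath: one load and one branch in the common case. */
static void note_context_switch_sketch(struct qs_state_sketch *st)
{
	if (!st->urgent_qs)
		return;
	report_qs_slow_path(st);
}
```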
Ingo Molnar b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar ae7e81c077 sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.

Create empty placeholder header that maps to <linux/types.h>.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:27 +01:00
Paul E. McKenney 31945aa9f1 Merge branches 'doc.2017.01.15b', 'dyntick.2017.01.23a', 'fixes.2017.01.23a', 'srcu.2017.01.25a' and 'torture.2017.01.15b' into HEAD
doc.2017.01.15b: Documentation updates
dyntick.2017.01.23a: Dyntick tracking consolidation
fixes.2017.01.23a: Miscellaneous fixes
srcu.2017.01.25a: SRCU rewrite, fixes, and verification
torture.2017.01.15b: Torture-test updates
2017-01-25 12:56:05 -08:00
Paul E. McKenney 02a5c550b2 rcu: Abstract extended quiescent state determination
This commit is the fourth step towards full abstraction of all accesses
to the ->dynticks counter, implementing previously open-coded checks and
comparisons in new rcu_dynticks_in_eqs() and rcu_dynticks_in_eqs_since()
functions.  This abstraction will ease changes to the ->dynticks counter
operation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:44:18 -08:00
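A rough model of the two helpers named in commit 02a5c550b2 is shown below. It assumes, for illustration only, a counter that is incremented on every transition into or out of an extended quiescent state (EQS), so that an even value means "in EQS". The names are hypothetical and the real functions operate on kernel per-CPU data with atomic operations.

```c
#include <stdbool.h>

/* Hypothetical model: even counter value = in EQS (an assumption of this sketch). */
struct dynticks_sketch {
	unsigned int counter;
};

static void eqs_enter_sketch(struct dynticks_sketch *d)
{
	d->counter++;	/* odd -> even */
}

static void eqs_exit_sketch(struct dynticks_sketch *d)
{
	d->counter++;	/* even -> odd */
}

static unsigned int dynticks_snap_sketch(const struct dynticks_sketch *d)
{
	return d->counter;
}

/* Was the CPU in an EQS when the snapshot was taken? */
static bool in_eqs_sketch(unsigned int snap)
{
	return !(snap & 0x1);
}

/* Has the CPU passed through an EQS since the snapshot was taken? */
static bool in_eqs_since_sketch(const struct dynticks_sketch *d,
				unsigned int snap)
{
	return snap != dynticks_snap_sketch(d);
}
```

Callers that go through such helpers never test the low-order bit themselves, which is what lets the counter encoding change later.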
Paul E. McKenney 9831ce3bb4 rcu: Fix comment in rcu_organize_nocb_kthreads()
It used to be that the rcuo callback-offload kthreads were spawned
in rcu_organize_nocb_kthreads(), and the comment before the "for"
loop says as much.  However, this spawning has long since moved to
the CPU-hotplug code, so this commit fixes this comment.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-01-23 11:37:13 -08:00
Paul E. McKenney 52d7e48b86 rcu: Narrow early boot window of illegal synchronous grace periods
The current preemptible RCU implementation goes through three phases
during bootup.  In the first phase, there is only one CPU that is running
with preemption disabled, so that a no-op is a synchronous grace period.
In the second mid-boot phase, the scheduler is running, but RCU has
not yet gotten its kthreads spawned (and, for expedited grace periods,
workqueues are not yet running).  During this time, any attempt to do
a synchronous grace period will hang the system (or complain bitterly,
depending).  In the third and final phase, RCU is fully operational and
everything works normally.

This has been OK for some time, but there have recently been some
synchronous grace periods showing up during the second mid-boot phase.
This code worked "by accident" for a while, but started failing as soon
as expedited RCU grace periods switched over to workqueues in commit
8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue").
Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter).  The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

This commit therefore reworks RCU to permit synchronous grace periods
to proceed during this mid-boot phase.  This commit is therefore a
fix to a regression introduced in v4.9, and is therefore being put
forward post-merge-window in v4.10.

This commit sets a flag from the existing rcu_scheduler_starting()
function which causes all synchronous grace periods to take the expedited
path.  The expedited path now checks this flag, using the requesting task
to drive the expedited grace period forward during the mid-boot phase.
Finally, this flag is updated by a core_initcall() function named
rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

Note that this arrangement assumes that tasks are not sent POSIX signals
(or anything similar) from the time that the first task is spawned
through core_initcall() time.

Fixes: 8b355e3bc1 ("rcu: Drive expedited grace periods from workqueue")
Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stan Kain <stan.kain@gmail.com>
Tested-by: Ivan <waffolz@hotmail.com>
Tested-by: Emanuel Castelo <emanuel.castelo@gmail.com>
Tested-by: Bruno Pesavento <bpesavento@infinito.it>
Tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Frederic Bezies <fredbezies@gmail.com>
Cc: <stable@vger.kernel.org> # 4.9.0-
2017-01-14 21:23:48 -08:00
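The three boot phases described in commit 52d7e48b86 amount to a small state machine. The sketch below models only the dispatch decision; the enum values and function names are hypothetical, and the two transition functions merely stand in for rcu_scheduler_starting() and the core_initcall() mentioned in the commit message.

```c
/* Hypothetical phase and path names, modeled on the commit message. */
enum boot_phase_sketch {
	PHASE_EARLY,		/* one CPU, preemption disabled */
	PHASE_MID_BOOT,		/* scheduler running, kthreads/workqueues not yet */
	PHASE_RUNTIME		/* RCU fully operational */
};

enum gp_path_sketch {
	GP_NOOP,		/* a no-op already is a grace period */
	GP_EXPEDITED_BY_CALLER,	/* requesting task drives an expedited GP */
	GP_RUNTIME		/* normal runtime code paths */
};

static enum boot_phase_sketch boot_phase = PHASE_EARLY;

/* Stand-in for the flag set when the scheduler starts. */
static void scheduler_starting_sketch(void)
{
	boot_phase = PHASE_MID_BOOT;
}

/* Stand-in for the core_initcall() that switches to runtime mode. */
static void exp_runtime_mode_sketch(void)
{
	boot_phase = PHASE_RUNTIME;
}

/* Which path does a synchronous grace-period request take right now? */
static enum gp_path_sketch synchronous_gp_path_sketch(void)
{
	switch (boot_phase) {
	case PHASE_EARLY:
		return GP_NOOP;
	case PHASE_MID_BOOT:
		return GP_EXPEDITED_BY_CALLER;
	default:
		return GP_RUNTIME;
	}
}
```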
Ding Tianhong bedc196915 rcu: Fix soft lockup for rcu_nocb_kthread
Carrying out the following steps results in a softlockup in the
RCU callback-offload (rcuo) kthreads:

1. Connect to ixgbevf, and set the speed to 10Gb/s.
2. Use ifconfig to bring the nic up and down repeatedly.

[  317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[  368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15]
[  368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000
[  368.106005] RIP: 0010:[<ffffffff81579e04>]  [<ffffffff81579e04>] fib_table_lookup+0x14/0x390
[  368.106005] RSP: 0018:ffff88061fc83ce8  EFLAGS: 00000286
[  368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001
[  368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00
[  368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000
[  368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58
[  368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0
[  368.106005] FS:  0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
[  368.106005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0
[  368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  368.106005] Stack:
[  368.106005]  00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00
[  368.106005]  ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146
[  368.106005]  ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0
[  368.106005] Call Trace:
[  368.106005]  <IRQ>
[  368.106005]
[  368.106005]  [<ffffffff815349a6>] ip_route_input_noref+0x516/0xbd0
[  368.106005]  [<ffffffff814ee146>] ? skb_release_data+0xd6/0x110
[  368.106005]  [<ffffffff814ee20a>] ? kfree_skb+0x3a/0xa0
[  368.106005]  [<ffffffff8153698f>] ip_rcv_finish+0x29f/0x350
[  368.106005]  [<ffffffff81537034>] ip_rcv+0x234/0x380
[  368.106005]  [<ffffffff814fd656>] __netif_receive_skb_core+0x676/0x870
[  368.106005]  [<ffffffff814fd868>] __netif_receive_skb+0x18/0x60
[  368.106005]  [<ffffffff814fe4de>] process_backlog+0xae/0x180
[  368.106005]  [<ffffffff814fdcb2>] net_rx_action+0x152/0x240
[  368.106005]  [<ffffffff81077b3f>] __do_softirq+0xef/0x280
[  368.106005]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
[  368.106005]  <EOI>
[  368.106005]
[  368.106005]  [<ffffffff81015d95>] do_softirq+0x65/0xa0
[  368.106005]  [<ffffffff81077174>] local_bh_enable+0x94/0xa0
[  368.106005]  [<ffffffff81114922>] rcu_nocb_kthread+0x232/0x370
[  368.106005]  [<ffffffff81098250>] ? wake_up_bit+0x30/0x30
[  368.106005]  [<ffffffff811146f0>] ? rcu_start_gp+0x40/0x40
[  368.106005]  [<ffffffff8109728f>] kthread+0xcf/0xe0
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140
[  368.106005]  [<ffffffff816147d8>] ret_from_fork+0x58/0x90
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140

==================================cut here==============================

It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not relinquishing the
CPU while doing so.  This commit therefore adds a cond_resched_rcu_qs()
within the loop to allow other tasks to run.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-08-22 07:53:20 -07:00
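The fix in commit bedc196915 follows a common pattern: a long-running loop offers to give up the CPU on each iteration. A userspace approximation is sketched below, with POSIX sched_yield() standing in for cond_resched_rcu_qs(); the list type, the callback, and the function names are hypothetical.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical callback item; not the kernel's rcu_head. */
struct cb_item {
	struct cb_item *next;
	void (*func)(struct cb_item *cb);
};

static int cb_hits;	/* counts invocations of the example callback */

static void count_cb(struct cb_item *cb)
{
	(void)cb;
	cb_hits++;
}

/* Invoke every callback on the list, yielding between invocations. */
static long invoke_callbacks_sketch(struct cb_item *list)
{
	long invoked = 0;

	while (list) {
		struct cb_item *next = list->next;	/* func may free the item */

		list->func(list);
		invoked++;
		sched_yield();	/* stand-in for cond_resched_rcu_qs() */
		list = next;
	}
	return invoked;
}
```

Without the yield, a sufficiently long list keeps the thread on the CPU for the whole traversal, which is the condition the soft-lockup detector reported above.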
Paul E. McKenney 4d03754f04 Merge branches 'doc.2016.06.15a', 'fixes.2016.06.15b' and 'torture.2016.06.14a' into HEAD
doc.2016.06.15a: Documentation updates
fixes.2016.06.15b: Documentation updates
torture.2016.06.14a: Documentation updates
2016-06-15 16:58:03 -07:00
Mark Rutland bc75e99983 rcu: Correctly handle sparse possible cpus
In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.

Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the address generated when the
erroneous per cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the latter being silent, but detectable with
additional patches).

To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch adds a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.

Without this patch, running reboot at a shell can result in an oops
like:

[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G        W       4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [<ffffff80081109a8>] lr : [<ffffff8008110924>] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [<ffffff80081109a8>] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [<ffffff80081128b4>] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [<ffffff80086b987c>] synchronize_net+0x24/0x30
[ 3369.990499] [<ffffff80086ddb54>] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [<ffffff80086b6bb8>] __dev_close_many+0x60/0xd0
[ 3370.002052] [<ffffff80086b6d48>] __dev_close+0x28/0x40
[ 3370.007178] [<ffffff80086bf62c>] __dev_change_flags+0x8c/0x158
[ 3370.012999] [<ffffff80086bf718>] dev_change_flags+0x20/0x60
[ 3370.018558] [<ffffff80086cf7f0>] do_setlink+0x288/0x918
[ 3370.023771] [<ffffff80086d0798>] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [<ffffff80086cee84>] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [<ffffff80086e274c>] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [<ffffff80086ced8c>] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [<ffffff80086e1fe8>] netlink_unicast+0x160/0x238
[ 3370.051309] [<ffffff80086e24b8>] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [<ffffff80086a0070>] sock_sendmsg+0x18/0x30
[ 3370.062168] [<ffffff80086a21cc>] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [<ffffff80086a30ac>] __sys_sendmsg+0x44/0x88
[ 3370.073027] [<ffffff80086a3100>] SyS_sendmsg+0x10/0x20
[ 3370.078153] [<ffffff8008085e70>] el0_svc_naked+0x24/0x28

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Dennis Chen <dennis.chen@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-06-15 16:00:05 -07:00
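The two helpers introduced by commit bc75e99983 can be approximated in userspace as shown below. This is a sketch under assumptions: a 64-bit mask stands in for the kernel's possible-CPU mask (so CPU numbers must be below 64), and the structure, macro, and function names are hypothetical. It shows iteration that skips !possible CPUs inside a leaf's grplo..grphi range, and the per-leaf bit computed relative to grplo.

```c
#include <stdint.h>

/* Hypothetical leaf node covering CPUs grplo..grphi inclusive. */
struct leaf_sketch {
	int grplo;
	int grphi;
};

/* Bit for this CPU within the leaf's local bitmask. */
static unsigned long leaf_cpu_bit_sketch(const struct leaf_sketch *rnp, int cpu)
{
	return 1UL << (cpu - rnp->grplo);
}

/* Next possible CPU at or after cpu, or grphi + 1 when none remains. */
static int next_possible_cpu_sketch(uint64_t possible,
				    const struct leaf_sketch *rnp, int cpu)
{
	while (cpu <= rnp->grphi && !(possible & (UINT64_C(1) << cpu)))
		cpu++;
	return cpu;
}

#define for_each_leaf_possible_cpu_sketch(rnp, possible, cpu) \
	for ((cpu) = next_possible_cpu_sketch((possible), (rnp), (rnp)->grplo); \
	     (cpu) <= (rnp)->grphi; \
	     (cpu) = next_possible_cpu_sketch((possible), (rnp), (cpu) + 1))

/* Build the leaf-local mask of possible CPUs. */
static unsigned long leaf_possible_mask_sketch(const struct leaf_sketch *rnp,
					       uint64_t possible)
{
	unsigned long mask = 0;
	int cpu;

	for_each_leaf_possible_cpu_sketch(rnp, possible, cpu)
		mask |= leaf_cpu_bit_sketch(rnp, cpu);
	return mask;
}
```

For a leaf covering CPUs 4 through 7 with only CPUs 4 and 6 possible, the loop body runs twice and never touches per-CPU data for CPUs 5 or 7.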
Paul E. McKenney 4e9a073f60 torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
This commit removes CONFIG_RCU_TORTURE_TEST_RUNNABLE in favor of the
already-existing rcutorture.torture_runnable kernel boot parameter.
It also converts an #ifdef into IS_ENABLED(), saving a few lines of code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-06-14 16:02:15 -07:00
Paul E. McKenney 40e0a6cfd5 rcu: Move expedited code from tree_plugin.h to tree_exp.h
People have been having some difficulty finding their way around the
RCU code.  This commit therefore pulls some of the expedited grace-period
code from tree_plugin.h to a new tree_exp.h file.  This commit is strictly
code movement.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-06-14 16:01:42 -07:00
Paul E. McKenney aff12cdf86 rcu: Consolidate expedited GP code into exp_funnel_lock()
This commit pulls the grace-period-start counter adjustment and tracing
from synchronize_rcu_expedited() and synchronize_sched_expedited()
into exp_funnel_lock(), thus eliminating some code duplication.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:11 -07:00
Paul E. McKenney 179e5dcd1e rcu: Consolidate expedited GP tracing into rcu_exp_gp_seq_snap()
This commit moves some duplicate code from synchronize_rcu_expedited()
and synchronize_sched_expedited() into rcu_exp_gp_seq_snap().  This
doesn't save lines of code, but does eliminate a "tell me twice" issue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:10 -07:00
Paul E. McKenney 4ea3e85b11 rcu: Consolidate expedited GP code into rcu_exp_wait_wake()
Currently, synchronize_rcu_expedited() and rcu_sched_expedited() have
significant duplicate code.  This commit therefore consolidates some of
this code into rcu_exp_wake(), which is now renamed to rcu_exp_wait_wake()
in recognition of its added responsibilities.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:10 -07:00
Paul E. McKenney f6a12f34a4 rcu: Enforce expedited-GP fairness via funnel wait queue
The current mutex-based funnel-locking approach used by expedited grace
periods is subject to severe unfairness.  The problem arises when a
few tasks, making a path from leaves to root, all wake up before other
tasks do.  A new task can then follow this path all the way to the root,
which needlessly delays tasks whose grace period is done, but who do
not happen to acquire the lock quickly enough.

This commit avoids this problem by maintaining per-rcu_node wait queues,
along with a per-rcu_node counter that tracks the latest grace period
sought by an earlier task to visit this node.  If that grace period
would satisfy the current task, instead of proceeding up the tree,
it waits on the current rcu_node structure using a pair of wait queues
provided for that purpose.  This decouples awakening of old tasks from
the arrival of new tasks.

If the wakeups prove to be a bottleneck, additional kthreads can be
brought to bear for that purpose.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:08 -07:00
Paul E. McKenney 4f41530245 rcu: Add expedited-grace-period event tracing
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:06 -07:00
Paul E. McKenney bea2de44ae rcu: Add funnel-locking tracing for expedited grace periods
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:06 -07:00
Paul E. McKenney 26ece8ef6e rcu: Fix synchronize_rcu_expedited() header comment
This commit brings the synchronize_rcu_expedited() function's header
comment into line with the new implementation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-03-31 13:34:04 -07:00
Ingo Molnar 8bc6782fe2 Merge commit 'fixes.2015.02.23a' into core/rcu
Conflicts:
	kernel/rcu/tree.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-15 09:01:06 +01:00
Paul Gortmaker abedf8e241 rcu: Use simple wait queues where possible in rcutree
As of commit dae6e64d2b ("rcu: Introduce proper blocking to no-CBs kthreads
GP waits") the RCU subsystem started making use of wait queues.

Here we convert all additions of RCU wait queues to use simple wait queues,
since they don't need the extra overhead of the full wait queue features.

Originally this was done for RT kernels[1], since we would get things like...

  BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
  in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
  Pid: 8, comm: rcu_preempt Not tainted
  Call Trace:
   [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0
   [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50
   [<ffffffff8106fcf6>] __wake_up+0x36/0x70
   [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680
   [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50
   [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80
   [<ffffffff8105eabb>] kthread+0xdb/0xe0
   [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100
   [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10
   [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60
   [<ffffffff817e0750>] ? gs_change+0xb/0xb

...and hence simple wait queues were deployed on RT out of necessity
(as simple wait uses a raw lock), but mainline might as well take
advantage of the more streamlined support too.

[1] This is a carry forward of work from v3.10-rt; the original conversion
was by Thomas on an earlier -rt version, and Sebastian extended it to
additional post-3.10 added RCU waiters; here I've added a commit log and
unified the RCU changes into one, and uprev'd it to match mainline RCU.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Daniel Wagner 065bb78c5b rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock
rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
not enable the IRQs. lockdep is happy.

After switching over to swait, this is no longer true: swake_up_all()
enables IRQs while processing the waiters. __do_softirq() can then
run and will eventually call rcu_process_callbacks(), which wants to
grab rnp->lock.

Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
switch over to swait.

If we held rnp->lock while using swait, lockdep would report the
following:

 =================================
 [ INFO: inconsistent lock state ]
 4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
 ---------------------------------
 inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
 rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
 {IN-SOFTIRQ-W} state was registered at:
   [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0
   [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0
   [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0
   [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0
   [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670
   [<ffffffff810b2214>] irq_exit+0x104/0x110
   [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60
   [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80
   [<ffffffff810dba66>] rq_attach_root+0xa6/0x100
   [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650
   [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00
   [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1
   [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f
   [<ffffffff8182cdce>] kernel_init+0xe/0xe0
   [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
 irq event stamp: 76
 hardirqs last  enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
 hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90
 softirqs last  enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0
 softirqs last disabled at (0): [<          (null)>]           (null)
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(rcu_node_1);
   <Interrupt>
     lock(rcu_node_1);
  *** DEADLOCK ***
 1 lock held by rcu_preempt/8:
  #0:  (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0
 stack backtrace:
 CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
 Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
  0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
  0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
  0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
 Call Trace:
  [<ffffffff818379e0>] dump_stack+0x4f/0x7b
  [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0
  [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50
  [<ffffffff811087ad>] mark_lock+0x66d/0x6e0
  [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150
  [<ffffffff81108898>] mark_held_locks+0x78/0xa0
  [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220
  [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10
  [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0
  [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0
  [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220
  [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60
  [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20
  [<ffffffff810d2014>] kthread+0x104/0x120
  [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60
  [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260
  [<ffffffff8184231f>] ret_from_fork+0x3f/0x70
  [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
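
The reordering described in commit 065bb78c5b above (snapshot what is needed under the lock, then perform the wakeup after dropping it) can be modeled in userspace. This is a sketch, not the kernel code: `lnode`, `wake_waiters`, and `gp_cleanup_sketch` are hypothetical names, and the "lock" is a plain flag so that the ordering can be checked in a single thread.

```c
#include <stdbool.h>

/* Hypothetical model of a node whose lock must not be held during wakeups. */
struct lnode {
    bool locked;           /* models rnp->lock being held */
    bool waiters_pending;  /* state that may only be examined under the lock */
};

static bool woke_with_lock_held;  /* set if a wakeup ever runs under the lock */
static int wakeups;               /* number of wakeups issued */

/* Stand-in for the wakeup path, which may enable interrupts internally. */
static inline void wake_waiters(const struct lnode *np)
{
    if (np->locked)
        woke_with_lock_held = true;
    wakeups++;
}

/* Take a snapshot under the lock, then wake only after the lock is dropped. */
static inline void gp_cleanup_sketch(struct lnode *np)
{
    bool need_wake;

    np->locked = true;                 /* acquire */
    need_wake = np->waiters_pending;   /* snapshot protected state */
    np->waiters_pending = false;
    np->locked = false;                /* release before waking anyone */

    if (need_wake)
        wake_waiters(np);
}
```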
Boqun Feng 67c583a7de RCU: Privatize rcu_node::lock
In patch:

"rcu: Add transitivity to remaining rcu_node ->lock acquisitions"

All locking operations on rcu_node::lock were replaced with wrappers
because of the need for transitivity, which means we should never
write code that uses LOCK primitives alone (i.e., without a proper barrier
following) on rcu_node::lock outside those wrappers. We can detect
this kind of misuse of rcu_node::lock in the future by adding the __private
modifier to rcu_node::lock.

To privatize rcu_node::lock, unlock wrappers are also needed. Replacing
spinlock unlocks with these wrappers not only privatizes rcu_node::lock
but also makes it easier to figure out critical sections of rcu_node.

This patch adds the __private modifier to rcu_node::lock and wraps every
access to it in ACCESS_PRIVATE(). In addition, unlock wrappers are
added, and raw_spin_unlock(&rnp->lock) and its friends are replaced with
those wrappers.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:54 -08:00
Chen Gang 1914aab543 rcu: Remove useless rcu_data_p when !PREEMPT_RCU
The related warning from gcc 6.0:

  In file included from kernel/rcu/tree.c:4630:0:
  kernel/rcu/tree_plugin.h:810:40: warning: ‘rcu_data_p’ defined but not used [-Wunused-const-variable]
   static struct rcu_data __percpu *const rcu_data_p = &rcu_sched_data;
                                          ^~~~~~~~~~

Also remove the always-redundant rcu_data_p in tree.c.

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2016-02-23 19:59:53 -08:00
Paul E. McKenney 648c630c64 Merge branches 'doc.2015.12.05a', 'exp.2015.12.07a', 'fixes.2015.12.07a', 'list.2015.12.04b' and 'torture.2015.12.05a' into HEAD
doc.2015.12.05a:  Documentation updates
exp.2015.12.07a:  Expedited grace-period updates
fixes.2015.12.07a:  Miscellaneous fixes
list.2015.12.04b:  Linked-list updates
torture.2015.12.05a:  Torture-test updates
2015-12-07 17:02:54 -08:00
Paul E. McKenney a87f203e27 rcu: Eliminate unused rcu_init_one() argument
Now that the rcu_state structure's ->rda field is compile-time initialized,
there is no need to pass the per-CPU rcu_data structure into rcu_init_one().
This commit therefore eliminates this now-unused parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:19 -08:00
Paul E. McKenney 46a5d164db rcu: Stop disabling interrupts in scheduler fastpaths
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal.  Especially given that there are regions of the scheduler that
already have interrupts disabled.

This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
  by Peter Zijlstra. ]
2015-12-04 12:27:31 -08:00
Paul E. McKenney f0f2e7d307 rcu: Avoid tick_nohz_active checks on NOCBs CPUs
Currently, rcu_prepare_for_idle() checks for tick_nohz_active, even on
individual NOCBs CPUs, unless all CPUs are marked as NOCBs CPUs at build
time.  This check is pointless on NOCBs CPUs because they never have any
callbacks posted, given that all of their callbacks are handed off to the
corresponding rcuo kthread.  There is a check for individually designated
NOCBs CPUs, but it pointlessly follows the check for tick_nohz_active.

This commit therefore moves the check for individually designated NOCBs
CPUs up with the check for CONFIG_RCU_NOCB_CPU_ALL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:31 -08:00
Paul E. McKenney 699d403520 rcu: Fix obsolete rcu_bootup_announce_oddness() comment
This function no longer has #ifdefs, so this commit removes the
header comment calling them out.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:30 -08:00
Paul E. McKenney 8ba9153b2c rcu: Remove lock-acquisition loop from rcu_read_unlock_special()
Several releases have come and gone without the warning triggering,
so remove the lock-acquisition loop.  Retain the WARN_ON_ONCE()
out of sheer paranoia.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:30 -08:00
Paul E. McKenney 5a9be7c628 rcu: Add rcu_normal kernel parameter to suppress expediting
Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads.  This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:53 -08:00
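
The precedence rule stated in commit 5a9be7c628 above ("rcu_normal wins") can be written as a small decision function. This is a sketch only: `gp_is_expedited` and its parameters are hypothetical names, and it assumes that rcu_expedited promotes normal requests to expedited ones in the same way that rcu_normal demotes expedited requests.

```c
#include <stdbool.h>

/*
 * Decide whether a grace-period request ends up expedited.
 *   caller_wants_expedited: the caller used an expedited primitive
 *   rcu_expedited:          boot/sysfs setting asking to expedite everything
 *   rcu_normal:             boot/sysfs setting asking to suppress expediting
 * rcu_normal takes precedence over both of the others.
 */
static inline bool gp_is_expedited(bool caller_wants_expedited,
                                   bool rcu_expedited, bool rcu_normal)
{
    if (rcu_normal)
        return false;
    return caller_wants_expedited || rcu_expedited;
}
```

Under this model, booting with rcu_expedited set and later setting rcu_normal through sysfs switches all subsequent requests back to normal grace periods.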
Paul E. McKenney 6cf1008122 rcu: Add transitivity to remaining rcu_node ->lock acquisitions
The rule is that all acquisitions of the rcu_node structure's ->lock
must provide transitivity:  The lock is not acquired that frequently,
and sorting out exactly which required it and which did not would be
a maintenance nightmare.  This commit therefore supplies the needed
transitivity to the remaining ->lock acquisitions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
Peter Zijlstra 2a67e741bb rcu: Create transitive rnp->lock acquisition functions
Providing RCU's memory-ordering guarantees requires that the rcu_node
tree's locking provide transitive memory ordering, which the Linux kernel's
spinlocks currently do not provide unless smp_mb__after_unlock_lock()
is used.  Having a separate smp_mb__after_unlock_lock() after each and
every lock acquisition is error-prone, hard to read, and a bit annoying,
so this commit provides wrapper functions that pull in the
smp_mb__after_unlock_lock() invocations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
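
The wrapper idea from commit 2a67e741bb above, where the barrier is pulled into the lock-acquisition function so that callers cannot forget it, can be sketched in userspace with C11 atomics. The names `tnode`, `tnode_lock`, and `tnode_unlock` are hypothetical, the spinlock is a simple test-and-set lock, and the sequentially consistent fence merely stands in for smp_mb__after_unlock_lock().

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct rcu_node; only the lock matters here. */
struct tnode {
    atomic_flag lock;       /* simple test-and-set spinlock */
    unsigned long qsmask;   /* example of data protected by the lock */
};

/* Acquire the lock and then issue a full fence, so that every acquisition
 * made through this wrapper gets the stronger ordering automatically. */
static inline void tnode_lock(struct tnode *np)
{
    while (atomic_flag_test_and_set_explicit(&np->lock, memory_order_acquire))
        ;   /* spin until the lock is free */
    atomic_thread_fence(memory_order_seq_cst);  /* stand-in for the barrier */
}

/* Matching unlock wrapper, so callers never touch the lock field directly. */
static inline void tnode_unlock(struct tnode *np)
{
    atomic_flag_clear_explicit(&np->lock, memory_order_release);
}
```

Funnelling every acquisition through one wrapper trades a small amount of overhead on paths that might not need the barrier for not having to audit each call site individually.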
Paul E. McKenney d2856b046d Merge branches 'fixes.2015.10.06a' and 'exp.2015.10.07a' into HEAD
exp.2015.10.07a:  Reduce OS jitter of RCU-sched expedited grace periods.
fixes.2015.10.06a:  Miscellaneous fixes.
2015-10-07 16:05:21 -07:00
Paul E. McKenney b08517c76d rcu: Enable stall warnings for synchronize_rcu_expedited()
This commit redirects synchronize_rcu_expedited()'s wait to
synchronize_sched_expedited_wait(), thus enabling RCU CPU
stall warnings.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-07 16:02:50 -07:00
Paul E. McKenney 74611ecb0f rcu: Add online/offline info to expedited stall warning message
This commit makes the RCU CPU stall warning message print online/offline
indications immediately after the CPU number.  A "O" indicates global
offline, a "." global online, and a "o" indicates RCU believes that the
CPU is offline for the current grace period and "." otherwise, and an
"N" indicates that RCU believes that the CPU will be offline for the
next grace period, and "." otherwise, all right after the CPU number.
So for CPU 10, you would normally see "10-...:" indicating that everything
believes that the CPU is online.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-07 16:02:50 -07:00
Paul E. McKenney dcdb8807ba rcu: Consolidate expedited CPU selection
Now that sync_sched_exp_select_cpus() and sync_rcu_exp_select_cpus()
are identical aside from the argument to smp_call_function_single(),
this commit consolidates them with a functional argument.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-07 16:02:50 -07:00
Paul E. McKenney 7f21aeef72 rcu: Add online/offline info to stall warning message
This commit makes the RCU CPU stall warning message print online/offline
indications immediately after a hyphen following the CPU number.  A "O"
indicates that the global CPU-hotplug system believes that the CPU is
online, a "o" that RCU perceived the CPU to be online at the beginning
of the current expedited grace period, and an "N" that RCU currently
believes that it will perceive the CPU as being online at the beginning
of the next expedited grace period, with "." otherwise for all three
indications.  So for CPU 10, you would normally see "10-OoN:" indicating
that everything believes that the CPU is online.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-06 11:10:18 -07:00
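
The flag format described in commit 7f21aeef72 above can be reproduced with a short formatting helper. This is an illustrative sketch: `format_cpu_flags` and its parameter names are hypothetical, and only the three-character encoding given in the commit message is modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format "<cpu>-XYZ:" where
 *   X is 'O' if the CPU-hotplug system believes the CPU is online,
 *   Y is 'o' if RCU saw the CPU online at the start of the current
 *     expedited grace period,
 *   Z is 'N' if RCU expects to see it online at the start of the next one,
 * and '.' is printed for any indication that does not hold.
 * Returns the value of snprintf().
 */
static inline int format_cpu_flags(char *buf, size_t len, int cpu,
                                   bool hotplug_online, bool online_this_gp,
                                   bool online_next_gp)
{
    return snprintf(buf, len, "%d-%c%c%c:", cpu,
                    hotplug_online ? 'O' : '.',
                    online_this_gp ? 'o' : '.',
                    online_next_gp ? 'N' : '.');
}
```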
Boqun Feng b6a4ae766e rcu: Use rcu_callback_t in call_rcu*() and friends
As we now have the rcu_callback_t typedef as the type of RCU callbacks, we
should use it in call_rcu*() and friends as the type of parameters. This
could save us a few lines of code and make it clear which functions
require an RCU callback rather than some other callback as their argument.

Besides, this can also help cscope to generate a better database for
code reading.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06 11:08:05 -07:00
Paul E. McKenney 5b74c45890 rcu: Make ->cpu_no_qs be a union for aggregate OR
This commit converts the rcu_data structure's ->cpu_no_qs field
to a union.  The bytewise side of this union allows individual access
to indications as to whether this CPU needs to find a quiescent state
for a normal (.norm) and/or expedited (.exp) grace period.  The setwise
side of the union allows testing whether or not a quiescent state is
needed at all, for either type of grace period.

For now, only .norm is used.  A later commit will introduce the expedited
usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:21 -07:00
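
The union layout described in commit 5b74c45890 above can be sketched in userspace as follows. The type name `noqs_sketch` and the helper `needs_any_qs` are hypothetical; the field names follow the .norm and .exp labels used in the commit message, and the fixed-width types are an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Two one-byte flags that can also be tested together as one 16-bit value. */
union noqs_sketch {
    struct {
        uint8_t norm;   /* quiescent state still needed for a normal GP */
        uint8_t exp;    /* quiescent state still needed for an expedited GP */
    } b;                /* bytewise view */
    uint16_t s;         /* setwise view: nonzero if either flag is set */
};

/* A single comparison answers "is any quiescent state needed at all?" */
static inline bool needs_any_qs(const union noqs_sketch *q)
{
    return q->s != 0;
}
```

The setwise view is what makes the "aggregate OR" test free: checking both grace-period types costs one load and one compare rather than two of each.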
Paul E. McKenney 0d43eb34f9 rcu: Invert passed_quiesce and rename to cpu_no_qs
This commit inverts the sense of the rcu_data structure's ->passed_quiesce
field and renames it to ->cpu_no_qs.  This will allow a later commit to
use an "aggregate OR" operation to test expedited as well as normal grace
periods without added overhead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:21 -07:00
Paul E. McKenney 97c668b8e9 rcu: Rename qs_pending to core_needs_qs
An upcoming commit needs to invert the sense of the ->passed_quiesce
rcu_data structure field, so this commit is taking this opportunity
to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.

So if !rdp->core_needs_qs, then this CPU need not concern itself with
quiescent states, in particular, it need not acquire its leaf rcu_node
structure's ->lock to check.  Otherwise, it needs to report the next
quiescent state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:20 -07:00
Paul E. McKenney 8203d6d0ee rcu: Use single-stage IPI algorithm for RCU expedited grace period
The current preemptible-RCU expedited grace-period algorithm invokes
synchronize_sched_expedited() to enqueue all tasks currently running
in a preemptible-RCU read-side critical section, then waits for all the
->blkd_tasks lists to drain.  This works, but results in both an IPI and
a double context switch even on CPUs that do not happen to be running
in a preemptible RCU read-side critical section.

This commit implements a new algorithm that causes less OS jitter.
This new algorithm IPIs all online CPUs that are not idle (from an
RCU perspective), but refrains from self-IPIs.  If a CPU receiving
this IPI is not in a preemptible RCU read-side critical section (or
is just now exiting one), it pushes quiescence up the rcu_node tree,
otherwise, it sets a flag that will be handled by the upcoming outermost
rcu_read_unlock(), which will then push quiescence up the tree.

The expedited grace period must of course wait on any pre-existing blocked
readers, and newly blocked readers must be queued carefully based on
the state of both the normal and the expedited grace periods.  This
new queueing approach also avoids the need to update boost state,
courtesy of the fact that blocked tasks are no longer ever migrated to
the root rcu_node structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:19 -07:00
Paul E. McKenney b9585e940a rcu: Consolidate tree setup for synchronize_rcu_expedited()
This commit replaces sync_rcu_preempt_exp_init1() and
sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug()
and sync_exp_reset_tree(), which will also be used by
synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which
contains code specific to synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:18 -07:00
Paul E. McKenney 7922cd0e56 rcu: Move rcu_report_exp_rnp() to allow consolidation
This is a nearly pure code-movement commit, moving rcu_report_exp_rnp(),
sync_rcu_preempt_exp_done(), and rcu_preempted_readers_exp() so
that later commits can make synchronize_sched_expedited() use them.
The non-code-movement portion of this commit tags rcu_report_exp_rnp()
as __maybe_unused to avoid build errors when CONFIG_PREEMPT=n.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:18 -07:00
Paul E. McKenney f4ecea309d rcu: Use rsp->expedited_wq instead of sync_rcu_preempt_exp_wq
Now that there is an ->expedited_wq waitqueue in each rcu_state structure,
there is no need for the sync_rcu_preempt_exp_wq global variable.  This
commit therefore substitutes ->expedited_wq for sync_rcu_preempt_exp_wq.
It also initializes ->expedited_wq only once at boot instead of at the
start of each expedited grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-09-20 21:16:17 -07:00
Paul E. McKenney 8ff4fbfd69 Merge branches 'fixes.2015.07.22a' and 'initexp.2015.08.04a' into HEAD
fixes.2015.07.22a: Miscellaneous fixes.
initexp.2015.08.04a: Initialization and expedited updates.
	(Single branch due to conflicts.)
2015-08-04 08:40:58 -07:00
Paul E. McKenney 9a54f98e34 rcu: Don't disable CPU hotplug during OOM notifiers
RCU's rcu_oom_notify() disables CPU hotplug in order to stabilize the
list of online CPUs, which it traverses.  However, this is completely
pointless because smp_call_function_single() will quietly fail if invoked
on an offline CPU.  Because the count of requests is incremented in the
rcu_oom_notify_cpu() function that is remotely invoked, everything works
nicely even in the face of concurrent CPU-hotplug operations.

Furthermore, in recent kernels, invoking get_online_cpus() from an OOM
notifier can result in deadlock.  This commit therefore removes the
call to get_online_cpus() and put_online_cpus() from rcu_oom_notify().

Reported-by: Marcin Ślusarz <marcin.slusarz@gmail.com>
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Marcin Ślusarz <marcin.slusarz@gmail.com>
2015-07-22 15:27:43 -07:00
Paul E. McKenney f78f5b90c4 rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()
This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for
consistency with the WARN() series of macros.  This also requires
inverting the sense of the conditional, which this commit also does.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2015-07-22 15:27:32 -07:00
Paul E. McKenney bc17ea1092 rcu: Fix obsolete priority-boosting comment
Tasks are no longer migrated to the root rcu_node, so there is no
longer any need for a boost kthread for the root rcu_node, and there no
longer is such a kthread.  This commit therefore fixes the comment in
rcu_boost_kthread()'s header to reflect this new reality.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-22 15:27:28 -07:00
Paul E. McKenney 704dd435ac rcu: Consolidate last open-coded expedited memory barrier
One of the requirements on RCU grace periods is that if there is a
causal chain of operations that starts after one grace period and
ends before another grace period, then the two grace periods must
be serialized.  There has been (and might still be) code that relies
on this, for example, certain types of reference-counting code that
does a call_rcu() within an RCU callback function.

This requirement is why there is an smp_mb() at the end of both
synchronize_sched_expedited() and synchronize_rcu_expedited().
However, this is the only smp_mb() in these functions, so it would
be nicer to consolidate it into rcu_exp_gp_seq_end().  This commit
does just that.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-17 14:58:59 -07:00
Paul E. McKenney 29fd930940 rcu: Use funnel locking for synchronize_rcu_expedited()'s polling loop
This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock()
polling loop in favor of the funnel-locking scheme that was abstracted
from synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-17 14:58:56 -07:00
Paul E. McKenney 543c6158f6 rcu: Make synchronize_rcu_expedited() use sequence-counter scheme
Although synchronize_rcu_expedited() uses a sequence-counter scheme, it
is based on a single increment per grace period, which means that tasks
piggybacking off of concurrent grace periods may be forced to wait longer
than necessary.  This commit therefore applies the new sequence-count
functions developed for synchronize_sched_expedited() to speed things
up a bit and to consolidate the sequence-counter implementation.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-17 14:58:52 -07:00
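
One way to picture the sequence-counter scheme mentioned in commit 543c6158f6 above is a counter that is odd while a grace period is in progress and even otherwise. The following is a sketch under that assumption, not the kernel implementation: the function names are hypothetical, memory barriers are omitted, and the "+3, clear low bit" snapshot rule is an assumption chosen so that a snapshot always names the end of a grace period that starts after the snapshot was taken.

```c
#include <limits.h>
#include <stdbool.h>

/* Begin a grace period: the counter becomes odd. */
static inline void seq_start(unsigned long *sp) { (*sp)++; }

/* End a grace period: the counter becomes even again. */
static inline void seq_end(unsigned long *sp) { (*sp)++; }

/* Counter value at which a full grace period will have elapsed from now.
 * If one is already in progress (odd), it does not count; wait for the next. */
static inline unsigned long seq_snap(const unsigned long *sp)
{
    return (*sp + 3) & ~0x1UL;
}

/* Wraparound-tolerant test: has the counter reached the snapshot `s`? */
static inline bool seq_done(const unsigned long *sp, unsigned long s)
{
    return ULONG_MAX / 2 >= *sp - s;
}
```

With two increments per grace period, a task that arrives while the counter is even can be satisfied by the very next grace period, which a single increment per grace period cannot express.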
Paul E. McKenney 75c27f119b rcu: Remove CONFIG_RCU_CPU_STALL_INFO
The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
releases with no complaints, so it is time to eliminate this Kconfig
option entirely, so that the long-form RCU CPU stall warnings cannot
be disabled.  This commit does just that.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-17 14:58:44 -07:00
Paul E. McKenney 9b68387450 rcu: Stop disabling CPU hotplug in synchronize_rcu_expedited()
The fact that tasks could be migrated from leaf to root rcu_node
structures meant that synchronize_rcu_expedited() had to disable
CPU hotplug.  However, tasks now stay put, so this commit removes the
CPU-hotplug disabling from synchronize_rcu_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-17 14:58:42 -07:00
Alexander Gordeev 426216970e rcu: Simplify arithmetic to calculate number of RCU nodes
This update makes the arithmetic that calculates the number of RCU nodes
more straightforward and easier to read.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-07-15 14:45:21 -07:00
Linus Torvalds 43224b96af Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration
     disabled at runtime.

   - Optimize run time overhead of hrtimer interrupt by making the clock
     offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some
     problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by
     introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they
  depend on them plus changes all over the tree to cleanup API changes
  and redundant code, which got copied all over the place.  The y2038
  changes touch s390 to remove the last non 2038 safe code related to
  boot/persistent clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...
2015-06-22 18:57:44 -07:00
Thomas Gleixner bc7a34b8b9 timer: Reduce timer migration overhead if disabled
Eric reported that the timer_migration sysctl is not really nice
performance-wise, as it needs to check at every timer insertion whether
the feature is enabled or not. Further, the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered using a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ,
migration is enabled, provided the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
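
The enablement rule described in commit bc7a34b8b9 above can be summarized as a cached per-base flag that is recomputed only when one of its inputs changes. This is a sketch, not the kernel code: `timer_base_sketch`, `sysctl_timer_migration_sketch`, and `update_migration` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical per-CPU timer base that caches the migration decision. */
struct timer_base_sketch {
    bool nohz_active;        /* has the kernel switched to NOHZ mode? */
    bool migration_enabled;  /* cached result, read on the timer hot path */
};

/* User-visible knob; it starts out enabled, as the commit describes. */
static bool sysctl_timer_migration_sketch = true;

/* Recompute the cached flag when either input changes.  Migration is on
 * only if NOHZ is active AND the user has not disabled it via the sysctl. */
static inline void update_migration(struct timer_base_sketch *base)
{
    base->migration_enabled =
        base->nohz_active && sysctl_timer_migration_sketch;
}
```

Caching the result in the base that the hot path already touches is what removes the extra function call and the extra cache-line access that the commit message complains about.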
Paul E. McKenney 0868aa2216 Merge branches 'array.2015.05.27a', 'doc.2015.05.27a', 'fixes.2015.05.27a', 'hotplug.2015.05.27a', 'init.2015.05.27a', 'tiny.2015.05.27a' and 'torture.2015.05.27a' into HEAD
array.2015.05.27a:  Remove all uses of RCU-protected array indexes.
doc.2015.05.27a:  Documentation updates.
fixes.2015.05.27a:  Miscellaneous fixes.
hotplug.2015.05.27a:  CPU-hotplug updates.
init.2015.05.27a:  Initialization/Kconfig updates.
tiny.2015.05.27a:  Updates to Tiny RCU.
torture.2015.05.27a:  Torture-testing updates.
2015-05-27 13:00:49 -07:00
Paul E. McKenney 47d631af58 rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
This commit introduces an RCU_FANOUT_LEAF C-preprocessor macro so
that RCU will build even when CONFIG_RCU_FANOUT_LEAF is undefined.
The RCU_FANOUT_LEAF macro is set to the value of CONFIG_RCU_FANOUT_LEAF
when defined, otherwise it is set to 32 for 32-bit systems and 64 for
64-bit systems.  This commit then makes CONFIG_RCU_FANOUT_LEAF depend
on CONFIG_RCU_EXPERT, so that Kconfig users won't be asked about
CONFIG_RCU_FANOUT_LEAF unless they want to be.
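
A minimal sketch of the fallback pattern described above, written for
user space.  DEMO_CONFIG_RCU_FANOUT_LEAF is a stand-in for the Kconfig
symbol, and the pointer-width test is an assumption used here in place
of the kernel's own 32-bit/64-bit check.

```c
#include <stdint.h>

/* Use the configured value when present, otherwise pick a word-size default. */
#ifdef DEMO_CONFIG_RCU_FANOUT_LEAF
#define DEMO_RCU_FANOUT_LEAF DEMO_CONFIG_RCU_FANOUT_LEAF
#elif UINTPTR_MAX > 0xffffffffu
#define DEMO_RCU_FANOUT_LEAF 64	/* 64-bit build */
#else
#define DEMO_RCU_FANOUT_LEAF 32	/* 32-bit build */
#endif

/* Code can now use the macro whether or not the option was defined. */
static int demo_rcu_fanout_leaf(void)
{
	return DEMO_RCU_FANOUT_LEAF;
}
```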

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:05 -07:00
Paul E. McKenney 05c5df31af rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT
This commit introduces an RCU_FANOUT C-preprocessor macro so that RCU will
build even when CONFIG_RCU_FANOUT is undefined.  The RCU_FANOUT macro is
set to the value of CONFIG_RCU_FANOUT when defined, otherwise it is set
to 32 for 32-bit systems and 64 for 64-bit systems.  This commit then
makes CONFIG_RCU_FANOUT depend on CONFIG_RCU_EXPERT, so that Kconfig
users won't be asked about CONFIG_RCU_FANOUT unless they want to be.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:05 -07:00
Paul E. McKenney 7fa270010e rcu: Convert CONFIG_RCU_FANOUT_EXACT to boot parameter
The CONFIG_RCU_FANOUT_EXACT Kconfig parameter is used primarily (and
perhaps only) by rcutorture to verify that RCU works correctly in specific
rcu_node combining-tree configurations.  It therefore does not make
much sense to have this as a question to people attempting to configure
their kernels.  So this commit creates an rcutree.rcu_fanout_exact=
boot parameter that rcutorture can use, and eliminates the original
CONFIG_RCU_FANOUT_EXACT Kconfig parameter.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2015-05-27 12:59:04 -07:00
Paul E. McKenney 0a0ba1c93f rcu: Adjust ->lock acquisition for tasks no longer migrating
Tasks are no longer migrated away from a given rcu_node structure
when all CPUs corresponding to that rcu_node structure have gone offline.
This means that rcu_read_unlock_special() no longer needs to loop
retrying rcu_node ->lock acquisition because the current task is
guaranteed to stay put.

This commit takes a small and paranoid step towards relying on this
guarantee by placing a WARN_ON_ONCE() just after the early exit from
the lock-acquisition loop.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:58:37 -07:00
Patrick Daly 82efed06d5 rcu: Fix missing task information during rcu-preempt stall
The first item list_for_each_entry_continue(alist) iterates over is
alist->next, rather than alist itself. Consequently,
rcu_print_detail_task_stall_rnp() skips the task referenced by gp_tasks.

Use gp_tasks->prev as the argument to list_for_each_entry_continue()
instead.
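
The off-by-one can be reproduced with a tiny circular list in user
space.  This is only a sketch: struct demo_node and the helpers are
hypothetical and merely mimic the "start at pos->next" behavior of
list_for_each_entry_continue().

```c
#include <stddef.h>

struct demo_node {
	int id;
	struct demo_node *prev, *next;
};

/* Link n in just before head, that is, at the tail of the circular list. */
static void demo_add_tail(struct demo_node *head, struct demo_node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Mimic "continue" iteration: visit the entries strictly after pos and
 * stop on reaching head.  Stores the visited ids and returns their count.
 */
static size_t demo_collect_after(const struct demo_node *head,
				 const struct demo_node *pos,
				 int *out, size_t max)
{
	size_t n = 0;
	const struct demo_node *p;

	for (p = pos->next; p != head && n < max; p = p->next)
		out[n++] = p->id;
	return n;
}
```

Passing the node itself skips it, while passing its ->prev makes the
walk start at the node.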

Signed-off-by: Patrick Daly <pdaly@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:58:02 -07:00
Joe Perches 5ce035fb7d rcu: tree_plugin: Use bool function return values of true/false not 1/0
Use the normal return values for bool functions

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:58:01 -07:00
Paul E. McKenney 3382adbc1b rcu: Eliminate a few CONFIG_RCU_NOCB_CPU_ALL #ifdefs
This commit converts several CONFIG_RCU_NOCB_CPU_ALL #ifdefs to
instead use IS_ENABLED().  This change should help avoid hiding
code from compiler diagnostics.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:58:00 -07:00
Paul E. McKenney 2927a689e8 rcu: Create an immutable rcu_data_p pointer to default rcu_data structure
This commit creates an immutable rcu_data_p pointer that references
rcu_preempt_data for TREE_PREEMPT_RCU builds and that references
rcu_sched_data for TREE_RCU builds.  This rcu_data_p pointer will enable
more code to move from #ifdef to IS_ENABLED().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:58:00 -07:00
Paul E. McKenney b28a7c0166 rcu: Tell the compiler that rcu_state_p is immutable
This commit adds a "const" tag to the declarations of rcu_state_p,
which should allow the compiler to generate better code and also to
catch erroneous assignments to this variable.
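
As an illustration of the "const" placement involved, here is a sketch
with hypothetical demo_ names.  The pointer is immutable, but the
structure it references remains writable.

```c
struct demo_rcu_state {
	const char *name;
	unsigned long gpnum;
};

static struct demo_rcu_state demo_sched_state = { "rcu_sched", 0 };

/*
 * "const" after the "*" makes the pointer itself immutable.  An
 * assignment such as "demo_state_p = NULL;" would fail to compile.
 */
static struct demo_rcu_state *const demo_state_p = &demo_sched_state;

/* The pointed-to state can still be updated through the pointer. */
static void demo_note_new_gp(void)
{
	demo_state_p->gpnum++;
}
```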

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:57:59 -07:00
Paul E. McKenney 727b705baf rcu: Eliminate a few RCU_BOOST #ifdefs in favor of IS_ENABLED()
This commit removes a few RCU_BOOST #ifdefs, replacing them with
IS_ENABLED()-protected return statements.  This relies on the
optimizer to remove any resulting dead code.  There are several other
RCU_BOOST #ifdefs, however these rely on some per-CPU variables that
are available only under RCU_BOOST.  These might be converted later,
if the simplification proves to outweigh the increase in memory footprint.
One hoped-for advantage is more easily locating compiler errors in
obscure combinations of Kconfig parameters.
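
The pattern looks roughly as follows.  This is a simplified sketch:
DEMO_CONFIG_BOOST is a hypothetical 0/1 macro standing in for
IS_ENABLED(CONFIG_RCU_BOOST), and struct demo_rnp is not the kernel's
rcu_node.

```c
#include <stdbool.h>

/* Stand-in for a Kconfig symbol; build with -DDEMO_CONFIG_BOOST=1 to enable. */
#ifndef DEMO_CONFIG_BOOST
#define DEMO_CONFIG_BOOST 0
#endif

struct demo_rnp {
	int blocked_tasks;
	int boost_requests;
};

/*
 * The whole body is parsed and type-checked in every configuration.
 * When the option is off, the early return lets the optimizer discard
 * the rest as dead code.
 */
static bool demo_maybe_boost(struct demo_rnp *rnp)
{
	if (!DEMO_CONFIG_BOOST)
		return false;
	if (rnp->blocked_tasks == 0)
		return false;
	rnp->boost_requests++;
	return true;
}
```

With an #ifdef, a typo in the disabled branch would go unnoticed until
someone built that configuration.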

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <linux-rt-users@vger.kernel.org>
2015-05-27 12:57:59 -07:00
Paul E. McKenney e63c887cfe rcu: Convert from rcu_preempt_state to *rcu_state_p
It would be good to move more code from #ifdef to IS_ENABLED(), but
that does not work if the body of the IS_ENABLED() "if" statement
references a variable (such as rcu_preempt_state) that does not
exist if the IS_ENABLED() Kconfig variable is not set.  This commit
therefore substitutes *rcu_state_p for all uses of rcu_preempt_state
in kernel/rcu/tree_preempt.h, which should enable elimination of
a few #ifdefs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-05-27 12:57:59 -07:00
Paul E. McKenney 7d0ae8086b rcu: Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
This commit moves from the old ACCESS_ONCE() API to the new READ_ONCE()
and WRITE_ONCE() APIs.
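
For readers unfamiliar with the split API, the idea can be sketched
with simplified scalar-only stand-ins.  These DEMO_ macros are
assumptions for illustration (they need a compiler providing
__typeof__, such as GCC or Clang) and are not the kernel's definitions.

```c
/* A volatile access forces exactly one load or store of the object. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static unsigned long demo_gp_flags;

/* Lockless writer: the store cannot be split or repeated by the compiler. */
static void demo_set_flags(unsigned long f)
{
	DEMO_WRITE_ONCE(demo_gp_flags, f);
}

/* Lockless reader: the load cannot be fused with or hoisted past others. */
static unsigned long demo_get_flags(void)
{
	return DEMO_READ_ONCE(demo_gp_flags);
}
```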

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck:  Updated to include kernel/torture.c as suggested by Jason Low. ]
2015-05-27 12:56:15 -07:00
Thomas Gleixner c1ad348b45 tick: Nohz: Rework next timer evaluation
The evaluation of the next timer in the nohz code is based on jiffies
while all the tick internals are nanoseconds based. We also have to
convert hrtimer nanoseconds to jiffies in the !highres case. That's
just wrong and introduces interesting corner cases.

Turn it around and convert the next timer wheel timer expiry and the
rcu event to clock monotonic and base all calculations on
nanoseconds. That identifies the case where no timer is pending
clearly with an absolute expiry value of KTIME_MAX.
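
A sketch of the conversion, under stated assumptions: DEMO_HZ is an
arbitrary tick rate, and the function names and signature are
hypothetical rather than taken from the nohz code.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HZ			250	/* assumed tick rate */
#define DEMO_NSEC_PER_SEC	1000000000LL
#define DEMO_TICK_NSEC		(DEMO_NSEC_PER_SEC / DEMO_HZ)
#define DEMO_KTIME_MAX		INT64_MAX	/* means "no timer pending" */

/* Convert a timer-wheel expiry in jiffies to an absolute monotonic time in ns. */
static int64_t demo_next_timer_ns(int64_t basemono_ns,
				  unsigned long now_jiffies,
				  unsigned long next_jiffies,
				  bool pending)
{
	if (!pending)
		return DEMO_KTIME_MAX;
	/* Unsigned subtraction stays correct across jiffies wraparound. */
	return basemono_ns +
	       (int64_t)(next_jiffies - now_jiffies) * DEMO_TICK_NSEC;
}

/* With everything in nanoseconds, picking the earliest event is a plain min. */
static int64_t demo_earliest_ns(int64_t a, int64_t b)
{
	return a < b ? a : b;
}
```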

Makes the code more readable and gets rid of the jiffies magic in the
nohz code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Paul E. McKenney 42528795ac Merge branches 'doc.2015.02.26a', 'earlycb.2015.03.03a', 'fixes.2015.03.03a', 'gpexp.2015.02.26a', 'hotplug.2015.03.20a', 'sysidle.2015.02.26b' and 'tiny.2015.02.26a' into HEAD
doc.2015.02.26a:  Documentation changes
earlycb.2015.03.03a:  Permit early-boot RCU callbacks
fixes.2015.03.03a:  Miscellaneous fixes
gpexp.2015.02.26a:  In-kernel expediting of normal grace periods
hotplug.2015.03.20a:  CPU hotplug fixes
sysidle.2015.02.26b:  NO_HZ_FULL_SYSIDLE fixes
tiny.2015.02.26a:  TINY_RCU fixes
2015-03-20 08:31:01 -07:00
Paul E. McKenney 0aa04b055e rcu: Process offlining and onlining only at grace-period start
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events.  Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.

This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures.  When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree.  The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field.  In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared.  If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
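
The buffering can be sketched for a single leaf as follows.  The
struct is a hypothetical stand-in for rcu_node, and the sketch leaves
out propagation up the tree and the ->wait_blkd_tasks special case.

```c
struct demo_leaf {
	unsigned long qsmaskinit;	/* CPUs the current grace period knows about */
	unsigned long qsmaskinitnext;	/* buffered online/offline events */
};

/* Hotplug events touch only the buffered mask. */
static void demo_cpu_online(struct demo_leaf *l, int cpu)
{
	l->qsmaskinitnext |= 1UL << cpu;
}

static void demo_cpu_offline(struct demo_leaf *l, int cpu)
{
	l->qsmaskinitnext &= ~(1UL << cpu);
}

/* The buffered events are applied only when a grace period starts. */
static void demo_gp_start(struct demo_leaf *l)
{
	l->qsmaskinit = l->qsmaskinitnext;
}
```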

This of course means that RCU's notion of which CPUs are offline can be
out of date.  This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started.  In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:37 -07:00
Paul E. McKenney cc99a310ca rcu: Move rcu_report_unblock_qs_rnp() to common code
The rcu_report_unblock_qs_rnp() function is invoked when the
last task blocking the current grace period exits its outermost
RCU read-side critical section.  Previously, this was called only
from rcu_read_unlock_special(), and was therefore defined only when
CONFIG_RCU_PREEMPT=y.  However, this function will be invoked even when
CONFIG_RCU_PREEMPT=n once CPU-hotplug operations are processed only at
the beginnings of RCU grace periods.  The reason for this change is that
the last task on a given leaf rcu_node structure's ->blkd_tasks list
might well exit its RCU read-side critical section between the time that
recent CPU-hotplug operations were applied and when the new grace period
was initialized.  This situation could result in RCU waiting forever on
that leaf rcu_node structure, because if all that structure's CPUs were
already offline, there would be no quiescent-state events to drive that
structure's part of the grace period.

This commit therefore moves rcu_report_unblock_qs_rnp() to common code
that is built unconditionally so that the quiescent-state-forcing code
can clean up after this situation, avoiding the grace-period stall.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:19:36 -07:00
Paul E. McKenney 8eb74b2b29 rcu: Rework preemptible expedited bitmask handling
Currently, the rcu_node tree ->expmask bitmasks are initially set to
reflect the online CPUs.  This is pointless, because only the CPUs
preempted within RCU read-side critical sections by the preceding
synchronize_sched_expedited() need to be tracked.  This commit therefore
instead sets up these bitmasks based on the state of the ->blkd_tasks
lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-12 15:18:42 -07:00
Paul E. McKenney 18c629eaeb rcu: Eliminate empty HOTPLUG_CPU ifdef
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:37 -07:00
Paul E. McKenney c8aead6a9b rcu: Simplify sync_rcu_preempt_exp_init()
This commit eliminates a boolean and associated "if" statement by
rearranging the code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-11 13:22:37 -07:00
Paul E. McKenney a3bd2c09ad rcu: Add boot-up check for non-default CONFIG_RCU_FANOUT_LEAF values
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:31 -08:00
Paul E. McKenney ab6f5bd674 rcu: Use IS_ENABLED() to simplify rcu_bootup_announce_oddness()
This commit gets rid of some inline #ifdefs by replacing them with
IS_ENABLED.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:20 -08:00
Paul E. McKenney d24209bb68 rcu: Improve diagnostics for blocked critical sections in irq
If an RCU read-side critical section occurs within an interrupt handler
or a softirq handler, it cannot have been preempted.  Therefore, there is
a check in rcu_read_unlock_special() checking for this error.  However,
when this check triggers, it lacks diagnostic information.  This commit
therefore moves rcu_read_unlock()'s lockdep annotation to follow the
call to __rcu_read_unlock() and changes rcu_read_unlock_special()'s
WARN_ON_ONCE() to a lockdep_rcu_suspicious() in order to locate where
the offending RCU read-side critical section began.  In addition, the
value of the ->rcu_read_unlock_special field is printed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:16:00 -08:00
Paul E. McKenney 34404ca8fb rcu: Move early-boot callbacks to no-CBs lists for no-CBs CPUs
When a CPU is first determined to be a no-CBs CPU, this commit causes
any early boot callbacks to be moved to the no-CBs callback list,
allowing them to be invoked.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-03-03 11:06:02 -08:00
Paul E. McKenney 5871968d53 rcu: Tighten up affinity and check for sysidle
If the RCU grace-period kthread invoking rcu_sysidle_check_cpu()
happens to be running on the tick_do_timer_cpu initially,
then rcu_bind_gp_kthread() won't bind it.  This kthread might
then migrate before invoking rcu_gp_fqs(), which will trigger the
WARN_ON_ONCE() in rcu_sysidle_check_cpu().  This commit therefore makes
rcu_bind_gp_kthread() do the binding even if the kthread is currently
on the same CPU.  Because this incurs added overhead, this commit also
causes each RCU grace-period kthread to invoke rcu_bind_gp_kthread()
once at boot rather than at the beginning of each grace period.
And as long as rcu_bind_gp_kthread() is being modified, this commit
eliminates its #ifdef.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-02-26 16:04:37 -08:00
Paul E. McKenney 5afff48bdf rcu: Update from rcu_expedited variable to rcu_gp_is_expedited()
This commit updates open-coded tests of the rcu_expedited variable
to instead use rcu_gp_is_expedited().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-02-26 12:03:01 -08:00
Paul E. McKenney 59f792d1ef rcu: Refine diagnostics for lacking kthread for no-CBs callbacks
Some diagnostics under CONFIG_PROVE_RCU in rcu_nocb_cpu_needs_barrier()
assume that there can be no early-boot callbacks.  This commit therefore
qualifies the diagnostic with rcu_scheduler_fully_active to permit
early boot callbacks to avoid this splat.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-02-26 12:01:29 -08:00
Linus Torvalds 3f4d9925e9 Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rcu fix and x86 irq fix from Ingo Molnar:

 - Fix a bug that caused an RCU warning splat.

 - Two x86 irq related fixes: a hotplug crash fix and an ACPI IRQ
   registry fix.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Clear need_qs flag to prevent splat

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable()
  x86/irq: Fix regression caused by commit b568b8601f
2015-02-21 10:36:06 -08:00
Tejun Heo ad853b48cb rcu: use %*pb[l] to print bitmaps including cpumasks and nodemasks
printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:37 -08:00
Paul E. McKenney c0135d07b0 rcu: Clear need_qs flag to prevent splat
If the scheduling-clock interrupt sets the current task's need_qs flag,
but if the current CPU passes through a quiescent state in the meantime,
then rcu_preempt_qs() will fail to clear the need_qs flag, which can fool
RCU into thinking that additional rcu_read_unlock_special() processing
is needed.  This commit therefore clears the need_qs flag before checking
for additional processing.

For this problem to occur, we need rcu_preempt_data.passed_quiesce equal
to true and current->rcu_read_unlock_special.b.need_qs also equal to true.
This condition can occur as follows:

1.	CPU 0 is aware of the current preemptible RCU grace period,
	but has not yet passed through a quiescent state.  Among other
	things, this means that rcu_preempt_data.passed_quiesce is false.

2.	Task A running on CPU 0 enters a preemptible RCU read-side
	critical section.

3.	CPU 0 takes a scheduling-clock interrupt, which notices the
	RCU read-side critical section and the need for a quiescent state,
	and thus sets current->rcu_read_unlock_special.b.need_qs to true.

4.	Task A is preempted, enters the scheduler, eventually invoking
	rcu_preempt_note_context_switch() which in turn invokes
	rcu_preempt_qs().

	Because rcu_preempt_data.passed_quiesce is false,
	control enters the body of the "if" statement, which sets
	rcu_preempt_data.passed_quiesce to true.

5.	At this point, CPU 0 takes an interrupt.  The interrupt
	handler contains an RCU read-side critical section, and
	the rcu_read_unlock() notes that current->rcu_read_unlock_special
	is nonzero, and thus invokes rcu_read_unlock_special().

6.	Once in rcu_read_unlock_special(), the fact that
	current->rcu_read_unlock_special.b.need_qs is true becomes
	apparent, so rcu_read_unlock_special() invokes rcu_preempt_qs().
	Recursively, given that we interrupted out of that same
	function in the preceding step.

7.	Because rcu_preempt_data.passed_quiesce is now true,
	rcu_preempt_qs() does nothing, and simply returns.

8.	Upon return to rcu_read_unlock_special(), it is noted that
	current->rcu_read_unlock_special is still nonzero (because
	the interrupted rcu_preempt_qs() had not yet gotten around
	to clearing current->rcu_read_unlock_special.b.need_qs).

9.	Execution proceeds to the WARN_ON_ONCE(), which notes that
	we are in an interrupt handler and thus duly splats.

The solution, as noted above, is to make rcu_read_unlock_special()
clear out current->rcu_read_unlock_special.b.need_qs after calling
rcu_preempt_qs().  The interrupted rcu_preempt_qs() will clear it again,
but this is harmless.  The worst that happens is that we clobber another
attempt to set this field, but this is not a problem because we just
got done reporting a quiescent state.
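
The sequence above can be replayed in a single-threaded user-space
model, where the "interrupt" is a callback invoked at the point where
the real interrupt would land.  All demo_ names are hypothetical, and
the model collapses ->rcu_read_unlock_special to the need_qs flag alone.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_state {
	bool passed_quiesce;	/* models rcu_preempt_data.passed_quiesce */
	bool need_qs;		/* models ...rcu_read_unlock_special.b.need_qs */
	bool apply_fix;		/* clear need_qs after rcu_preempt_qs()? */
	bool splat;		/* did the WARN_ON_ONCE() fire? */
};

static void demo_preempt_qs(struct demo_state *s,
			    void (*irq)(struct demo_state *));

/* Models rcu_read_unlock_special() running in the interrupt handler. */
static void demo_unlock_special_in_irq(struct demo_state *s)
{
	if (s->need_qs) {
		demo_preempt_qs(s, NULL);	/* steps 6 and 7: does nothing */
		if (s->apply_fix)
			s->need_qs = false;	/* the fix */
		if (!s->need_qs)
			return;
	}
	s->splat = true;			/* step 9 */
}

/* Models rcu_preempt_qs(), with an optional interrupt mid-function. */
static void demo_preempt_qs(struct demo_state *s,
			    void (*irq)(struct demo_state *))
{
	if (!s->passed_quiesce) {
		s->passed_quiesce = true;	/* step 4 */
		if (irq)
			irq(s);			/* step 5: interrupt lands here */
		s->need_qs = false;
	}
}

/* Replays steps 3 through 9 and reports whether the splat occurred. */
static bool demo_run(bool apply_fix)
{
	struct demo_state s = { false, true, apply_fix, false };

	demo_preempt_qs(&s, demo_unlock_special_in_irq);
	return s.splat;
}
```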

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix embarrassing build bug noted by Sasha Levin. ]
Tested-by: Sasha Levin <sasha.levin@oracle.com>
2015-02-11 15:46:43 -08:00
Paul E. McKenney 78e691f4ae Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD
doc.2015.01.07a: Documentation updates.
fixes.2015.01.15a: Miscellaneous fixes.
preempt.2015.01.06a: Changes to handling of lists of preempted tasks.
srcu.2015.01.06a: SRCU updates.
stall.2015.01.16a: RCU CPU stall-warning updates and fixes.
torture.2015.01.11a: RCU torture-test updates and fixes.
2015-01-15 23:34:34 -08:00
Paul E. McKenney a94844b22a rcu: Optionally run grace-period kthreads at real-time priority
Recent testing has shown that under heavy load, running RCU's grace-period
kthreads at real-time priority can improve performance (according to 0day
test robot) and reduce the incidence of RCU CPU stall warnings.  However,
most systems do just fine with the default non-realtime priorities for
these kthreads, and it does not make sense to expose the entire user
base to any risk stemming from this change, given that this change is
of use only to a few users running extremely heavy workloads.

Therefore, this commit allows users to specify realtime priorities
for the grace-period kthreads, but leaves them running SCHED_OTHER
by default.  The realtime priority may be specified at build time
via the RCU_KTHREAD_PRIO Kconfig parameter, or at boot time via the
rcutree.kthread_prio parameter.  Either way, 0 says to continue the
default SCHED_OTHER behavior and values from 1-99 specify that priority
of SCHED_FIFO behavior.  Note that a value of 0 is not permitted when
the RCU_BOOST Kconfig parameter is specified.
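
The mapping from parameter value to scheduling class can be sketched
as below.  The names are hypothetical, and rejecting an invalid value
with an error is an assumption of this sketch, not a statement about
how the kernel reacts.

```c
#include <stdbool.h>

enum demo_policy { DEMO_SCHED_OTHER, DEMO_SCHED_FIFO };

struct demo_sched {
	enum demo_policy policy;
	int priority;
};

/* Returns 0 on success, -1 if the value is not permitted. */
static int demo_kthread_sched(int kthread_prio, bool rcu_boost,
			      struct demo_sched *out)
{
	if (kthread_prio < 0 || kthread_prio > 99)
		return -1;
	if (kthread_prio == 0 && rcu_boost)
		return -1;	/* 0 is not permitted with RCU_BOOST */
	/* 0 keeps the default class, 1-99 select that SCHED_FIFO priority. */
	out->policy = kthread_prio ? DEMO_SCHED_FIFO : DEMO_SCHED_OTHER;
	out->priority = kthread_prio;
	return 0;
}
```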

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-15 23:25:04 -08:00
Paul E. McKenney 917963d0b3 rcutorture: Check from beginning to end of grace period
Currently, rcutorture's Reader Batch checks measure from the end of
the previous grace period to the end of the current one.  This commit
tightens up these checks by measuring from the start to the end of the same
grace period.  This involves adding rcu_batches_started() and friends
corresponding to the existing rcu_batches_completed() and friends.

We leave SRCU alone for the moment, as it does not yet have a way of
tracking both ends of its grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-10 19:08:02 -08:00
Paul E. McKenney 9733e4f0a9 rcu: Make _batches_completed() functions return unsigned long
Long ago, the various ->completed fields were of type long, but now are
unsigned long due to signed-integer-overflow concerns.  However, the
various _batches_completed() functions remained of type long, even though
their only purpose in life is to return the corresponding ->completed
field.  This patch cleans this up by changing these functions' return
types to unsigned long.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-10 19:07:56 -08:00
Paul E. McKenney e3663b1024 rcu: Handle gpnum/completed wrap while dyntick idle
Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap.  For
example, consider the following sequence of events:

o	CPU 1 encounters a quiescent state while waiting for grace period
	5 to complete, but then enters dyntick-idle mode.

o	While CPU 1 is in dyntick-idle mode, the grace-period counters
	wrap around so that the grace period number is now 4.

o	Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
	and grace period 5 begins.

o	The quiescent state that CPU 1 passed through during the old
	grace period 5 looks like it applies to the new grace period
	5.  Therefore, the new grace period 5 completes without CPU 1
	having passed through a quiescent state.

This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time.  At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems.  However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible.  It is therefore time to
close this race.

This commit therefore avoids this race condition by having the
quiescent-state forcing code detect when a CPU is falling too far
behind, and setting a new rcu_data field ->gpwrap when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.
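
A sketch of such a "falling too far behind" check, using a
wrap-tolerant unsigned comparison.  The quarter-of-the-counter-space
threshold and the demo_ names are assumptions made for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Wrap-tolerant "a < b" for free-running unsigned long counters. */
#define DEMO_ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

struct demo_rdp {
	unsigned long gpnum;	/* last grace period this CPU noticed */
	bool gpwrap;		/* set when gpnum can no longer be trusted */
};

/*
 * Invoked from the quiescent-state forcing scan: if the CPU's snapshot
 * lags the rcu_node counter by more than a quarter of the counter
 * space, mark its counters as untrustworthy.
 */
static void demo_check_gpwrap(struct demo_rdp *rdp, unsigned long node_gpnum)
{
	if (DEMO_ULONG_CMP_LT(rdp->gpnum + ULONG_MAX / 4, node_gpnum))
		rdp->gpwrap = true;
}
```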

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:05:28 -08:00
Paul E. McKenney fc908ed33e rcu: Make RCU_CPU_STALL_INFO include number of fqs attempts
One way that an RCU CPU stall warning can happen is if the grace-period
kthread is not allowed to execute.  One proxy for this kthread's
forward progress is the number of force-quiescent-state (fqs) scans.
This commit therefore adds the number of fqs scans to the RCU CPU stall
warning printouts when CONFIG_RCU_CPU_STALL_INFO=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:05:25 -08:00
Lai Jiangshan abaf3f9d27 rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid priority-inversion
The patch dfeb9765ce ("Allow post-unlock reference for rt_mutex")
made rcu-boost safe even when the rt_mutex has a post-unlock reference.

But rt_mutex allowing post-unlock reference is definitely a bug and it was
fixed by the commit 27e35715df ("rtmutex: Plug slow unlock race").
This fix made the previous patch (dfeb9765ce) useless.

And even worse, the priority inversion introduced by the previous
patch still exists.

rcu_read_unlock_special() {
	rt_mutex_unlock(&rnp->boost_mtx);
	/* Priority inversion:
	 * the current task has been deboosted and may be preempted
	 * immediately as a low-priority task.  It could then wait a long
	 * time before being scheduled in again, while the rcu-booster
	 * waits on this low-priority task and sleeps.  This priority
	 * inversion prevents the rcu-booster from working as expected.
	 */
	complete(&rnp->boost_completion);
}

Just revert the patch to avoid it.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:52 -08:00
Paul E. McKenney 5d0b024973 rcu: Don't bother affinitying rcub kthreads away from offline CPUs
When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
rcu_node structure are now offline, it affinities the corresponding
RCU-boost ("rcub") kthread away from those CPUs.  This is pointless
because the kthread cannot run on those offline CPUs in any case.
This commit therefore removes this unneeded code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:50 -08:00
Paul E. McKenney 3e9f5c70d8 rcu: Don't spawn rcub kthreads on root rcu_node structure
Now that offlining CPUs no longer moves leaf rcu_node structures'
->blkd_tasks lists to the root, there is no way for the root rcu_node
structure's ->blkd_task list to be nonempty, unless the root node is also
the sole leaf node.  This commit therefore refrains from creating an rcub
kthread for the root rcu_node structure unless it is also the sole leaf.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:47 -08:00
Paul E. McKenney 96e92021d4 rcu: Make use of rcu_preempt_has_tasks()
Given that there is now an rcu_preempt_has_tasks() function that checks
to see if the ->blkd_tasks list is non-empty, this commit makes use of it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:46 -08:00
Paul E. McKenney d19fb8d1f3 rcu: Don't migrate blocked tasks even if all corresponding CPUs offline
When the last CPU associated with a given leaf rcu_node structure
goes offline, something must be done about the tasks queued on that
rcu_node structure.  Each of these tasks has been preempted on one of
the leaf rcu_node structure's CPUs while in an RCU read-side critical
section that it has not yet exited.  Handling these tasks is the job of
rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
structure to the root rcu_node structure.

Unfortunately, this migration has to be done one task at a time because
each task's allegiance must be shifted from the original leaf rcu_node to
the root, so that future attempts to deal with these tasks will acquire
the root rcu_node structure's ->lock rather than that of the leaf.
Worse yet, this migration must be done with interrupts disabled, which
is not so good for realtime response, especially given that there is
no bound on the number of tasks on a given rcu_node structure's list.
(OK, OK, there is a bound, it is just that it is unreasonably large,
especially on 64-bit systems.)  This was not considered a problem back
when rcu_preempt_offline_tasks() was first written because realtime
systems were assumed not to do CPU-hotplug operations while real-time
applications were running.  This assumption has proved of dubious validity
given that people are starting to run multiple realtime applications
on a single SMP system and that it is common practice to offline then
online a CPU before starting its real-time application in order to clear
extraneous processing off of that CPU.  So we now need CPU hotplug
operations to avoid undue latencies.

This commit therefore avoids migrating these tasks, instead letting
them be dequeued one by one from the original leaf rcu_node structure
by rcu_read_unlock_special().  This means that the clearing of bits
from the upper-level rcu_node structures must be deferred until the
last such task has been dequeued, because otherwise subsequent grace
periods won't wait on them.  This commit has the beneficial side effect
of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
CONFIG_RCU_BOOST builds.
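
The deferred clearing can be sketched for one leaf as follows.  The
struct and helpers are hypothetical and reduce the upper levels of the
rcu_node tree to a single flag.

```c
#include <stdbool.h>

struct demo_rnp_leaf {
	unsigned long online_cpus;	/* one bit per online CPU */
	int blkd_tasks;			/* tasks preempted within a read-side critical section */
	bool parent_bit_set;		/* this leaf's bit in its parent's mask */
};

/* The parent's bit may be cleared only when nothing is left to wait on. */
static void demo_update_parent(struct demo_rnp_leaf *l)
{
	if (l->online_cpus == 0 && l->blkd_tasks == 0)
		l->parent_bit_set = false;
}

/* CPU goes offline: blocked tasks stay queued on the leaf. */
static void demo_offline_cpu(struct demo_rnp_leaf *l, int cpu)
{
	l->online_cpus &= ~(1UL << cpu);
	demo_update_parent(l);
}

/* A task leaves its critical section and dequeues itself from the leaf. */
static void demo_task_unblock(struct demo_rnp_leaf *l)
{
	if (l->blkd_tasks > 0)
		l->blkd_tasks--;
	demo_update_parent(l);
}
```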

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:44 -08:00
Paul E. McKenney b6a932d1d9 rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearing
This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
bit clearing up the rcu_node tree once a given rcu_node structure's
->blkd_tasks list becomes empty.  This is the final commit in preparation
for the rework of RCU priority boosting:  It enables preempted tasks to
remain queued on their rcu_node structure even after all of that rcu_node
structure's CPUs have gone offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:43 -08:00
Paul E. McKenney 8af3a5e78c rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
in preparation for the rework of RCU priority boosting.  This new function
will be invoked from rcu_read_unlock_special() in the reworked scheme,
which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
structure's ->qsmaskinit field has already been updated.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:41 -08:00
Paul E. McKenney 74e871ac6c rcu: Rename "empty" to "empty_norm" in preparation for boost rework
This commit undertakes a simple variable renaming to make way for
some rework of RCU priority boosting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:40 -08:00
Paul E. McKenney b08ea27d95 rcu: Protect rcu_boost() lockless accesses with ACCESS_ONCE()
This commit prevents random compiler optimizations by applying
ACCESS_ONCE() to lockless accesses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:02:39 -08:00
Paul E. McKenney 41050a0096 rcu: Fix rcu_barrier() race that could result in too-short wait
The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU.  However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another.  It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before the
corresponding callback is posted and decremented after it is invoked.
Then rcu_barrier() checks this single count to reliably determine
whether the corresponding CPU has no-CBs callbacks.
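
As a rough user-space illustration of the single-count idea (not the
kernel code), the sketch below counts a callback before it is posted and
drops the count only after it has been invoked, so the count cannot read
zero while a callback is in transit between lists.  The struct and
function names are hypothetical, and C11 atomics stand in for the
kernel's atomic operations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one no-CBs CPU's callback bookkeeping. */
struct nocb_cpu_sketch {
	atomic_long n_cbs;	/* callbacks posted but not yet invoked */
};

/* Count the callback before it becomes visible on any list. */
static void nocb_post(struct nocb_cpu_sketch *cpu)
{
	atomic_fetch_add(&cpu->n_cbs, 1);
	/* ... the callback would be enqueued here ... */
}

/* Drop the count only after the callback function has returned. */
static void nocb_invoked(struct nocb_cpu_sketch *cpu)
{
	/* ... the callback would have been invoked here ... */
	atomic_fetch_sub(&cpu->n_cbs, 1);
}

/* The question a barrier-style check asks: must this CPU be waited on? */
static bool nocb_has_cbs(struct nocb_cpu_sketch *cpu)
{
	return atomic_load(&cpu->n_cbs) != 0;
}
```

Because the count brackets the callback's whole lifetime, moving the
callback from one list to another never changes what nocb_has_cbs()
reports.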

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-01-06 11:01:15 -08:00
Paul E. McKenney 9ea6c58856 Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD
cpu.2014.11.03a: Changes for per-CPU variables.
doc.2014.11.13a: Documentation updates.
fixes.2014.11.13a: Miscellaneous fixes.
signal.2014.10.29a: Signal changes.
rt.2014.10.29a: Real-time changes.
torture.2014.11.03a: torture-test changes.
2014-11-13 10:39:04 -08:00
Paul E. McKenney 8fa7845df5 rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
The "cpu" argument to rcu_cleanup_after_idle() is always the current
CPU, so drop it.  This moves the smp_processor_id() from the caller to
rcu_cleanup_after_idle(), saving argument-passing overhead.  Again,
the anticipated cross-CPU uses of these functions have been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:56 -08:00
Paul E. McKenney 198bbf8127 rcu: Remove "cpu" argument to rcu_prepare_for_idle()
The "cpu" argument to rcu_prepare_for_idle() is always the current
CPU, so drop it.  This in turn allows two of the uses of "cpu" in
this function to be replaced with a this_cpu_ptr() and the third by
smp_processor_id(), replacing that of the call to rcu_prepare_for_idle().
Again, the anticipated cross-CPU uses of these functions have been replaced
by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:49 -08:00
Paul E. McKenney aa6da5140b rcu: Remove "cpu" argument to rcu_needs_cpu()
The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it.  This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr().  Again, the anticipated cross-CPU uses
of these functions have been replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:43 -08:00
Paul E. McKenney 38200cf247 rcu: Remove "cpu" argument to rcu_note_context_switch()
The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it.  This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions have been
replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:34 -08:00
Paul E. McKenney 86aea0e6e7 rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
Because rcu_preempt_check_callbacks()'s argument is guaranteed to
always be the current CPU, drop the argument and replace per_cpu()
with __this_cpu_read().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:26 -08:00
Christoph Lameter 28ced795cb rcu: Remove rcu_dynticks * parameters when they are always this_cpu_ptr(&rcu_dynticks)
For some functions in kernel/rcu/tree*, the rdtp parameter is always
this_cpu_ptr(&rcu_dynticks).  Remove the parameter where it is constant
and calculate the pointer in the function instead.

This makes it obvious that the addresses are all per-CPU offsets, which
in turn will enable the use of this_cpu_ops in the future.

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Forward-ported to rcu/dev, whitespace adjustment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:19:26 -08:00
Paul E. McKenney bbe5d7a93a rcu: Fix for rcuo online-time-creation reorganization bug
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
contains checks for the case where CPUs are brought online out of
order, re-wiring the rcuo leader-follower relationships as needed.
Unfortunately, this rewiring was broken.  This apparently went undetected
due to the tendency of systems to bring CPUs online in order.  This commit
nevertheless fixes the rewiring.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:20:08 -07:00
Pranith Kumar 28f6569ab7 rcu: Remove redundant TREE_PREEMPT_RCU config option
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:20:05 -07:00
Clark Williams 21871d7eff rcu: Unify boost and kthread priorities
Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this
value for both the per-CPU kthreads (rcuc/N) and the rcu boosting
threads (rcub/n).

Also, create the module parameter rcutree.kthread_prio, which can be used on
the kernel command line at boot to set a new value (rcutree.kthread_prio=N).

Signed-off-by: Clark Williams <clark.williams@gmail.com>
[ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-29 10:19:41 -07:00
Paul E. McKenney 61cfd0970e rcu: Move RCU_BOOST variable declarations, eliminating #ifdef
There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c.  This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-28 13:49:28 -07:00
Paul E. McKenney 0eafa46823 rcu: Remove CONFIG_RCU_CPU_STALL_VERBOSE
The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible
RCU's CPU stall warnings to dump out any preempted tasks that are blocking
the current RCU grace period.  This information is useful, and the default
has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years.  It is therefore
time for this commit to remove this Kconfig parameter, so that future
kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-28 13:48:13 -07:00
Paul E. McKenney d7e2993396 rcu: Make rcu_barrier() understand about missing rcuo kthreads
Commit 35ce7f29a4 (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online.  This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a4, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix.  The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks.  It is therefore required to wait even for those callbacks
that cannot possibly be invoked.  Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), it is tempting to report an error
in this case.  Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
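
The per-CPU decision described above can be summarized in a small sketch.
This is only an illustration under stated assumptions: the enum and
function names are hypothetical, and CPUs that do have an rcuo kthread
are simply reported as "handled as before", since their treatment is not
changed by the description above.

```c
#include <stdbool.h>

enum barrier_action_sketch {
	BARRIER_AS_BEFORE,	/* CPU has an rcuo kthread: no change in handling */
	BARRIER_SKIP,		/* no kthread, no callbacks: post nothing */
	BARRIER_WARN_AND_WAIT,	/* no kthread, but callbacks: complain, then wait */
};

/* Hypothetical helper classifying one no-CBs CPU for a barrier pass. */
static enum barrier_action_sketch
barrier_action_for(bool has_rcuo_kthread, bool has_pending_cbs)
{
	if (has_rcuo_kthread)
		return BARRIER_AS_BEFORE;
	if (!has_pending_cbs)
		return BARRIER_SKIP;
	/* Nothing can invoke these callbacks, so the wait may never end. */
	return BARRIER_WARN_AND_WAIT;
}
```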

Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
2014-10-28 13:24:13 -07:00
Paul E. McKenney dd56af42bd rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
Currently, the expedited grace-period primitives do get_online_cpus().
This greatly simplifies their implementation, but means that calls
to them holding locks that are acquired by CPU-hotplug notifiers (to
say nothing of calls to these primitives from CPU-hotplug notifiers)
can deadlock.  But this is starting to become inconvenient, as can be
seen here: https://lkml.org/lkml/2014/8/5/754.  The problem in this
case is that some developers need to acquire a mutex from a CPU-hotplug
notifier, but also need to hold it across a synchronize_rcu_expedited().
As noted above, this currently results in deadlock.

This commit avoids the deadlock and retains the simplicity by creating
a try_get_online_cpus(), which returns false if the get_online_cpus()
reference count could not immediately be incremented.  If a call to
try_get_online_cpus() returns true, the expedited primitives operate as
before.  If a call returns false, the expedited primitives fall back to
normal grace-period operations.  This falling back of course results in
increased grace-period latency, but only during times when CPU hotplug
operations are actually in flight.  The effect should therefore be
negligible during normal operation.
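
The try-then-fall-back shape of this change can be sketched in user-space
C as follows.  Everything here is a hypothetical, single-threaded
stand-in: the flag, the reference count, and the function names are
assumptions made for illustration and are not the kernel's
implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for CPU-hotplug state (single-threaded sketch). */
static bool hotplug_in_flight;
static int online_refs;

/* Take a reference only if that can be done without waiting. */
static bool try_get_online_sketch(void)
{
	if (hotplug_in_flight)
		return false;
	online_refs++;
	return true;
}

static void put_online_sketch(void)
{
	online_refs--;
}

enum gp_kind_sketch { GP_EXPEDITED, GP_NORMAL };

/* Use the fast path when possible; otherwise fall back to the slow one. */
static enum gp_kind_sketch synchronize_sketch(void)
{
	if (!try_get_online_sketch())
		return GP_NORMAL;	/* slower, but never blocks on hotplug */
	/* ... expedited grace-period work would run here ... */
	put_online_sketch();
	return GP_EXPEDITED;
}
```

The point of the pattern is that the caller never blocks waiting for the
hotplug operation, so holding a lock that a hotplug notifier also needs
can no longer close a deadlock cycle.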

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Tested-by: Lan Tianyu <tianyu.lan@intel.com>
2014-09-18 16:22:27 -07:00
Paul E. McKenney 96b4672703 Merge branch 'rcu-tasks.2014.09.10a' into HEAD
rcu-tasks.2014.09.10a: Add RCU-tasks flavor of RCU.
2014-09-16 10:10:44 -07:00
Paul E. McKenney e98d06dd6c Merge branches 'doc.2014.09.07a', 'fixes.2014.09.10a', 'nocb-nohz.2014.09.16b' and 'torture.2014.09.07a' into HEAD
doc.2014.09.07a: Documentation updates.
fixes.2014.09.10a: Miscellaneous fixes.
nocb-nohz.2014.09.16b: No-CBs CPUs and NO_HZ_FULL updates.
torture.2014.09.07a: Torture-test updates.
2014-09-16 10:08:34 -07:00
Paul E. McKenney c847f14217 rcu: Avoid misordering in nocb_leader_wait()
The NOCB follower wakeup ordering depends on the store to the tail
pointer happening before the wakeup.  However, because atomic_long_add()
does not return a value, it does not provide ordering guarantees, and
the locking in wake_up() only guarantees that the store will happen
before the unlock, which might be too late.  Even though this is only a
theoretical issue, this commit adds a smp_mb__after_atomic() after the
final atomic_long_add() to provide the needed ordering guarantee.
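
A loose C11 analogy of the fix, for readers more familiar with user-space
atomics: a relaxed read-modify-write provides no ordering by itself, so a
full fence is placed after it and before the store that the waker relies
on.  The struct and function names are hypothetical, and
atomic_thread_fence() is only an approximation of the kernel's
smp_mb__after_atomic().

```c
#include <stdatomic.h>

/* Hypothetical queue state, for illustration only. */
struct nocb_queue_sketch {
	atomic_long q_count;
	atomic_int wake_pending;
};

static void enqueue_then_signal(struct nocb_queue_sketch *q)
{
	/* Relaxed RMW: like a non-value-returning atomic add, it orders nothing. */
	atomic_fetch_add_explicit(&q->q_count, 1, memory_order_relaxed);

	/* Full fence: the update above is ordered before the store below. */
	atomic_thread_fence(memory_order_seq_cst);

	atomic_store_explicit(&q->wake_pending, 1, memory_order_relaxed);
}
```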

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 1772947bd0 rcu: Handle NOCB callbacks from irq-disabled idle code
If an RCU callback is queued on a no-CBs CPU from idle code with irqs
disabled, and if that CPU stays idle forever after, the callback will
never be invoked.  This commit therefore adds a check for this situation
in ____call_rcu_nocb(), invoking the RCU core solely for the purpose
of the ensuing return-to-idle transition.  (If the CPU doesn't return
to idle, the next scheduling-clock interrupt will fix things up.)

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 39953dfd40 rcu: Avoid misordering in __call_rcu_nocb_enqueue()
The NOCB leader wakeup ordering depends on the store to the header
happening before the check for the leader already being awake.  However,
because atomic_long_add() does not return a value, it does not provide
ordering guarantees, the incorrect comment in wake_nocb_leader()
notwithstanding.  This commit therefore adds a smp_mb__after_atomic()
after the final atomic_long_add() to provide the needed ordering
guarantee.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:03 -07:00
Paul E. McKenney 663e131090 rcu: Don't track sysidle state if no nohz_full= CPUs
If there are no nohz_full= CPUs, then there is currently no reason to
track sysidle state.  This commit therefore short-circuits this state
tracking if !tick_nohz_full_enabled().

Note that these checks will need to be revisited if nohz_full= state
can ever be changed at runtime.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 417e8d2655 rcu: Eliminate redundant rcu_sysidle_state variable
Now that we have rcu_state_p, which references rcu_preempt_state for
TREE_PREEMPT_RCU and rcu_sched_state for TREE_RCU, we don't need a
separate rcu_sysidle_state variable.  This commit therefore eliminates
rcu_sysidle_state in favor of rcu_state_p.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Pranith Kumar 22c2f66961 rcu: Check for have_rcu_nocb_mask instead of rcu_nocb_mask
If we configure a kernel with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_NONE=y
and CONFIG_CPUMASK_OFFSTACK=n and do not pass in an rcu_nocbs= boot parameter,
the cpumask rcu_nocb_mask can be garbage instead of NULL.

Hence this commit replaces checks for rcu_nocb_mask == NULL with a check for
have_rcu_nocb_mask.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 35ce7f29a4 rcu: Create rcuo kthreads only for onlined CPUs
RCU currently uses for_each_possible_cpu() to spawn rcuo kthreads,
which can result in more rcuo kthreads than one would expect.  For
example, derRichard reported 64 CPUs worth of rcuo kthreads on an
8-CPU image.  This commit therefore creates rcuo kthreads only for
those CPUs that actually come online.

This was reported by derRichard on the OFTC IRC network.

Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:02 -07:00
Paul E. McKenney 9386c0b75d rcu: Rationalize kthread spawning
Currently, RCU spawns kthreads from several different early_initcall()
functions.  Although this has served RCU well for quite some time,
as more kthreads are added a more deterministic approach is required.
This commit therefore causes all of RCU's early-boot kthreads to be
spawned from a single early_initcall() function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar f4aa84ba24 rcu: Return false instead of 0 in rcu_nocb_adopt_orphan_cbs()
Return false instead of 0 in rcu_nocb_adopt_orphan_cbs() as this has
bool as return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar 4afc7e269b rcu: Use false for return in __call_rcu_nocb()
Return false instead of 0 in __call_rcu_nocb() as this has bool as
return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:01 -07:00
Pranith Kumar 0a9e1e111b rcu: Use true/false for return in rcu_nocb_adopt_orphan_cbs()
Return true/false in rcu_nocb_adopt_orphan_cbs() instead of 0/1 as
this function has return type of bool.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar c271d3a957 rcu: Use true/false for return in __call_rcu_nocb()
Return true/false instead of 0/1 in __call_rcu_nocb() as this returns a
bool type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Pranith Kumar 949cccdbe6 rcu: Check the return value of zalloc_cpumask_var()
This commit checks the return value of the zalloc_cpumask_var() used for
allocating cpumask for rcu_nocb_mask.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:08:00 -07:00
Paul E. McKenney f4579fc57c rcu: Fix attempt to avoid unsolicited offloading of callbacks
Commit b58cc46c5f (rcu: Don't offload callbacks unless specifically
requested) failed to adjust the callback lists of the CPUs that are
known to be no-CBs CPUs only because they are also nohz_full= CPUs.
This failure can result in callbacks that are posted during early boot
getting stranded on nxtlist for CPUs whose no-CBs property becomes
apparent late, and there can also be spurious warnings about offline
CPUs posting callbacks.

This commit fixes these problems by adding an early-boot rcu_init_nohz()
that properly initializes the no-CBs CPUs.

Note that kernels built with CONFIG_RCU_NOCB_CPU_ALL=y or with
CONFIG_RCU_NOCB_CPU=n do not exhibit this bug.  Neither do kernels
booted without the nohz_full= boot parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2014-09-16 10:07:59 -07:00
Paul E. McKenney 284a8c93af rcu: Per-CPU operation cleanups to rcu_*_qs() functions
The rcu_bh_qs(), rcu_preempt_qs(), and rcu_sched_qs() functions use
old-style per-CPU variable access and write to ->passed_quiesce even
if it is already set.  This commit therefore updates to use the new-style
per-CPU variable access functions and avoids the spurious writes.
This commit also eliminates the "cpu" argument to these functions because
they are always invoked on the indicated CPU.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:35 -07:00
Paul E. McKenney 1d082fd061 rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch()
The rcu_preempt_note_context_switch() function is on a scheduling fast
path, so it would be good to avoid disabling irqs.  The reason that irqs
are disabled is to synchronize process-level and irq-handler access to
the task_struct ->rcu_read_unlock_special bitmask.  This commit therefore
makes ->rcu_read_unlock_special instead be a union of bools with a short
allowing single-access checks in RCU's __rcu_read_unlock().  This results
in the process-level and irq-handler accesses being simple loads and
stores, so that irqs need no longer be disabled.  This commit therefore
removes the irq disabling from rcu_preempt_note_context_switch().
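
The union-of-bools idea can be sketched as below.  The type and field
names are illustrative assumptions, and the sketch assumes that two bools
occupy exactly the same storage as one short, which the static assertion
checks at compile time.

```c
#include <stdbool.h>

/* Illustrative layout: individual flags plus one overlay covering them all. */
union special_sketch {
	struct {
		bool blocked;
		bool need_qs;
	} b;
	short s;	/* nonzero if any flag in b is set */
};

_Static_assert(sizeof(((union special_sketch *)0)->b) == sizeof(short),
	       "flags must be exactly covered by the short");

/* One load answers "is anything special pending?" for all flags at once. */
static bool any_special(const union special_sketch *u)
{
	return u->s != 0;
}
```

Each flag is written with a plain single-byte store, and the fast path
needs only one load of the short, which is what makes the irq disabling
unnecessary in this scheme.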

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:34 -07:00
Paul E. McKenney 176f8f7a52 rcu: Make TASKS_RCU handle nohz_full= CPUs
Currently TASKS_RCU would ignore a CPU running a task in nohz_full=
usermode execution.  There would be neither a context switch nor a
scheduling-clock interrupt to tell TASKS_RCU that the task in question
had passed through a quiescent state.  The grace period would therefore
extend indefinitely.  This commit therefore makes RCU's dyntick-idle
subsystem record the task_struct structure of the task that is running
in dyntick-idle mode on each CPU.  The TASKS_RCU grace period can
then access this information and record a quiescent state on
behalf of any CPU running in dyntick-idle usermode.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:30 -07:00
Paul E. McKenney bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Paul E. McKenney 73a860cd58 rcu: Replace flush_signals() with WARN_ON(signal_pending())
Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible.  Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:20 -07:00
Paul E. McKenney 9fdd3bc900 rcu: Break more call_rcu() deadlock involving scheduler and perf
Commit 96d3fd0d31 (rcu: Break call_rcu() deadlock involving scheduler
and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake
the rcuo kthread due to the queue being initially empty, but did not
do anything for the case where the queue was overflowing.  This commit
therefore also defers wakeup for the overflow case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:17 -07:00
Pranith Kumar d0bc90fd37 rcu: Return bool type for rcu_try_advance_all_cbs()
Return false instead of 0 in rcu_try_advance_all_cbs(), which has a bool
return type.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:10 -07:00
Pranith Kumar bf33eb1aef rcu: Fix sparse warning about rcu_batches_completed_preempt() being non-static
Fix sparse warning about rcu_batches_completed_preempt() being non-static by
marking it as static.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:08 -07:00
Pranith Kumar 4de376a1b1 rcu: Remove remaining read-modify-write ACCESS_ONCE() calls
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE()
either does a load or a store, but not both.
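
As a hedged illustration of the load-or-store rule, the sketch below
splits an increment into one volatile load followed by one volatile
store.  ACCESS_ONCE_ULONG() is a simplified, fixed-type stand-in written
for this example (the kernel macro is type-generic), and counter_sketch
is a hypothetical variable.

```c
/* Simplified stand-in for the kernel macro, fixed to unsigned long. */
#define ACCESS_ONCE_ULONG(x) (*(volatile unsigned long *)&(x))

static unsigned long counter_sketch;

static void bump_counter_sketch(void)
{
	/* One volatile load ... */
	unsigned long old = ACCESS_ONCE_ULONG(counter_sketch);

	/* ... then one volatile store, rather than ACCESS_ONCE_ULONG(x)++. */
	ACCESS_ONCE_ULONG(counter_sketch) = old + 1;
}
```

Note that this does not make the increment atomic.  It only makes each
individual access explicit, which is the property described above.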

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:07 -07:00
Pranith Kumar 11ed7f934c rcu: Make nocb leader kthreads process pending callbacks after spawning
The nocb callbacks generated before the nocb kthreads are spawned are
enqueued in the nocb queue for later processing. Commit fbce7497ee ("rcu:
Parallelize and economize NOCB kthread wakeups") introduced nocb leader kthreads
which checked the nocb_leader_wake flag to see if there were any such pending
callbacks. A case was reported in which newly spawned leader kthreads were not
processing the pending callbacks as this flag was not set, which led to a boot
hang.

This commit ensures that the newly spawned nocb kthreads process the
pending callbacks by allowing the kthreads to run immediately after spawning
instead of waiting. This is done by inverting the logic of nocb_leader_wake
tests to nocb_leader_sleep which allows us to use the default initialization of
this flag to 0 to let the kthreads run.
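
The effect of inverting the flag can be seen in a tiny sketch: with a
"sleep" flag, zero-initialized state means "run", so a freshly created
leader checks for pending callbacks without anyone having to set a flag
first.  The struct and helper names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-leader state; zero-filled at creation time. */
struct leader_sketch {
	bool nocb_leader_sleep;	/* false (zero) means "go look for callbacks" */
};

/* A leader runs unless it has explicitly been told that it may sleep. */
static bool leader_should_run(const struct leader_sketch *l)
{
	return !l->nocb_leader_sleep;
}
```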

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Link: http://www.spinics.net/lists/kernel/msg1802899.html
[ paulmck: Backported to v3.17-rc2. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Amit Shah <amit.shah@redhat.com>
2014-08-28 05:59:59 -07:00
Paul E. McKenney 187497fa5e rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
If there isn't a nohz_full= kernel parameter specified, then
tick_nohz_full_mask can legitimately be NULL.  This can cause
problems when RCU's boot code tries to cpumask_or() this value into
rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
in doing the cpumask_or() in the first place because this will cause
RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
rcu_nocb_mask.

This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
and checks for !tick_nohz_full_running otherwise, this latter check
catching cases when there was no nohz_full= kernel parameter specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-16 10:44:46 -07:00
Paul E. McKenney 1823172ab5 Merge branches 'doc.2014.07.08a', 'fixes.2014.07.09a', 'maintainers.2014.07.08b', 'nocbs.2014.07.07a' and 'torture.2014.07.07a' into HEAD
doc.2014.07.08a: Documentation updates.
fixes.2014.07.09a: Miscellaneous fixes.
maintainers.2014.07.08b: Maintainership updates.
nocbs.2014.07.07a: Callback-offloading fixes.
torture.2014.07.07a: Torture-test updates.
2014-07-09 09:16:54 -07:00
Pranith Kumar b41d1b924d rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
following sparse warning:

kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:51 -07:00
Pranith Kumar 615e41c605 rcu: Fix a sparse warning in rcu_initiate_boost()
This commit annotates rcu_initiate_boost() in order to fix the following
sparse warning:

	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:45 -07:00
Paul E. McKenney c0f489d2c6 rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
Binding the grace-period kthreads to the timekeeping CPU resulted in
significant performance decreases for some workloads.  For more detail,
see:

https://lkml.org/lkml/2014/6/3/395 for benchmark numbers

https://lkml.org/lkml/2014/6/4/218 for CPU statistics

It turns out that it is necessary to bind the grace-period kthreads
to the timekeeping CPU only when all CPUs other than CPU 0 are nohz_full
CPUs on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
In other cases, it suffices to bind the grace-period kthreads to the
set of non-nohz_full CPUs.

This commit therefore creates a tick_nohz_not_full_mask that is the
complement of tick_nohz_full_mask, and then binds the grace-period
kthread to the set of CPUs indicated by this new mask, which covers
the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
case still binds the grace-period kthreads to the timekeeping CPU.
This commit also includes the tick_nohz_full_enabled() check suggested
by Frederic Weisbecker.
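
The mask relationship can be illustrated with plain bit operations.  This
sketch uses a 64-bit integer in place of the kernel's cpumask type, and
the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bit N set in "possible" means CPU N exists, and
 * bit N set in "nohz_full" means CPU N is a nohz_full CPU.  The result
 * is the set of CPUs to which housekeeping kthreads may be bound.
 */
static uint64_t not_full_mask_sketch(uint64_t possible, uint64_t nohz_full)
{
	return possible & ~nohz_full;
}
```

For example, with eight CPUs of which CPUs 1-7 are nohz_full, only CPU 0
remains in the resulting mask.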

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Created housekeeping_affine() and housekeeping_mask per
  fweisbec feedback. ]
2014-07-09 09:15:02 -07:00
Paul E. McKenney abaa93d9e1 rcu: Simplify priority boosting by putting rt_mutex in rcu_node
RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it to see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:01 -07:00
Paul E. McKenney dfeb9765ce rcu: Allow post-unlock reference for rt_mutex
The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee.  When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
to boost the next task that is blocking the current RCU grace period.

But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it, and the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it.  This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.

This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:00 -07:00
Paul E. McKenney 4da117cfa7 rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
in code that is built only when CONFIG_NO_HZ_FULL.  This commit therefore
removes the redundant ACCESS_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:14:35 -07:00
Paul E. McKenney b58cc46c5f rcu: Don't offload callbacks unless specifically requested
Enabling NO_HZ_FULL currently has the side effect of enabling callback
offloading on all CPUs.  This results in lots of additional rcuo kthreads,
and can also increase context switching and wakeups, even in cases where
callback offloading is neither needed nor particularly desirable.  This
commit therefore enables callback offloading on a given CPU only if
specifically requested at build time or boot time, or if that CPU has
been specifically designated (again, either at build time or boot time)
as a nohz_full CPU.
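
The change in policy can be summarized with a small set-based model.
This is a sketch in Python under simplifying assumptions, not the kernel
logic itself; offloaded_before() and offloaded_after() are hypothetical
names, and build-time and boot-time requests are lumped together.

```python
def offloaded_before(all_cpus, nocb_requested, nohz_full_enabled):
    # Old behavior as described above: enabling NO_HZ_FULL offloads every CPU.
    return set(all_cpus) if nohz_full_enabled else set(nocb_requested)

def offloaded_after(nocb_requested, nohz_full_cpus):
    # New behavior: only explicitly requested CPUs plus designated
    # nohz_full CPUs are offloaded.
    return set(nocb_requested) | set(nohz_full_cpus)
```

For example, on an 8-CPU system with CPU 1 requested for offloading and
CPUs 6-7 designated nohz_full, the model offloads three CPUs rather than
all eight.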

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney fbce7497ee rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders, each
of which in turn awakens its followers.  By default, the number of groups
of kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by a factor of the square root of the number of
CPUs, though of course only by shifting those wakeups to the leaders.
In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
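
The arithmetic above can be checked with a short sketch.  This is a
Python illustration only: nocb_wakeups() is a hypothetical helper, and
rounding the square root down is an assumption that the message does not
spell out.

```python
import math

def nocb_wakeups(ncpus, stride=None):
    """Return (leaders, followers_per_leader) for a given CPU count."""
    if stride is None:
        stride = math.isqrt(ncpus)      # default: square root of CPU count
    stride = max(1, stride)
    leaders = math.ceil(ncpus / stride) # wakeups done by the GP kthread
    followers_per_leader = stride - 1   # wakeups done by each leader
    return leaders, followers_per_leader

print(nocb_wakeups(4096))   # (64, 63)
print(nocb_wakeups(80))     # (10, 7)
```

Passing an explicit stride models overriding the default with the
rcutree.rcu_nocb_leader_stride boot parameter.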

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney 4a81e8328d rcu: Reduce overhead of cond_resched() checks for RCU
Commit ac1bea8578 (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls.  Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.

This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable.  However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs.  These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.
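
The "whichever comes first" rule amounts to taking a minimum.  The
following Python sketch shows only that rule; the function names and the
jiffies values used with it are hypothetical, not taken from the kernel.

```python
def resched_ipi_age(jiffies_till_sched_qs, stall_warning_age):
    # Grace-period age (in jiffies) at which resched IPIs start being sent:
    # the configured threshold or half the stall-warning age, whichever
    # is smaller.
    return min(jiffies_till_sched_qs, stall_warning_age // 2)

def should_send_resched_ipi(gp_age, jiffies_till_sched_qs, stall_warning_age):
    return gp_age >= resched_ipi_age(jiffies_till_sched_qs, stall_warning_age)
```

With a threshold of 100 jiffies and a stall-warning age of 150 jiffies,
the halfway point (75) wins, so IPIs would start once the grace period is
75 jiffies old.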

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
  ktest build robot.  Also fixed smp_mb() comment as noted by
  Oleg Nesterov. ]

Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-23 11:19:32 -07:00
Linus Torvalds 776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Uma Sharma e534165bbf rcu: Variable name changed in tree_plugin.h and used in tree.c
The variable and struct both having the name "rcu_state" confuses
sparse in some situations, so this commit changes the variable to
"rcu_state_p" in order to avoid this confusion.  This also makes
things easier for human readers.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
[ paulmck: Changed the declaration and several additional uses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 11:41:04 -07:00
Christoph Lameter fa07a58f71 rcu: Replace __this_cpu_ptr() uses with raw_cpu_ptr()
__this_cpu_ptr is being phased out.

One special case is increment_cpu_stall_ticks().
A per cpu variable is incremented so use raw_cpu_inc().

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:35 -07:00
Paul E. McKenney becb41bfe0 rcu: Make large and small sysidle systems use same state machine
Currently, small systems move back into RCU_SYSIDLE_NOT from
RCU_SYSIDLE_SHORT and large systems do not.  This works because moving
aggressively to RCU_SYSIDLE_NOT affects only performance, not correctness,
and on small systems, the performance impact should be negligible.  That
said, this difference does make RCU a bit more complex, and RCU does not
seem to be suffering from any lack of complexity.  This commit therefore
adjusts small-system operation to match that of large systems, so that
the state never moves back to RCU_SYSIDLE_NOT from RCU_SYSIDLE_SHORT.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:24 -07:00
Paul E. McKenney 5057f55e54 rcu: Bind RCU grace-period kthreads if NO_HZ_FULL
Currently, RCU binds the grace-period kthreads to the timekeeping
CPU only if CONFIG_NO_HZ_FULL_SYSIDLE=y.  This means that these
kthreads must be bound manually when CONFIG_NO_HZ_FULL_SYSIDLE=n and
CONFIG_NO_HZ_FULL=y: Otherwise, these kthreads will induce OS jitter on
random CPUs.  Given that we are trying to reduce the amount of manual
tweaking required to make CONFIG_NO_HZ_FULL=y work nicely, this commit
makes this binding happen when CONFIG_NO_HZ_FULL=y, even in cases where
CONFIG_NO_HZ_FULL_SYSIDLE=n.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:19 -07:00
Andreea-Cristina Bernat a381d757d9 rcu: Merge rcu_sched_force_quiescent_state() with rcu_force_quiescent_state()
This patch merges the function rcu_force_quiescent_state() with
rcu_sched_force_quiescent_state(), using the rcu_state pointer.  First,
the rcu_sched_force_quiescent_state() function is deleted from
kernel/rcu/tree.c.  The rcu_force_quiescent_state() function that
called force_quiescent_state() with a pointer to rcu_preempt_state
is deleted as well.  The new function that combines the old ones uses
the rcu_state pointer and is located after rcu_batches_completed_bh()
in kernel/rcu/tree.c.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:07 -07:00
Andreea-Cristina Bernat 495aa969db rcu: Consolidate kfree_call_rcu() to use rcu_state pointer
kfree_call_rcu() is defined twice.  When defined under
CONFIG_TREE_PREEMPT_RCU, it uses rcu_preempt_state.  Otherwise, it uses
rcu_sched_state.  This patch uses the rcu_state pointer to combine the
two definitions into one.
The resulting function is placed after the closing of the preprocessor
conditional CONFIG_TREE_PREEMPT_RCU.

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:45:01 -07:00
Paul E. McKenney 48a7639ce8 rcu: Make callers awaken grace-period kthread
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread.  This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.

Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off.  This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.

However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs.  In this case, the tick might never
be turned back on, indefinitely deferring a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress.  Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling.  In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls.  Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.

In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.

This commit therefore makes this change.  The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed.  A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
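
The pattern of returning a "wakeup needed" boolean and waking only after
the lock is dropped can be modeled as follows.  This is a Python sketch,
not the kernel implementation: GpState, start_gp(), gp_kthread_wake(),
and request_gp() are hypothetical names, and a counter stands in for the
actual wakeup.

```python
import threading

class GpState:
    def __init__(self, kthread=None):
        self.lock = threading.Lock()    # stands in for the rcu_node lock
        self.gp_requested = False       # stands in for ->gp_flags
        self.kthread = kthread          # None until the kthread is created
        self.wakeups = 0

def start_gp(state):
    """Call with state.lock held; returns True if a wakeup is needed."""
    assert state.lock.locked()
    if state.gp_requested:
        return False                    # a request is already pending
    state.gp_requested = True
    return True

def gp_kthread_wake(state, current):
    if state.kthread is None:           # kthread not yet created
        return False
    if current is state.kthread:        # no self-wakes
        return False
    if not state.gp_requested:          # someone else already handled it
        return False
    state.wakeups += 1                  # the real wakeup would go here
    return True

def request_gp(state, current):
    with state.lock:
        needwake = start_gp(state)      # no wakeup while holding the lock
    if needwake:
        return gp_kthread_wake(state, current)
    return False
```

The key design point is that the wakeup happens only after the "with"
block exits, that is, after the lock that could deadlock against the
scheduler has been released.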

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:44:07 -07:00
Paul E. McKenney 365187fbc0 rcu: Update cpu_needs_another_gp() for futures from non-NOCB CPUs
In the old days, the only source of requests for future grace periods
was NOCB CPUs.  This has changed: CPUs routinely post requests for
future grace periods in order to promote power efficiency and reduce
OS jitter with minimal impact on grace-period latency.  This commit
therefore updates cpu_needs_another_gp() to invoke rcu_future_needs_gp()
instead of rcu_nocb_needs_gp().  The latter is no longer used, so is
now removed.  This commit also adds tracing for the irq_work_queue()
wakeup case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:43:32 -07:00
Liu Ping Fan 24342c963a rcu: Fix incorrect notes for code
Signed-off-by: Liu Ping Fan <kernelfans@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-04-29 08:43:19 -07:00
Peter Zijlstra 4e857c58ef arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 14:20:48 +02:00
Paul E. McKenney 322efba5b6 Merge branches 'doc.2014.02.24a', 'fixes.2014.02.26a' and 'rt.2014.02.17b' into HEAD
doc.2014.02.24a: Documentation changes
fixes.2014.02.26a: Miscellaneous fixes
rt.2014.02.17b: Response-time-related changes
2014-02-26 06:36:09 -08:00
Paul E. McKenney f1f399d128 rcu: Optimize RCU_FAST_NO_HZ for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then no CPU will ever have RCU callbacks
because these callbacks will instead be handled by the rcuo kthreads.
However, the current version of RCU_FAST_NO_HZ nevertheless checks for RCU
callbacks.  This commit therefore creates static inline implementations
of rcu_prepare_for_idle() and rcu_cleanup_after_idle() that are no-ops
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:33 -08:00
Paul E. McKenney ffa83fb565 rcu: Optimize rcu_needs_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always
return false.  However, the current version nevertheless checks
for RCU callbacks.  This commit therefore creates a static inline
implementation of rcu_needs_cpu() that unconditionally returns false
when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 16:03:09 -08:00
Paul E. McKenney 2f33b512a5 rcu: Optimize rcu_is_nocb_cpu() for RCU_NOCB_CPU_ALL
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_is_nocb_cpu() will always
return true.  However, the current version nevertheless checks
rcu_nocb_mask.  This commit therefore creates a static inline
implementation of rcu_is_nocb_cpu() that unconditionally returns
true when CONFIG_RCU_NOCB_CPU_ALL=y.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:32:48 -08:00
Paul Bolle 52e2bb958a rcu: Disambiguate CONFIG_RCU_NOCB_CPUs
This commit fixes a grammar issue in the rcu_nohz_full_cpu() comment
header, so that it is clear that the plural is CPUs not Kconfig options.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:02:08 -08:00
Paul E. McKenney 87de1cfdc5 rcu: Stop tracking FSF's postal address
All of the RCU source files have the usual GPL header, which contains a
long-obsolete postal address for FSF.  To avoid the need to track the
FSF office's movements, this commit substitutes the URL where GPL may
be found.

Reported-by: Greg KH <gregkh@linuxfoundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-02-17 15:01:37 -08:00
Linus Torvalds a693c46e14 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 - add RCU torture scripts/tooling
 - static analysis improvements
 - update RCU documentation
 - miscellaneous fixes

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  rcu: Remove "extern" from function declarations in kernel/rcu/rcu.h
  rcu: Remove "extern" from function declarations in include/linux/*rcu*.h
  rcu/torture: Dynamically allocate SRCU output buffer to avoid overflow
  rcu: Don't activate RCU core on NO_HZ_FULL CPUs
  rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
  rcu: Add an RCU_INITIALIZER for global RCU-protected pointers
  rcu: Make rcu_assign_pointer's assignment volatile and type-safe
  bonding: Use RCU_INIT_POINTER() for better overhead and for sparse
  rcu: Add comment on evaluate-once properties of rcu_assign_pointer().
  rcu: Provide better diagnostics for blocking in RCU callback functions
  rcu: Improve SRCU's grace-period comments
  rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values
  rcu: Fix coccinelle warnings
  rcutorture: Stop tracking FSF's postal address
  rcutorture: Move checkarg to functions.sh
  rcutorture: Flag errors and warnings with color coding
  rcutorture: Record results from repeated runs of the same test scenario
  rcutorture: Test summary at end of run with less chattiness
  rcutorture: Update comment in kvm.sh listing typical RCU trace events
  rcutorture: Add tracing-enabled version of TREE08
  ...
2014-01-20 10:25:12 -08:00
Ingo Molnar bb799d3b98 Linux 3.13-rc4

Merge tag 'v3.13-rc4' into core/locking

Merge Linux 3.13-rc4, to refresh this rather old tree with the latest fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:27:08 +01:00
Ingo Molnar 73a7ac2808 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull v3.14 RCU updates from Paul E. McKenney.

The main changes:

  * Update RCU documentation.

  * Miscellaneous fixes.

  * Add RCU torture scripts.

  * Static-analysis improvements.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-16 11:43:41 +01:00
Paul E. McKenney 6303b9c87d rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods
RCU must ensure that there is the equivalent of a full memory
barrier between any memory access preceding a grace period and any
memory access following that same grace period, regardless of
which CPU(s) happen to execute the two memory accesses.
Therefore, downgrading UNLOCK+LOCK to no longer imply a full
memory barrier requires some adjustments to RCU.

This commit therefore adds smp_mb__after_unlock_lock()
invocations as needed after the RCU lock acquisitions that need
to be part of a full-memory-barrier UNLOCK+LOCK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-16 11:36:16 +01:00
Paul E. McKenney a096932f0c rcu: Don't activate RCU core on NO_HZ_FULL CPUs
Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
if the RCU core needs anything from this CPU.  If so, RCU raises
RCU_SOFTIRQ to carry out any needed processing.

This approach has worked well historically, but it is undesirable on
NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
in userspace, so that scheduling-clock interrupts can be disabled while
there is only one runnable task on the CPU in question.  Unfortunately,
raising any softirq has the potential to wake up ksoftirqd, which would
provide the second runnable task on that CPU, preventing disabling of
scheduling-clock interrupts.

What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
relying on the grace-period kthreads' quiescent-state forcing to
do any needed RCU work on behalf of those CPUs.

This commit therefore refrains from raising RCU_SOFTIRQ on any
NO_HZ_FULL CPUs during any grace periods that have been in effect
for less than one second.  The one-second limit handles the case
where an inappropriate workload is running on a NO_HZ_FULL CPU
that features lots of scheduling-clock interrupts, but no idle
or userspace time.
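
The resulting decision can be written as a small predicate.  This is a
Python sketch of the rule as stated above, not the kernel code:
should_raise_rcu_softirq() and the HZ value are hypothetical.

```python
HZ = 1000   # hypothetical tick rate: jiffies per second

def should_raise_rcu_softirq(rcu_needs_cpu, is_nohz_full_cpu,
                             gp_age_jiffies, hz=HZ):
    if not rcu_needs_cpu:
        return False                    # nothing for RCU core to do here
    if is_nohz_full_cpu and gp_age_jiffies < hz:
        return False                    # leave the NO_HZ_FULL CPU alone
    return True                         # ordinary CPU, or GP older than 1s
```

A NO_HZ_FULL CPU is thus disturbed only when the grace period has
already lasted at least one second.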

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <bitbucket@online.de>
Toasted-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-12-12 12:34:15 -08:00
Lai Jiangshan 79a62f957e rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
After commit #10f39bb1b2c1 (rcu: protect __rcu_read_unlock() against
scheduler-using irq handlers), it is no longer possible to enter
the main body of rcu_read_unlock_special() from an NMI, interrupt, or
softirq handler.  In theory, this implies that the check for "in_irq()
|| in_serving_softirq()" must always fail, so that this check
could be removed entirely.

In practice, this commit wraps this condition with a WARN_ON_ONCE().
If this warning never triggers, then the condition will be removed
entirely.

[ paulmck: And one way of triggering the WARN_ON() is if a scheduling
  clock interrupt occurs in an RCU read-side critical section, setting
  RCU_READ_UNLOCK_NEED_QS, which is handled by rcu_read_unlock_special().
  Updated this commit to return if only that bit was set. ]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-12-12 12:34:00 -08:00
Paul E. McKenney 96d3fd0d31 rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".
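
The three-part scheme (detect, defer, eventually deliver) can be modeled
in a few lines.  This is a Python sketch only: RcuData, call_rcu_nocb(),
and do_deferred_wakeup() are hypothetical names, the boolean argument
stands in for the irqs_disabled_flags() test, and a counter stands in
for waking the rcuo kthread.

```python
class RcuData:
    def __init__(self):
        self.nocb_defer_wakeup = False  # models ->nocb_defer_wakeup
        self.wakeups = 0                # counts rcuo kthread wakeups

def call_rcu_nocb(rdp, irqs_disabled):
    if irqs_disabled:
        # Scheduler locks might be held: remember that a wakeup is owed.
        rdp.nocb_defer_wakeup = True
    else:
        rdp.wakeups += 1                # safe to wake immediately

def do_deferred_wakeup(rdp):
    """Run at check points: scheduling-clock tick, idle or user entry."""
    if not rdp.nocb_defer_wakeup:
        return False
    rdp.nocb_defer_wakeup = False
    rdp.wakeups += 1
    return True
```

The model may defer more often than strictly necessary, which matches
the observation above that over-deferring is at least safe.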

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-03 10:10:18 -08:00