From 7455cdd1a0fe9a1367ee99596ea2564031daec00 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 11 Feb 2019 12:13:57 -0800 Subject: [PATCH 01/86] tools/memory-model: Make scripts be executable This commit simplifies life a bit by making all of the scripts in tools/memory-model/scripts be executable. Signed-off-by: Paul E. McKenney --- tools/memory-model/scripts/checkghlitmus.sh | 0 tools/memory-model/scripts/checklitmushist.sh | 0 tools/memory-model/scripts/cmplitmushist.sh | 0 tools/memory-model/scripts/initlitmushist.sh | 0 tools/memory-model/scripts/judgelitmus.sh | 0 tools/memory-model/scripts/newlitmushist.sh | 0 tools/memory-model/scripts/parseargs.sh | 0 tools/memory-model/scripts/runlitmushist.sh | 0 8 files changed, 0 insertions(+), 0 deletions(-) mode change 100644 => 100755 tools/memory-model/scripts/checkghlitmus.sh mode change 100644 => 100755 tools/memory-model/scripts/checklitmushist.sh mode change 100644 => 100755 tools/memory-model/scripts/cmplitmushist.sh mode change 100644 => 100755 tools/memory-model/scripts/initlitmushist.sh mode change 100644 => 100755 tools/memory-model/scripts/judgelitmus.sh mode change 100644 => 100755 tools/memory-model/scripts/newlitmushist.sh mode change 100644 => 100755 tools/memory-model/scripts/parseargs.sh mode change 100644 => 100755 tools/memory-model/scripts/runlitmushist.sh diff --git a/tools/memory-model/scripts/checkghlitmus.sh b/tools/memory-model/scripts/checkghlitmus.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/checklitmushist.sh b/tools/memory-model/scripts/checklitmushist.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/cmplitmushist.sh b/tools/memory-model/scripts/cmplitmushist.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/initlitmushist.sh b/tools/memory-model/scripts/initlitmushist.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/judgelitmus.sh b/tools/memory-model/scripts/judgelitmus.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/newlitmushist.sh b/tools/memory-model/scripts/newlitmushist.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/parseargs.sh b/tools/memory-model/scripts/parseargs.sh old mode 100644 new mode 100755 diff --git a/tools/memory-model/scripts/runlitmushist.sh b/tools/memory-model/scripts/runlitmushist.sh old mode 100644 new mode 100755 From d143b3d1cd89f6bcab67dc88160914aa3536c663 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 22 Jun 2019 12:05:54 -0700 Subject: [PATCH 02/86] rcu: Simplify rcu_read_unlock_special() deferred wakeups In !use_softirq runs, we clearly cannot rely on raise_softirq() and its lightweight bit setting, so we must instead do some form of wakeup. In the absence of a self-IPI when interrupts are disabled, these wakeups can be delayed until the next interrupt occurs. This means that calling invoke_rcu_core() doesn't actually do any expediting. In this case, it is better to take the "else" clause, which sets the current CPU's resched bits and, if there is an expedited grace period in flight, uses IRQ-work to force the needed self-IPI. This commit therefore removes the "else if" clause that calls invoke_rcu_core(). Reported-by: Scott Wood Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree_plugin.h | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index acb225023ed1..3f0701e860e4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -631,17 +631,12 @@ static void rcu_read_unlock_special(struct task_struct *t) // Using softirq, safe to awaken, and we get // no help from enabling irqs, unlike bh/preempt. raise_softirq_irqoff(RCU_SOFTIRQ); - } else if (exp && irqs_were_disabled && !use_softirq && - !t->rcu_read_unlock_special.b.deferred_qs) { - // Safe to awaken and we get no help from enabling - // irqs, unlike bh/preempt. - invoke_rcu_core(); } else { // Enabling BH or preempt does reschedule, so... // Also if no expediting or NO_HZ_FULL, slow is OK. set_tsk_need_resched(current); set_preempt_need_resched(); - if (IS_ENABLED(CONFIG_IRQ_WORK) && + if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && !rdp->defer_qs_iw_pending && exp) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. From 87446b48748b49dd34900904649a5ec95a591699 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 28 Jun 2019 11:25:26 -0700 Subject: [PATCH 03/86] rcu: Make rcu_read_unlock_special() checks match raise_softirq_irqoff() Threaded interrupts provide additional interesting interactions between RCU and raise_softirq() that can result in self-deadlocks in v5.0-2 of the Linux kernel. These self-deadlocks can be provoked in susceptible kernels within a few minutes using the following rcutorture command on an 8-CPU system: tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --configs "TREE03" --bootargs "threadirqs" Although post-v5.2 RCU commits have at least greatly reduced the probability of these self-deadlocks, this was entirely by accident. Although this sort of accident should be rowdily celebrated on those rare occasions when it does occur, such celebrations should be quickly followed by a principled patch, which is what this patch purports to be. The key point behind this patch is that when in_interrupt() returns true, __raise_softirq_irqoff() will never attempt a wakeup. Therefore, if in_interrupt(), calls to raise_softirq*() are both safe and extremely cheap. This commit therefore replaces the in_irq() calls in the "if" statement in rcu_read_unlock_special() with in_interrupt() and simplifies the "if" condition to the following: if (irqs_were_disabled && use_softirq && (in_interrupt() || (exp && !t->rcu_read_unlock_special.b.deferred_qs))) { raise_softirq_irqoff(RCU_SOFTIRQ); } else { /* Appeal to the scheduler. */ } The rationale behind the "if" condition is as follows: 1. irqs_were_disabled: If interrupts are enabled, we should instead appeal to the scheduler so as to let the upcoming irq_enable()/local_bh_enable() do the rescheduling for us. 2. use_softirq: If this kernel isn't using softirq, then raise_softirq_irqoff() will be unhelpful. 3. a. in_interrupt(): If this returns true, the subsequent call to raise_softirq_irqoff() is guaranteed not to do a wakeup, so that call will be both very cheap and quite safe. b. Otherwise, if !in_interrupt() the raise_softirq_irqoff() might do a wakeup, which is expensive and, in some contexts, unsafe. i. The "exp" (an expedited RCU grace period is being blocked) says that the wakeup is worthwhile, and: ii. The !.deferred_qs says that scheduler locks cannot be held, so the wakeup will be safe. Backporting this requires considerable care, so no auto-backport, please! 
Fixes: 05f415715ce45 ("rcu: Speed up expedited GPs when interrupting RCU reader") Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 3f0701e860e4..1fd3ca4ffc1d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -626,8 +626,9 @@ static void rcu_read_unlock_special(struct task_struct *t) (rdp->grpmask & rnp->expmask) || tick_nohz_full_cpu(rdp->cpu); // Need to defer quiescent state until everything is enabled. - if ((exp || in_irq()) && irqs_were_disabled && use_softirq && - (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) { + if (irqs_were_disabled && use_softirq && + (in_interrupt() || + (exp && !t->rcu_read_unlock_special.b.deferred_qs))) { // Using softirq, safe to awaken, and we get // no help from enabling irqs, unlike bh/preempt. raise_softirq_irqoff(RCU_SOFTIRQ); From cb4dbbfaa1f5a190f041b174177699a009ab2ecd Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 1 Jul 2019 00:04:14 -0400 Subject: [PATCH 04/86] rcu: Simplify rcu_note_context_switch exit from critical section Because __rcu_read_unlock() can be preempted just before the call to rcu_read_unlock_special(), it is possible for a task to be preempted just before it would have fully exited its RCU read-side critical section. This would result in a needless extension of that critical section until that task was resumed, which might in turn result in a needlessly long grace period, needless RCU priority boosting, and needless force-quiescent-state actions. Therefore, rcu_note_context_switch() invokes __rcu_read_unlock() followed by rcu_preempt_deferred_qs() when it detects this situation. This action by rcu_note_context_switch() ends the RCU read-side critical section immediately. Of course, once the task resumes, it will invoke rcu_read_unlock_special() redundantly. This is harmless because the fact that a preemption happened means that interrupts, preemption, and softirqs cannot have been disabled, so there would be no deferred quiescent state. While ->rcu_read_lock_nesting remains less than zero, none of the ->rcu_read_unlock_special.b bits can be set, and they were all zeroed by the call to rcu_note_context_switch() at task-preemption time. Therefore, setting ->rcu_read_unlock_special.b.exp_hint to false has no effect. Therefore, the extra call to rcu_preempt_deferred_qs_irqrestore() would return immediately. With one possible exception, which is if an expedited grace period started just as the task was being resumed, which could leave ->exp_deferred_qs set. This will cause rcu_preempt_deferred_qs_irqrestore() to invoke rcu_report_exp_rdp(), reporting the quiescent state, just as it should. (Such an expedited grace period won't affect the preemption code path due to interrupts having already been disabled.) But when rcu_note_context_switch() invokes __rcu_read_unlock(), it is doing so with preemption disabled, hence __rcu_read_unlock() will unconditionally defer the quiescent state, only to immediately invoke rcu_preempt_deferred_qs(), thus immediately reporting the deferred quiescent state. It turns out to be safe (and faster) to instead just invoke rcu_preempt_deferred_qs() without the __rcu_read_unlock() middleman. 
Because this is the invocation during the preemption (as opposed to the invocation just after the resume), at least one of the bits in ->rcu_read_unlock_special.b must be set and ->rcu_read_lock_nesting must be negative. This means that rcu_preempt_need_deferred_qs() must return true, avoiding the early exit from rcu_preempt_deferred_qs(). Thus, rcu_preempt_deferred_qs_irqrestore() will be invoked immediately, as required. This commit therefore simplifies the CONFIG_PREEMPT=y version of rcu_note_context_switch() by removing the "else if" branch of its "if" statement. This change means that all callers that would have invoked rcu_read_unlock_special() followed by rcu_preempt_deferred_qs() will now simply invoke rcu_preempt_deferred_qs(), thus avoiding the rcu_read_unlock_special() middleman when __rcu_read_unlock() is preempted. Cc: rcu@vger.kernel.org Cc: kernel-team@android.com Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 9 --------- 1 file changed, 9 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1fd3ca4ffc1d..ce6ef345102b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -314,15 +314,6 @@ void rcu_note_context_switch(bool preempt) ? rnp->gp_seq : rcu_seq_snap(&rnp->gp_seq)); rcu_preempt_ctxt_queue(rnp, rdp); - } else if (t->rcu_read_lock_nesting < 0 && - t->rcu_read_unlock_special.s) { - - /* - * Complete exit from RCU read-side critical section on - * behalf of preempted instance of __rcu_read_unlock(). - */ - rcu_read_unlock_special(t); - rcu_preempt_deferred_qs(t); } else { rcu_preempt_deferred_qs(t); } From 519248f36d6f3c80e176f6fa844c10d94f1f5990 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 30 May 2019 05:39:25 -0700 Subject: [PATCH 05/86] lockdep: Make print_lock() address visible Security is a wonderful thing, but so is the ability to debug based on lockdep warnings. This commit therefore makes lockdep lock addresses visible in the clear. Signed-off-by: Paul E. McKenney --- kernel/locking/lockdep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 4861cf8e274b..4aca3f4379d2 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -620,7 +620,7 @@ static void print_lock(struct held_lock *hlock) return; } - printk(KERN_CONT "%p", hlock->instance); + printk(KERN_CONT "%px", hlock->instance); print_lock_name(lock); printk(KERN_CONT ", at: %pS\n", (void *)hlock->acquire_ip); } From b55bd585551ed2220eefdab96b31e6f935310eec Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 30 May 2019 05:39:25 -0700 Subject: [PATCH 06/86] time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint The TASKS03 and TREE04 rcutorture scenarios produce the following lockdep complaint: ------------------------------------------------------------------------ ================================ WARNING: inconsistent lock state 5.2.0-rc1+ #513 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. 
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes: (____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70 {IN-HARDIRQ-W} state was registered at: lock_acquire+0xb0/0x1c0 _raw_spin_lock_irqsave+0x3c/0x50 tick_broadcast_switch_to_oneshot+0xd/0x40 tick_switch_to_oneshot+0x4f/0xd0 hrtimer_run_queues+0xf3/0x130 run_local_timers+0x1c/0x50 update_process_times+0x1c/0x50 tick_periodic+0x26/0xc0 tick_handle_periodic+0x1a/0x60 smp_apic_timer_interrupt+0x80/0x2a0 apic_timer_interrupt+0xf/0x20 _raw_spin_unlock_irqrestore+0x4e/0x60 rcu_nocb_gp_kthread+0x15d/0x590 kthread+0xf3/0x130 ret_from_fork+0x3a/0x50 irq event stamp: 171 hardirqs last enabled at (171): [] trace_hardirqs_on_thunk+0x1a/0x1c hardirqs last disabled at (170): [] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (0): [] copy_process.part.56+0x650/0x1cb0 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(tick_broadcast_lock); lock(tick_broadcast_lock); *** DEADLOCK *** 1 lock held by migration/1/14: #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30 stack backtrace: CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x5e/0x8b print_usage_bug+0x1fc/0x216 ? print_shortest_lock_dependencies+0x1b0/0x1b0 mark_lock+0x1f2/0x280 __lock_acquire+0x1e0/0x18f0 ? __lock_acquire+0x21b/0x18f0 ? _raw_spin_unlock_irqrestore+0x4e/0x60 lock_acquire+0xb0/0x1c0 ? tick_broadcast_offline+0xf/0x70 _raw_spin_lock+0x33/0x40 ? tick_broadcast_offline+0xf/0x70 tick_broadcast_offline+0xf/0x70 tick_offline_cpu+0x16/0x30 take_cpu_down+0x7d/0xa0 multi_cpu_stop+0xa2/0xe0 ? cpu_stop_queue_work+0xc0/0xc0 cpu_stopper_thread+0x6d/0x100 smpboot_thread_fn+0x169/0x240 kthread+0xf3/0x130 ? sort_range+0x20/0x20 ? kthread_cancel_delayed_work_sync+0x10/0x10 ret_from_fork+0x3a/0x50 ------------------------------------------------------------------------ To reproduce, run the following rcutorture test: tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04" It turns out that tick_broadcast_offline() was an innocent bystander. After all, interrupts are supposed to be disabled throughout take_cpu_down(), and therefore should have been disabled upon entry to tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests that one of the CPU-hotplug notifiers was incorrectly enabling interrupts, and leaving them enabled on return. Some debugging code showed that the culprit was sched_cpu_dying(). It had irqs enabled after return from sched_tick_stop(). Which in turn had irqs enabled after return from cancel_delayed_work_sync(). Which is a wrapper around __cancel_work_timer(). Which can sleep in the case where something else is concurrently trying to cancel the same delayed work, and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad idea when you are invoked from take_cpu_down(), regardless of the state you leave interrupts in upon return. Code inspection located no reason why the delayed work absolutely needed to be canceled from sched_tick_stop(): The work is not bound to the outgoing CPU by design, given that the whole point is to collect statistics without disturbing the outgoing CPU. This commit therefore simply drops the cancel_delayed_work_sync() from sched_tick_stop(). 
Instead, a new ->state field is added to the tick_work structure so that the delayed-work handler function sched_tick_remote() can avoid reposting itself. A cpu_is_offline() check is also added to sched_tick_remote() to avoid mucking with the state of an offlined CPU (though it does appear safe to do so). The sched_tick_start() and sched_tick_stop() functions also update ->state, and sched_tick_start() also schedules the delayed work if ->state indicates that it is not already in flight. Signed-off-by: Paul E. McKenney Cc: Ingo Molnar Cc: Peter Zijlstra Reviewed-by: Frederic Weisbecker [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ] Acked-by: Peter Zijlstra (Intel) --- kernel/sched/core.c | 57 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 49 insertions(+), 8 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2b037f195473..0b22e55cebe8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3486,8 +3486,36 @@ void scheduler_tick(void) struct tick_work { int cpu; + atomic_t state; struct delayed_work work; }; +/* Values for ->state, see diagram below. */ +#define TICK_SCHED_REMOTE_OFFLINE 0 +#define TICK_SCHED_REMOTE_OFFLINING 1 +#define TICK_SCHED_REMOTE_RUNNING 2 + +/* + * State diagram for ->state: + * + * + * TICK_SCHED_REMOTE_OFFLINE + * | ^ + * | | + * | | sched_tick_remote() + * | | + * | | + * +--TICK_SCHED_REMOTE_OFFLINING + * | ^ + * | | + * sched_tick_start() | | sched_tick_stop() + * | | + * V | + * TICK_SCHED_REMOTE_RUNNING + * + * + * Other transitions get WARN_ON_ONCE(), except that sched_tick_remote() + * and sched_tick_start() are happy to leave the state in RUNNING. + */ static struct tick_work __percpu *tick_work_cpu; @@ -3500,6 +3528,7 @@ static void sched_tick_remote(struct work_struct *work) struct task_struct *curr; struct rq_flags rf; u64 delta; + int os; /* * Handle the tick only if it appears the remote CPU is running in full @@ -3513,7 +3542,7 @@ static void sched_tick_remote(struct work_struct *work) rq_lock_irq(rq, &rf); curr = rq->curr; - if (is_idle_task(curr)) + if (is_idle_task(curr) || cpu_is_offline(cpu)) goto out_unlock; update_rq_clock(rq); @@ -3533,13 +3562,18 @@ out_requeue: /* * Run the remote tick once per second (1Hz). This arbitrary * frequency is large enough to avoid overload but short enough - * to keep scheduler internal stats reasonably up to date. + * to keep scheduler internal stats reasonably up to date. But + * first update state to reflect hotplug activity if required. 
*/ - queue_delayed_work(system_unbound_wq, dwork, HZ); + os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING); + WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE); + if (os == TICK_SCHED_REMOTE_RUNNING) + queue_delayed_work(system_unbound_wq, dwork, HZ); } static void sched_tick_start(int cpu) { + int os; struct tick_work *twork; if (housekeeping_cpu(cpu, HK_FLAG_TICK)) @@ -3548,15 +3582,20 @@ static void sched_tick_start(int cpu) WARN_ON_ONCE(!tick_work_cpu); twork = per_cpu_ptr(tick_work_cpu, cpu); - twork->cpu = cpu; - INIT_DELAYED_WORK(&twork->work, sched_tick_remote); - queue_delayed_work(system_unbound_wq, &twork->work, HZ); + os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_RUNNING); + WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING); + if (os == TICK_SCHED_REMOTE_OFFLINE) { + twork->cpu = cpu; + INIT_DELAYED_WORK(&twork->work, sched_tick_remote); + queue_delayed_work(system_unbound_wq, &twork->work, HZ); + } } #ifdef CONFIG_HOTPLUG_CPU static void sched_tick_stop(int cpu) { struct tick_work *twork; + int os; if (housekeeping_cpu(cpu, HK_FLAG_TICK)) return; @@ -3564,7 +3603,10 @@ static void sched_tick_stop(int cpu) WARN_ON_ONCE(!tick_work_cpu); twork = per_cpu_ptr(tick_work_cpu, cpu); - cancel_delayed_work_sync(&twork->work); + /* There cannot be competing actions, but don't rely on stop-machine. */ + os = atomic_xchg(&twork->state, TICK_SCHED_REMOTE_OFFLINING); + WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING); + /* Don't cancel, as this would mess up the state machine. */ } #endif /* CONFIG_HOTPLUG_CPU */ @@ -3572,7 +3614,6 @@ int __init sched_tick_offload_init(void) { tick_work_cpu = alloc_percpu(struct tick_work); BUG_ON(!tick_work_cpu); - return 0; } From 1f3ebc8253ee56bfaa883c5114fb5569c56f6197 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 4 Jun 2019 14:05:52 -0700 Subject: [PATCH 07/86] rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock() Commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny and Tree RCU readers") removed the barrier() calls from rcu_read_lock() and rcu_write_lock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels. Within RCU, this commit was OK, but it failed to account for things like get_user() that can pagefault and that can be reordered by the compiler. Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock() can cause these page faults to migrate into RCU read-side critical sections, which in CONFIG_PREEMPT=n kernels could result in too-short grace periods and arbitrary misbehavior. Please see commit 386afc91144b ("spinlocks and preemption points need to be at least compiler barriers") and Linus's commit 66be4e66a7f4 ("rcu: locking and unlocking need to always be at least barriers"), this last of which restores the barrier() call to both rcu_read_lock() and rcu_read_unlock(). This commit removes barrier() calls that are no longer needed given that the addition of them in Linus's commit noted above. The combination of this commit and Linus's commit effectively reverts commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny and Tree RCU readers"). Reported-by: Herbert Xu Reported-by: Linus Torvalds Signed-off-by: Paul E. McKenney [ paulmck: Fix embarrassing typo located by Alan Stern. 
] --- .../RCU/Design/Requirements/Requirements.html | 71 +++++++++++++++++++ kernel/rcu/tree_plugin.h | 11 --- 2 files changed, 71 insertions(+), 11 deletions(-) diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 5a9238a2883c..f04c467e55c5 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
  • Hotplug CPU.
  • Scheduler and RCU.
  • Tracing and RCU.
+  • Accesses to User Memory and RCU.
  • Energy Efficiency.
  • Scheduling-Clock Interrupts and RCU.
@@ -2521,6 +2523,75 @@ cannot be used.
The tracing folks both located the requirement and provided the needed fix, so this surprise requirement was relatively painless.
+

    +Accesses to User Memory and RCU

+
+

+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The get_user() macro does this job.
+
+

+However, user-space memory might well be paged out, which means
+that get_user() might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a get_user() invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+

    +
    + 1 rcu_read_lock();
    + 2 p = rcu_dereference(gp);
    + 3 v = p->value;
    + 4 rcu_read_unlock();
    + 5 get_user(user_v, user_p);
    + 6 do_something_with(v, user_v);
    +
    +
    + +

+The compiler must not be permitted to transform this source code into
+the following:
+
+

    +
    + 1 rcu_read_lock();
    + 2 p = rcu_dereference(gp);
    + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
    + 4 v = p->value;
    + 5 rcu_read_unlock();
    + 6 do_something_with(v, user_v);
    +
    +
    + +

+If the compiler did make this transformation in a
+CONFIG_PREEMPT=n kernel build, and if get_user() did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line 4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to get_user()
+preceding the rcu_read_lock().
+
+

+Unfortunately, get_user() doesn't have any particular
+ordering properties, and in some architectures the underlying asm
+isn't even marked volatile.
+And even if it was marked volatile, the above access to
+p->value is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
+
+

+Therefore, the Linux-kernel definitions of rcu_read_lock()
+and rcu_read_unlock() must act as compiler barriers,
+at least for outermost instances of rcu_read_lock() and
+rcu_read_unlock() within a nested set of RCU read-side critical
+sections.
+

    Energy Efficiency

    diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index acb225023ed1..3f1b5041de9b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -288,7 +288,6 @@ void rcu_note_context_switch(bool preempt) struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp; - barrier(); /* Avoid RCU read-side critical sections leaking down. */ trace_rcu_utilization(TPS("Start context switch")); lockdep_assert_irqs_disabled(); WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0); @@ -340,7 +339,6 @@ void rcu_note_context_switch(bool preempt) if (rdp->exp_deferred_qs) rcu_report_exp_rdp(rdp); trace_rcu_utilization(TPS("End context switch")); - barrier(); /* Avoid RCU read-side critical sections leaking up. */ } EXPORT_SYMBOL_GPL(rcu_note_context_switch); @@ -828,11 +826,6 @@ static void rcu_qs(void) * dyntick-idle quiescent state visible to other CPUs, which will in * some cases serve for expedited as well as normal grace periods. * Either way, register a lightweight quiescent state. - * - * The barrier() calls are redundant in the common case when this is - * called externally, but just in case this is called from within this - * file. - * */ void rcu_all_qs(void) { @@ -847,14 +840,12 @@ void rcu_all_qs(void) return; } this_cpu_write(rcu_data.rcu_urgent_qs, false); - barrier(); /* Avoid RCU read-side critical sections leaking down. */ if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) { local_irq_save(flags); rcu_momentary_dyntick_idle(); local_irq_restore(flags); } rcu_qs(); - barrier(); /* Avoid RCU read-side critical sections leaking up. */ preempt_enable(); } EXPORT_SYMBOL_GPL(rcu_all_qs); @@ -864,7 +855,6 @@ EXPORT_SYMBOL_GPL(rcu_all_qs); */ void rcu_note_context_switch(bool preempt) { - barrier(); /* Avoid RCU read-side critical sections leaking down. */ trace_rcu_utilization(TPS("Start context switch")); rcu_qs(); /* Load rcu_urgent_qs before other flags. */ @@ -877,7 +867,6 @@ void rcu_note_context_switch(bool preempt) rcu_tasks_qs(current); out: trace_rcu_utilization(TPS("End context switch")); - barrier(); /* Avoid RCU read-side critical sections leaking up. */ } EXPORT_SYMBOL_GPL(rcu_note_context_switch); From cdc694b2359d52cd6d0465d5a6263d97c786fb0c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 13 Jun 2019 15:30:49 -0700 Subject: [PATCH 08/86] rcu: Add kernel parameter to dump trace after RCU CPU stall warning This commit adds a rcu_cpu_stall_ftrace_dump kernel boot parameter, that, when set, causes the trace buffer to be dumped after an RCU CPU stall warning is printed. This kernel boot parameter is disabled by default, maintaining compatibility with previous behavior. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ kernel/rcu/rcu.h | 1 + kernel/rcu/tree_stall.h | 4 ++++ kernel/rcu/update.c | 2 ++ 4 files changed, 11 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7ccd158b3894..f3fcd6140ee1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4047,6 +4047,10 @@ rcutorture.verbose= [KNL] Enable additional printk() statements. + rcupdate.rcu_cpu_stall_ftrace_dump= [KNL] + Dump ftrace buffer after reporting RCU CPU + stall warning. + rcupdate.rcu_cpu_stall_suppress= [KNL] Suppress RCU CPU stall warning messages. 
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 5290b01de534..8fd4f82c9b3d 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -227,6 +227,7 @@ static inline bool __rcu_reclaim(const char *rn, struct rcu_head *head) #ifdef CONFIG_RCU_STALL_COMMON +extern int rcu_cpu_stall_ftrace_dump; extern int rcu_cpu_stall_suppress; extern int rcu_cpu_stall_timeout; int rcu_jiffies_till_stall_check(void); diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 065183391f75..0627a66699a6 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -527,6 +527,8 @@ static void check_cpu_stall(struct rcu_data *rdp) /* We haven't checked in, so go dump stack. */ print_cpu_stall(); + if (rcu_cpu_stall_ftrace_dump) + rcu_ftrace_dump(DUMP_ALL); } else if (rcu_gp_in_progress() && ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) && @@ -534,6 +536,8 @@ static void check_cpu_stall(struct rcu_data *rdp) /* They had a few time units to dump stack, so complain. */ print_other_cpu_stall(gs2); + if (rcu_cpu_stall_ftrace_dump) + rcu_ftrace_dump(DUMP_ALL); } } diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 61df2bf08563..249517058b13 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -437,6 +437,8 @@ EXPORT_SYMBOL_GPL(rcutorture_sched_setaffinity); #endif #ifdef CONFIG_RCU_STALL_COMMON +int rcu_cpu_stall_ftrace_dump __read_mostly; +module_param(rcu_cpu_stall_ftrace_dump, int, 0644); int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */ EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress); module_param(rcu_cpu_stall_suppress, int, 0644); From fbad01af8c3bb9618848abde8054ab7e0c2330fe Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 19 Jun 2019 15:42:51 -0700 Subject: [PATCH 09/86] rcu: Add destroy_work_on_stack() to match INIT_WORK_ONSTACK() The synchronize_rcu_expedited() function has an INIT_WORK_ONSTACK(), but lacks the corresponding destroy_work_on_stack(). This commit therefore adds destroy_work_on_stack(). Reported-by: Andrea Arcangeli Signed-off-by: Paul E. McKenney Acked-by: Andrea Arcangeli --- kernel/rcu/tree_exp.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index af7e7b9c86af..513b403b683b 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -792,6 +792,7 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp) */ void synchronize_rcu_expedited(void) { + bool boottime = (rcu_scheduler_active == RCU_SCHEDULER_INIT); struct rcu_exp_work rew; struct rcu_node *rnp; unsigned long s; @@ -817,7 +818,7 @@ void synchronize_rcu_expedited(void) return; /* Someone else did our work for us. */ /* Ensure that load happens before action based on it. */ - if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) { + if (unlikely(boottime)) { /* Direct call during scheduler init and early_initcalls(). */ rcu_exp_sel_wait_wake(s); } else { @@ -835,5 +836,8 @@ void synchronize_rcu_expedited(void) /* Let the next expedited grace period start. */ mutex_unlock(&rcu_state.exp_mutex); + + if (likely(!boottime)) + destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); From 7e210a653ec9445512534cd235cac29e7301af2a Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Fri, 28 Jun 2019 17:11:10 -0700 Subject: [PATCH 10/86] srcu: Avoid srcutorture security-based pointer obfuscation Because pointer output is now obfuscated, and because what you really want to know is whether or not the callback lists are empty, this commit replaces the srcu_data structure's head callback pointer printout with a single character that is "." is the callback list is empty or "C" otherwise. This is the only remaining user of rcu_segcblist_head(), so this commit also removes this function's definition. It also turns out that rcu_segcblist_tail() no longer has any callers, so this commit removes that function's definition while in the area. They were both marked "Interim", and their end has come. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.h | 21 --------------------- kernel/rcu/srcutree.c | 5 +++-- 2 files changed, 3 insertions(+), 23 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 71b64648464e..822a39da0533 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -76,27 +76,6 @@ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg) return !*rsclp->tails[seg]; } -/* - * Interim function to return rcu_segcblist head pointer. Longer term, the - * rcu_segcblist will be used more pervasively, removing the need for this - * function. - */ -static inline struct rcu_head *rcu_segcblist_head(struct rcu_segcblist *rsclp) -{ - return rsclp->head; -} - -/* - * Interim function to return rcu_segcblist head pointer. Longer term, the - * rcu_segcblist will be used more pervasively, removing the need for this - * function. - */ -static inline struct rcu_head **rcu_segcblist_tail(struct rcu_segcblist *rsclp) -{ - WARN_ON_ONCE(rcu_segcblist_empty(rsclp)); - return rsclp->tails[RCU_NEXT_TAIL]; -} - void rcu_segcblist_init(struct rcu_segcblist *rsclp); void rcu_segcblist_disable(struct rcu_segcblist *rsclp); bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp); diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index cf0e886314f2..5dffade2d7cd 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1279,8 +1279,9 @@ void srcu_torture_stats_print(struct srcu_struct *ssp, char *tt, char *tf) c0 = l0 - u0; c1 = l1 - u1; - pr_cont(" %d(%ld,%ld %1p)", - cpu, c0, c1, rcu_segcblist_head(&sdp->srcu_cblist)); + pr_cont(" %d(%ld,%ld %c)", + cpu, c0, c1, + "C."[rcu_segcblist_empty(&sdp->srcu_cblist)]); s0 += c0; s1 += c1; } From 3545832fc22e2316d9c289f6ba825710a268bfa6 Mon Sep 17 00:00:00 2001 From: Byungchul Park Date: Mon, 1 Jul 2019 09:40:39 +0900 Subject: [PATCH 11/86] rcu: Change return type of rcu_spawn_one_boost_kthread() The return value of rcu_spawn_one_boost_kthread() is not used any longer. This commit therefore changes its return type from int to void, and removes the cast to void from its callers. Signed-off-by: Byungchul Park Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 3f1b5041de9b..307ae6ebb804 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1123,7 +1123,7 @@ static void rcu_preempt_boost_start_gp(struct rcu_node *rnp) * already exist. We only create this kthread for preemptible RCU. * Returns zero if all is well, a negated errno otherwise. 
*/ -static int rcu_spawn_one_boost_kthread(struct rcu_node *rnp) +static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp) { int rnp_index = rnp - rcu_get_root(); unsigned long flags; @@ -1131,25 +1131,27 @@ static int rcu_spawn_one_boost_kthread(struct rcu_node *rnp) struct task_struct *t; if (!IS_ENABLED(CONFIG_PREEMPT_RCU)) - return 0; + return; if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0) - return 0; + return; rcu_state.boost = 1; + if (rnp->boost_kthread_task != NULL) - return 0; + return; + t = kthread_create(rcu_boost_kthread, (void *)rnp, "rcub/%d", rnp_index); - if (IS_ERR(t)) - return PTR_ERR(t); + if (WARN_ON_ONCE(IS_ERR(t))) + return; + raw_spin_lock_irqsave_rcu_node(rnp, flags); rnp->boost_kthread_task = t; raw_spin_unlock_irqrestore_rcu_node(rnp, flags); sp.sched_priority = kthread_prio; sched_setscheduler_nocheck(t, SCHED_FIFO, &sp); wake_up_process(t); /* get to TASK_INTERRUPTIBLE quickly. */ - return 0; } /* @@ -1190,7 +1192,7 @@ static void __init rcu_spawn_boost_kthreads(void) struct rcu_node *rnp; rcu_for_each_leaf_node(rnp) - (void)rcu_spawn_one_boost_kthread(rnp); + rcu_spawn_one_boost_kthread(rnp); } static void rcu_prepare_kthreads(int cpu) @@ -1200,7 +1202,7 @@ static void rcu_prepare_kthreads(int cpu) /* Fire up the incoming CPU's kthread and leaf rcu_node kthread. */ if (rcu_scheduler_fully_active) - (void)rcu_spawn_one_boost_kthread(rnp); + rcu_spawn_one_boost_kthread(rnp); } #else /* #ifdef CONFIG_RCU_BOOST */ From 0500873de968df6fdef5752d7bbdca317ddc220b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 8 Jul 2019 08:01:50 -0700 Subject: [PATCH 12/86] doc: Add rcutree.kthread_prio pointer to stallwarn.txt This commit adds mention of the rcutree.kthread_prio kernel boot parameter to the discussion of how high-priority real-time tasks can result in RCU CPU stall warnings. (However, this does not necessarily help when the high-priority real-time tasks are using dubious deadlines.) Signed-off-by: Paul E. McKenney --- Documentation/RCU/stallwarn.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 13e88fc00f01..f48f4621ccbc 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -57,6 +57,12 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that CONFIG_PREEMPT_RCU case, you might see stall-warning messages. + You can use the rcutree.kthread_prio kernel boot parameter to + increase the scheduling priority of RCU's kthreads, which can + help avoid this problem. However, please note that doing this + can increase your system's context-switch rate and thus degrade + performance. + o A periodic interrupt whose handler takes longer than the time interval between successive pairs of interrupts. This can prevent RCU's kthreads and softirq handlers from running. From 0a5b99f57873e233ad42ef71e23c629f6ea1fcfe Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Thu, 11 Jul 2019 16:45:41 -0400 Subject: [PATCH 13/86] treewide: Rename rcu_dereference_raw_notrace() to _check() The rcu_dereference_raw_notrace() API name is confusing. It is equivalent to rcu_dereference_raw() except that it also does sparse pointer checking. There are only a few users of rcu_dereference_raw_notrace(). This patches renames all of them to be rcu_dereference_raw_check() with the "_check()" indicating sparse checking. Signed-off-by: Joel Fernandes (Google) [ paulmck: Fix checkpatch warnings about parentheses. ] Signed-off-by: Paul E. 
McKenney --- Documentation/RCU/Design/Requirements/Requirements.html | 2 +- arch/powerpc/include/asm/kvm_book3s_64.h | 2 +- include/linux/rculist.h | 6 +++--- include/linux/rcupdate.h | 2 +- kernel/trace/ftrace_internal.h | 8 ++++---- kernel/trace/trace.c | 4 ++-- 6 files changed, 12 insertions(+), 12 deletions(-) diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 5a9238a2883c..bdbc84f1b949 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2512,7 +2512,7 @@ disabled across the entire RCU read-side critical section.

    It is possible to use tracing on RCU code, but tracing itself uses RCU. -For this reason, rcu_dereference_raw_notrace() +For this reason, rcu_dereference_raw_check() is provided for use by tracing, which avoids the destructive recursion that could otherwise ensue. This API is also used by virtualization in some architectures, diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index bb7c8cc77f1a..04b2b927bb5a 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -535,7 +535,7 @@ static inline void note_hpte_modification(struct kvm *kvm, */ static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm) { - return rcu_dereference_raw_notrace(kvm->memslots[0]); + return rcu_dereference_raw_check(kvm->memslots[0]); } extern void kvmppc_mmu_debugfs_init(struct kvm *kvm); diff --git a/include/linux/rculist.h b/include/linux/rculist.h index e91ec9ddcd30..932296144131 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -622,7 +622,7 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n, * as long as the traversal is guarded by rcu_read_lock(). */ #define hlist_for_each_entry_rcu(pos, head, member) \ - for (pos = hlist_entry_safe (rcu_dereference_raw(hlist_first_rcu(head)),\ + for (pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\ typeof(*(pos)), member); \ pos; \ pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\ @@ -642,10 +642,10 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n, * not do any RCU debugging or tracing. */ #define hlist_for_each_entry_rcu_notrace(pos, head, member) \ - for (pos = hlist_entry_safe (rcu_dereference_raw_notrace(hlist_first_rcu(head)),\ + for (pos = hlist_entry_safe(rcu_dereference_raw_check(hlist_first_rcu(head)),\ typeof(*(pos)), member); \ pos; \ - pos = hlist_entry_safe(rcu_dereference_raw_notrace(hlist_next_rcu(\ + pos = hlist_entry_safe(rcu_dereference_raw_check(hlist_next_rcu(\ &(pos)->member)), typeof(*(pos)), member)) /** diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 8f7167478c1d..bfcafbc1e301 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -476,7 +476,7 @@ do { \ * The no-tracing version of rcu_dereference_raw() must not call * rcu_read_lock_held(). */ -#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu) +#define rcu_dereference_raw_check(p) __rcu_dereference_check((p), 1, __rcu) /** * rcu_dereference_protected() - fetch RCU pointer when updates prevented diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h index 0515a2096f90..0456e0a3dab1 100644 --- a/kernel/trace/ftrace_internal.h +++ b/kernel/trace/ftrace_internal.h @@ -6,22 +6,22 @@ /* * Traverse the ftrace_global_list, invoking all entries. The reason that we - * can use rcu_dereference_raw_notrace() is that elements removed from this list + * can use rcu_dereference_raw_check() is that elements removed from this list * are simply leaked, so there is no need to interact with a grace-period - * mechanism. The rcu_dereference_raw_notrace() calls are needed to handle + * mechanism. The rcu_dereference_raw_check() calls are needed to handle * concurrent insertions into the ftrace_global_list. * * Silly Alpha and silly pointer-speculation compiler optimizations! 
*/ #define do_for_each_ftrace_op(op, list) \ - op = rcu_dereference_raw_notrace(list); \ + op = rcu_dereference_raw_check(list); \ do /* * Optimized for just a single item in the list (as that is the normal case). */ #define while_for_each_ftrace_op(op) \ - while (likely(op = rcu_dereference_raw_notrace((op)->next)) && \ + while (likely(op = rcu_dereference_raw_check((op)->next)) && \ unlikely((op) != &ftrace_list_end)) extern struct ftrace_ops __rcu *ftrace_ops_list; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 525a97fbbc60..642474b26ba7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2642,10 +2642,10 @@ static void ftrace_exports(struct ring_buffer_event *event) preempt_disable_notrace(); - export = rcu_dereference_raw_notrace(ftrace_exports_list); + export = rcu_dereference_raw_check(ftrace_exports_list); while (export) { trace_process_export(export, event); - export = rcu_dereference_raw_notrace(export->next); + export = rcu_dereference_raw_check(export->next); } preempt_enable_notrace(); From 9147089bee3a6b504821dd8462e2be229e6dbfae Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:21 -0400 Subject: [PATCH 14/86] rcu: Remove redundant debug_locks check in rcu_read_lock_sched_held() The debug_locks flag can never be true at the end of rcu_read_lock_sched_held() because it is already checked by the earlier call todebug_lockdep_rcu_enabled(). This commit therefore removes this redundant check. Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 61df2bf08563..9dd5aeef6e70 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -93,17 +93,13 @@ module_param(rcu_normal_after_boot, int, 0); */ int rcu_read_lock_sched_held(void) { - int lockdep_opinion = 0; - if (!debug_lockdep_rcu_enabled()) return 1; if (!rcu_is_watching()) return 0; if (!rcu_lockdep_current_cpu_online()) return 0; - if (debug_locks) - lockdep_opinion = lock_is_held(&rcu_sched_lock_map); - return lockdep_opinion || !preemptible(); + return lock_is_held(&rcu_sched_lock_map) || !preemptible(); } EXPORT_SYMBOL(rcu_read_lock_sched_held); #endif From b3f3886c59f649ace424d132bd8c06e3611c71a8 Mon Sep 17 00:00:00 2001 From: Xiao Yang Date: Fri, 31 May 2019 23:15:45 +0800 Subject: [PATCH 15/86] rcuperf: Fix perf_type module-parameter description The rcu_bh rcuperf type was removed by commit 620d246065cd("rcuperf: Remove the "rcu_bh" and "sched" torture types"), but it lives on in the MODULE_PARM_DESC() of perf_type. This commit therefore changes that module-parameter description to substitute srcu for rcu_bh. Signed-off-by: Xiao Yang Signed-off-by: Paul E. McKenney --- kernel/rcu/rcuperf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 7a6890b23c5f..4513807cd4c4 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -89,7 +89,7 @@ torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable static char *perf_type = "rcu"; module_param(perf_type, charp, 0444); -MODULE_PARM_DESC(perf_type, "Type of RCU to performance-test (rcu, rcu_bh, ...)"); +MODULE_PARM_DESC(perf_type, "Type of RCU to performance-test (rcu, srcu, ...)"); static int nrealreaders; static int nrealwriters; From 2c667e5eae232f7f4a4fc30f58e51abdb0dc43c5 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Fri, 21 Jun 2019 10:32:57 -0700 Subject: [PATCH 16/86] torture: Expand last_ts variable in kvm-test-1-run.sh The kvm-test-1-run.sh script says 'test -z "last_ts"' which always evaluates to true (AKA zero) regardless of the value of the last_ts shell variable. This commit therefore inserts the needed dollar sign ("$"). Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 27b7b5693ede..33c669619736 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -227,7 +227,7 @@ then must_continue=yes fi last_ts="`tail $resdir/console.log | grep '^\[ *[0-9]\+\.[0-9]\+]' | tail -1 | sed -e 's/^\[ *//' -e 's/\..*$//'`" - if test -z "last_ts" + if test -z "$last_ts" then last_ts=0 fi From f4e8352928587ef8772df3d269a328efa609daaa Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 27 Jun 2019 14:05:54 -0700 Subject: [PATCH 17/86] rcutorture: Test TREE03 with the threadirqs kernel boot parameter Since commit 05f415715ce45 ("rcu: Speed up expedited GPs when interrupting RCU reader") in v5.0 and through v5.1, booting with the threadirqs kernel boot parameter caused self-deadlocks, which can be reproduced using the following command on an 8-CPU system: tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --configs "TREE03" --bootargs "threadirqs" This commit therefore adds the threadirqs kernel boot parameter to the TREE03 rcutorture scenario in order to more quickly detect future similar bugs. Link: http://lkml.kernel.org/r/20190626135447.y24mvfuid5fifwjc@linutronix.de Signed-off-by: Paul E. McKenney Cc: Sebastian Andrzej Siewior Cc: Joel Fernandes --- tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot index 5c3213cc3ad7..1c218944b1e9 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot @@ -3,3 +3,4 @@ rcutree.gp_preinit_delay=12 rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 +threadirqs From bd1bfc51a36f334270b886db6d8467e55fe294ca Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 22 Jun 2019 14:35:59 -0700 Subject: [PATCH 18/86] rcutorture: Emulate userspace sojourn during call_rcu() floods During an actual call_rcu() flood, there would be frequent trips to userspace (in-kernel call_rcu() floods must be otherwise housebroken). Userspace execution allows a great many things to interrupt execution, and rcutorture needs to also allow such interruptions. This commit therefore causes call_rcu() floods to occasionally invoke schedule(), thus preventing spurious rcutorture failures due to other parts of the kernel becoming irate at the call_rcu() flood events. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index fce4e7e6f502..c44e5307afcc 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1713,12 +1713,14 @@ static void rcu_torture_fwd_cb_cr(struct rcu_head *rhp) } // Give the scheduler a chance, even on nohz_full CPUs. 
-static void rcu_torture_fwd_prog_cond_resched(void) +static void rcu_torture_fwd_prog_cond_resched(unsigned long iter) { if (IS_ENABLED(CONFIG_PREEMPT) && IS_ENABLED(CONFIG_NO_HZ_FULL)) { - if (need_resched()) + // Real call_rcu() floods hit userspace, so emulate that. + if (need_resched() || (iter & 0xfff)) schedule(); } else { + // No userspace emulation: CB invocation throttles call_rcu() cond_resched(); } } @@ -1746,7 +1748,7 @@ static unsigned long rcu_torture_fwd_prog_cbfree(void) spin_unlock_irqrestore(&rcu_fwd_lock, flags); kfree(rfcp); freed++; - rcu_torture_fwd_prog_cond_resched(); + rcu_torture_fwd_prog_cond_resched(freed); } return freed; } @@ -1790,7 +1792,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) udelay(10); cur_ops->readunlock(idx); if (!fwd_progress_need_resched || need_resched()) - rcu_torture_fwd_prog_cond_resched(); + rcu_torture_fwd_prog_cond_resched(1); } (*tested_tries)++; if (!time_before(jiffies, stopat) && @@ -1875,7 +1877,7 @@ static void rcu_torture_fwd_prog_cr(void) rfcp->rfc_gps = 0; } cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr); - rcu_torture_fwd_prog_cond_resched(); + rcu_torture_fwd_prog_cond_resched(n_launders + n_max_cbs); } stoppedat = jiffies; n_launders_cb_snap = READ_ONCE(n_launders_cb); From 21f57546ceaf4c5537a617f55b809a843b109210 Mon Sep 17 00:00:00 2001 From: Denis Efremov Date: Thu, 4 Jul 2019 15:57:19 +0300 Subject: [PATCH 19/86] torture: Remove exporting of internal functions The functions torture_onoff_cleanup() and torture_shuffle_cleanup() are declared static and marked EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because these functions are not used outside of the kernel/torture.c file they are defined in, this commit removes their EXPORT_SYMBOL_GPL() marking. Fixes: cc47ae083026 ("rcutorture: Abstract torture-test cleanup") Signed-off-by: Denis Efremov Signed-off-by: Paul E. McKenney --- kernel/torture.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/torture.c b/kernel/torture.c index a8d9bdfba7c3..7c13f5558b71 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -263,7 +263,6 @@ static void torture_onoff_cleanup(void) onoff_task = NULL; #endif /* #ifdef CONFIG_HOTPLUG_CPU */ } -EXPORT_SYMBOL_GPL(torture_onoff_cleanup); /* * Print online/offline testing statistics. @@ -449,7 +448,6 @@ static void torture_shuffle_cleanup(void) } shuffler_task = NULL; } -EXPORT_SYMBOL_GPL(torture_shuffle_cleanup); /* * Variables for auto-shutdown. This allows "lights out" torture runs From 77e9752ce69f36f1be4e366373727fb7921f5909 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Thu, 4 Jul 2019 00:34:30 -0400 Subject: [PATCH 20/86] rcuperf: Make rcuperf kernel test more robust for !expedited mode It is possible that the rcuperf kernel test runs concurrently with init starting up. During this time, the system is running all grace periods as expedited. However, rcuperf can also be run for normal GP tests. Right now, it depends on a holdoff time before starting the test to ensure grace periods start later. This works fine with the default holdoff time however it is not robust in situations where init takes greater than the holdoff time to finish running. Or, as in my case: I modified the rcuperf test locally to also run a thread that did preempt disable/enable in a loop. This had the effect of slowing down init. The end result was that the "batches:" counter in rcuperf was 0 causing a division by 0 error in the results. 
This counter was 0 because only expedited GPs seem to happen, not normal ones which led to the rcu_state.gp_seq counter remaining constant across grace periods which unexpectedly happen to be expedited. The system was running expedited RCU all the time because rcu_unexpedited_gp() would not have run yet from init. In other words, the test would concurrently with init booting in expedited GP mode. To fix this properly, this commit waits until system_state is set to SYSTEM_RUNNING before starting the test. This change is made just before kernel_init() invokes rcu_end_inkernel_boot(), and this latter is what turns off boot-time expediting of RCU grace periods. Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/rcuperf.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 4513807cd4c4..5a879d073c1c 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -375,6 +375,14 @@ rcu_perf_writer(void *arg) if (holdoff) schedule_timeout_uninterruptible(holdoff * HZ); + /* + * Wait until rcu_end_inkernel_boot() is called for normal GP tests + * so that RCU is not always expedited for normal GP tests. + * The system_state test is approximate, but works well in practice. + */ + while (!gp_exp && system_state != SYSTEM_RUNNING) + schedule_timeout_uninterruptible(1); + t = ktime_get_mono_fast_ns(); if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) { t_rcu_perf_writer_started = t; From 60013d5d2b4031e6027005e5e2dcb6ee6da6b186 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 10 Jul 2019 08:30:00 -0700 Subject: [PATCH 21/86] rcutorture: Aggressive forward-progress tests shouldn't block shutdown The more aggressive forward-progress tests can interfere with rcutorture shutdown, resulting in false-positive diagnostics. This commit therefore ends any such tests 30 seconds prior to shutdown. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index c44e5307afcc..b22947324423 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -161,6 +161,7 @@ static atomic_long_t n_rcu_torture_timers; static long n_barrier_attempts; static long n_barrier_successes; /* did rcu_barrier test succeed? */ static struct list_head rcu_torture_removed; +static unsigned long shutdown_jiffies; static int rcu_torture_writer_state; #define RTWS_FIXED_DELAY 0 @@ -228,6 +229,15 @@ static u64 notrace rcu_trace_clock_local(void) } #endif /* #else #ifdef CONFIG_RCU_TRACE */ +/* + * Stop aggressive CPU-hog tests a bit before the end of the test in order + * to avoid interfering with test shutdown. + */ +static bool shutdown_time_arrived(void) +{ + return shutdown_secs && time_after(jiffies, shutdown_jiffies - 30 * HZ); +} + static unsigned long boost_starttime; /* jiffies of next boost test start. */ static DEFINE_MUTEX(boost_mutex); /* protect setting boost_starttime */ /* and boost task create/destroy. 
*/ @@ -1787,6 +1797,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) WRITE_ONCE(rcu_fwd_startat, jiffies); stopat = rcu_fwd_startat + dur; while (time_before(jiffies, stopat) && + !shutdown_time_arrived() && !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { idx = cur_ops->readlock(); udelay(10); @@ -1796,6 +1807,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries) } (*tested_tries)++; if (!time_before(jiffies, stopat) && + !shutdown_time_arrived() && !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { (*tested)++; cver = READ_ONCE(rcu_torture_current_version) - cver; @@ -1854,6 +1866,7 @@ static void rcu_torture_fwd_prog_cr(void) gps = cur_ops->get_gp_seq(); rcu_launder_gp_seq_start = gps; while (time_before(jiffies, stopat) && + !shutdown_time_arrived() && !READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) { rfcp = READ_ONCE(rcu_fwd_cb_head); rfcpn = NULL; @@ -1886,7 +1899,8 @@ static void rcu_torture_fwd_prog_cr(void) cur_ops->cb_barrier(); /* Wait for callbacks to be invoked. */ (void)rcu_torture_fwd_prog_cbfree(); - if (!torture_must_stop() && !READ_ONCE(rcu_fwd_emergency_stop)) { + if (!torture_must_stop() && !READ_ONCE(rcu_fwd_emergency_stop) && + !shutdown_time_arrived()) { WARN_ON(n_max_gps < MIN_FWD_CBS_LAUNDERED); pr_alert("%s Duration %lu barrier: %lu pending %ld n_launders: %ld n_launders_sa: %ld n_max_gps: %ld n_max_cbs: %ld cver %ld gps %ld\n", __func__, @@ -2467,6 +2481,7 @@ rcu_torture_init(void) goto unwind; rcutor_hp = firsterr; } + shutdown_jiffies = jiffies + shutdown_secs * HZ; firsterr = torture_shutdown_init(shutdown_secs, rcu_torture_cleanup); if (firsterr) goto unwind; From 6240973e5661a83df24e35a9a9c2013496931e2b Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Mon, 29 Jul 2019 08:36:05 -0400 Subject: [PATCH 22/86] tools/memory-model: Use cumul-fence instead of fence in ->prop example To reduce ambiguity in the more exotic ->prop ordering example, this commit uses the term cumul-fence instead of the term fence for the two fences, so that the implict ->rfe on loads/stores to Y are covered by the description. Link: https://lore.kernel.org/lkml/20190729121745.GA140682@google.com Suggested-by: Alan Stern Signed-off-by: Joel Fernandes (Google) Acked-by: Alan Stern Signed-off-by: Paul E. McKenney --- tools/memory-model/Documentation/explanation.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt index 68caa9a976d0..634dc6db26c4 100644 --- a/tools/memory-model/Documentation/explanation.txt +++ b/tools/memory-model/Documentation/explanation.txt @@ -1302,7 +1302,7 @@ followed by an arbitrary number of cumul-fence links, ending with an rfe link. You can concoct more exotic examples, containing more than one fence, although this quickly leads to diminishing returns in terms of complexity. For instance, here's an example containing a coe link -followed by two fences and an rfe link, utilizing the fact that +followed by two cumul-fences and an rfe link, utilizing the fact that release fences are A-cumulative: int x, y, z; @@ -1334,10 +1334,10 @@ If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop link from P0's store to its load. 
This is because P0's store gets overwritten by P1's store since x = 2 at the end (a coe link), the smp_wmb() ensures that P1's store to x propagates to P2 before the -store to y does (the first fence), the store to y propagates to P2 +store to y does (the first cumul-fence), the store to y propagates to P2 before P2's load and store execute, P2's smp_store_release() guarantees that the stores to x and y both propagate to P0 before the -store to z does (the second fence), and P0's load executes after the +store to z does (the second cumul-fence), and P0's load executes after the store to z has propagated to P0 (an rfe link). In summary, the fact that the hb relation links memory access events From 6738ff85c3ee8073d5b030cb26241d0009d4ce29 Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Sat, 29 Jun 2019 23:10:44 +0200 Subject: [PATCH 23/86] tools/memory-model: Update the informal documentation The formal memory consistency model has added support for plain accesses (and data races). While updating the informal documentation to describe this addition to the model is highly desirable and important future work, update the informal documentation to at least acknowledge such addition. Signed-off-by: Andrea Parri Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Nicholas Piggin Cc: David Howells Cc: Jade Alglave Cc: Luc Maranget Cc: "Paul E. McKenney" Cc: Akira Yokosawa Cc: Daniel Lustig Signed-off-by: Paul E. McKenney Acked-by: Alan Stern --- .../Documentation/explanation.txt | 43 ++++++++----------- tools/memory-model/README | 16 +++---- 2 files changed, 27 insertions(+), 32 deletions(-) diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt index 634dc6db26c4..488f11f6c588 100644 --- a/tools/memory-model/Documentation/explanation.txt +++ b/tools/memory-model/Documentation/explanation.txt @@ -42,7 +42,8 @@ linux-kernel.bell and linux-kernel.cat files that make up the formal version of the model; they are extremely terse and their meanings are far from clear. -This document describes the ideas underlying the LKMM. It is meant +This document describes the ideas underlying the LKMM, but excluding +the modeling of bare C (or plain) shared memory accesses. It is meant for people who want to understand how the model was designed. It does not go into the details of the code in the .bell and .cat files; rather, it explains in English what the code expresses symbolically. @@ -354,31 +355,25 @@ be extremely complex. Optimizing compilers have great freedom in the way they translate source code to object code. They are allowed to apply transformations that add memory accesses, eliminate accesses, combine them, split them -into pieces, or move them around. Faced with all these possibilities, -the LKMM basically gives up. It insists that the code it analyzes -must contain no ordinary accesses to shared memory; all accesses must -be performed using READ_ONCE(), WRITE_ONCE(), or one of the other -atomic or synchronization primitives. These primitives prevent a -large number of compiler optimizations. In particular, it is -guaranteed that the compiler will not remove such accesses from the -generated code (unless it can prove the accesses will never be -executed), it will not change the order in which they occur in the -code (within limits imposed by the C standard), and it will not -introduce extraneous accesses. +into pieces, or move them around. 
The use of READ_ONCE(), WRITE_ONCE(), +or one of the other atomic or synchronization primitives prevents a +large number of compiler optimizations. In particular, it is guaranteed +that the compiler will not remove such accesses from the generated code +(unless it can prove the accesses will never be executed), it will not +change the order in which they occur in the code (within limits imposed +by the C standard), and it will not introduce extraneous accesses. -This explains why the MP and SB examples above used READ_ONCE() and -WRITE_ONCE() rather than ordinary memory accesses. Thanks to this -usage, we can be certain that in the MP example, P0's write event to -buf really is po-before its write event to flag, and similarly for the -other shared memory accesses in the examples. +The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather +than ordinary memory accesses. Thanks to this usage, we can be certain +that in the MP example, the compiler won't reorder P0's write event to +buf and P0's write event to flag, and similarly for the other shared +memory accesses in the examples. -Private variables are not subject to this restriction. Since they are -not shared between CPUs, they can be accessed normally without -READ_ONCE() or WRITE_ONCE(), and there will be no ill effects. In -fact, they need not even be stored in normal memory at all -- in -principle a private variable could be stored in a CPU register (hence -the convention that these variables have names starting with the -letter 'r'). +Since private variables are not shared between CPUs, they can be +accessed normally without READ_ONCE() or WRITE_ONCE(). In fact, they +need not even be stored in normal memory at all -- in principle a +private variable could be stored in a CPU register (hence the convention +that these variables have names starting with the letter 'r'). A WARNING diff --git a/tools/memory-model/README b/tools/memory-model/README index 2b87f3971548..fc07b52f2028 100644 --- a/tools/memory-model/README +++ b/tools/memory-model/README @@ -167,15 +167,15 @@ scripts Various scripts, see scripts/README. LIMITATIONS =========== -The Linux-kernel memory model has the following limitations: +The Linux-kernel memory model (LKMM) has the following limitations: -1. Compiler optimizations are not modeled. Of course, the use - of READ_ONCE() and WRITE_ONCE() limits the compiler's ability - to optimize, but there is Linux-kernel code that uses bare C - memory accesses. Handling this code is on the to-do list. - For more information, see Documentation/explanation.txt (in - particular, the "THE PROGRAM ORDER RELATION: po AND po-loc" - and "A WARNING" sections). +1. Compiler optimizations are not accurately modeled. Of course, + the use of READ_ONCE() and WRITE_ONCE() limits the compiler's + ability to optimize, but under some circumstances it is possible + for the compiler to undermine the memory model. For more + information, see Documentation/explanation.txt (in particular, + the "THE PROGRAM ORDER RELATION: po AND po-loc" and "A WARNING" + sections). Note that this limitation in turn limits LKMM's ability to accurately model address, control, and data dependencies. From 28875945ba98d1b47a8a706812b6494d165bb0a0 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:22 -0400 Subject: [PATCH 24/86] rcu: Add support for consolidated-RCU reader checking This commit adds RCU-reader checks to list_for_each_entry_rcu() and hlist_for_each_entry_rcu(). 
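For example, once the optional lockdep argument described below is available, a traversal protected by a mutex rather than by rcu_read_lock() can pass that mutex's lockdep condition. The following is a rough sketch using invented names (my_lock, my_list, struct my_node), not code from this series:

	/* Hypothetical example of the new optional argument; names are made up. */
	static DEFINE_MUTEX(my_lock);
	static LIST_HEAD(my_list);

	struct my_node {
		struct list_head list;
		int data;
	};

	static void my_walk_locked(void)
	{
		struct my_node *n;

		lockdep_assert_held(&my_lock);
		/* Updaters are excluded by my_lock, so no rcu_read_lock() is needed. */
		list_for_each_entry_rcu(n, &my_list, list,
					lockdep_is_held(&my_lock))
			pr_info("data=%d\n", n->data);
	}
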
These checks are optional, and are indicated by a lockdep expression passed to a new optional argument to these two macros. If this optional lockdep expression is omitted, these two macros act as before, checking for an RCU read-side critical section. Signed-off-by: Joel Fernandes (Google) [ paulmck: Update to eliminate return within macro and update comment. ] Signed-off-by: Paul E. McKenney --- include/linux/rculist.h | 32 +++++++++++--- include/linux/rcupdate.h | 7 +++ kernel/rcu/Kconfig.debug | 11 +++++ kernel/rcu/update.c | 96 ++++++++++++++++++++++++++-------------- 4 files changed, 108 insertions(+), 38 deletions(-) diff --git a/include/linux/rculist.h b/include/linux/rculist.h index 932296144131..4158b7212936 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -40,6 +40,24 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list) */ #define list_next_rcu(list) (*((struct list_head __rcu **)(&(list)->next))) +/* + * Check during list traversal that we are within an RCU reader + */ + +#define check_arg_count_one(dummy) + +#ifdef CONFIG_PROVE_RCU_LIST +#define __list_check_rcu(dummy, cond, extra...) \ + ({ \ + check_arg_count_one(extra); \ + RCU_LOCKDEP_WARN(!cond && !rcu_read_lock_any_held(), \ + "RCU-list traversed in non-reader section!"); \ + }) +#else +#define __list_check_rcu(dummy, cond, extra...) \ + ({ check_arg_count_one(extra); }) +#endif + /* * Insert a new entry between two known consecutive entries. * @@ -343,14 +361,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the list_head within the struct. + * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as list_add_rcu() * as long as the traversal is guarded by rcu_read_lock(). */ -#define list_for_each_entry_rcu(pos, head, member) \ - for (pos = list_entry_rcu((head)->next, typeof(*pos), member); \ - &pos->member != (head); \ +#define list_for_each_entry_rcu(pos, head, member, cond...) \ + for (__list_check_rcu(dummy, ## cond, 0), \ + pos = list_entry_rcu((head)->next, typeof(*pos), member); \ + &pos->member != (head); \ pos = list_entry_rcu(pos->member.next, typeof(*pos), member)) /** @@ -616,13 +636,15 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n, * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the hlist_node within the struct. + * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as hlist_add_head_rcu() * as long as the traversal is guarded by rcu_read_lock(). */ -#define hlist_for_each_entry_rcu(pos, head, member) \ - for (pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\ +#define hlist_for_each_entry_rcu(pos, head, member, cond...) 
\ + for (__list_check_rcu(dummy, ## cond, 0), \ + pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\ typeof(*(pos)), member); \ pos; \ pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\ diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index bfcafbc1e301..80d6056f5855 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -221,6 +221,7 @@ int debug_lockdep_rcu_enabled(void); int rcu_read_lock_held(void); int rcu_read_lock_bh_held(void); int rcu_read_lock_sched_held(void); +int rcu_read_lock_any_held(void); #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ @@ -241,6 +242,12 @@ static inline int rcu_read_lock_sched_held(void) { return !preemptible(); } + +static inline int rcu_read_lock_any_held(void) +{ + return !preemptible(); +} + #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ #ifdef CONFIG_PROVE_RCU diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 5ec3ea4028e2..4aa02eee8f6c 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -8,6 +8,17 @@ menu "RCU Debugging" config PROVE_RCU def_bool PROVE_LOCKING +config PROVE_RCU_LIST + bool "RCU list lockdep debugging" + depends on PROVE_RCU && RCU_EXPERT + default n + help + Enable RCU lockdep checking for list usages. By default it is + turned off since there are several list RCU users that still + need to be converted to pass a lockdep expression. To prevent + false-positive splats, we keep it default disabled but once all + users are converted, we can remove this config option. + config TORTURE_TEST tristate default n diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 9dd5aeef6e70..38cbd616b381 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -61,9 +61,15 @@ module_param(rcu_normal_after_boot, int, 0); #ifdef CONFIG_DEBUG_LOCK_ALLOC /** - * rcu_read_lock_sched_held() - might we be in RCU-sched read-side critical section? + * rcu_read_lock_held_common() - might we be in RCU-sched read-side critical section? + * @ret: Best guess answer if lockdep cannot be relied on * - * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an + * Returns true if lockdep must be ignored, in which case *ret contains + * the best guess described below. Otherwise returns false, in which + * case *ret tells the caller nothing and the caller should instead + * consult lockdep. + * + * If CONFIG_DEBUG_LOCK_ALLOC is selected, set *ret to nonzero iff in an * RCU-sched read-side critical section. In absence of * CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU-sched read-side * critical section unless it can prove otherwise. Note that disabling @@ -75,30 +81,44 @@ module_param(rcu_normal_after_boot, int, 0); * Check debug_lockdep_rcu_enabled() to prevent false positives during boot * and while lockdep is disabled. * - * Note that if the CPU is in the idle loop from an RCU point of - * view (ie: that we are in the section between rcu_idle_enter() and - * rcu_idle_exit()) then rcu_read_lock_held() returns false even if the CPU - * did an rcu_read_lock(). The reason for this is that RCU ignores CPUs - * that are in such a section, considering these as in extended quiescent - * state, so such a CPU is effectively never in an RCU read-side critical - * section regardless of what RCU primitives it invokes. This state of - * affairs is required --- we need to keep an RCU-free window in idle - * where the CPU may possibly enter into low power mode. This way we can - * notice an extended quiescent state to other CPUs that started a grace - * period. 
Otherwise we would delay any grace period as long as we run in - * the idle task. + * Note that if the CPU is in the idle loop from an RCU point of view (ie: + * that we are in the section between rcu_idle_enter() and rcu_idle_exit()) + * then rcu_read_lock_held() sets *ret to false even if the CPU did an + * rcu_read_lock(). The reason for this is that RCU ignores CPUs that are + * in such a section, considering these as in extended quiescent state, + * so such a CPU is effectively never in an RCU read-side critical section + * regardless of what RCU primitives it invokes. This state of affairs is + * required --- we need to keep an RCU-free window in idle where the CPU may + * possibly enter into low power mode. This way we can notice an extended + * quiescent state to other CPUs that started a grace period. Otherwise + * we would delay any grace period as long as we run in the idle task. * - * Similarly, we avoid claiming an SRCU read lock held if the current + * Similarly, we avoid claiming an RCU read lock held if the current * CPU is offline. */ +static bool rcu_read_lock_held_common(bool *ret) +{ + if (!debug_lockdep_rcu_enabled()) { + *ret = 1; + return true; + } + if (!rcu_is_watching()) { + *ret = 0; + return true; + } + if (!rcu_lockdep_current_cpu_online()) { + *ret = 0; + return true; + } + return false; +} + int rcu_read_lock_sched_held(void) { - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; return lock_is_held(&rcu_sched_lock_map) || !preemptible(); } EXPORT_SYMBOL(rcu_read_lock_sched_held); @@ -257,12 +277,10 @@ NOKPROBE_SYMBOL(debug_lockdep_rcu_enabled); */ int rcu_read_lock_held(void) { - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; return lock_is_held(&rcu_lock_map); } EXPORT_SYMBOL_GPL(rcu_read_lock_held); @@ -284,16 +302,28 @@ EXPORT_SYMBOL_GPL(rcu_read_lock_held); */ int rcu_read_lock_bh_held(void) { - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; return in_softirq() || irqs_disabled(); } EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held); +int rcu_read_lock_any_held(void) +{ + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; + if (lock_is_held(&rcu_lock_map) || + lock_is_held(&rcu_bh_lock_map) || + lock_is_held(&rcu_sched_lock_map)) + return 1; + return !preemptible(); +} +EXPORT_SYMBOL_GPL(rcu_read_lock_any_held); + #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ /** From fbab8d6735e2643365040bd9e1057addc0d9b4cf Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:23 -0400 Subject: [PATCH 25/86] rcu/sync: Remove custom check for RCU readers The rcu/sync code currently does a special check for being in an RCU read-side critical section. With RCU consolidating flavors and the generic helper added earlier in this series, this check is no longer need. This commit switches to the generic helper, saving a couple of lines of code. Cc: Oleg Nesterov Acked-by: Oleg Nesterov Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. 
McKenney --- include/linux/rcu_sync.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h index 9b83865d24f9..0027d4c8087c 100644 --- a/include/linux/rcu_sync.h +++ b/include/linux/rcu_sync.h @@ -31,9 +31,7 @@ struct rcu_sync { */ static inline bool rcu_sync_is_idle(struct rcu_sync *rsp) { - RCU_LOCKDEP_WARN(!rcu_read_lock_held() && - !rcu_read_lock_bh_held() && - !rcu_read_lock_sched_held(), + RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(), "suspicious rcu_sync_is_idle() usage"); return !READ_ONCE(rsp->gp_state); /* GP_IDLE */ } From 7fd69b0ba48a2b2d8e5b4f0945b28d3839a7705a Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:24 -0400 Subject: [PATCH 26/86] ipv4: Add lockdep condition to fix for_each_entry() This commit applies the consolidated list_for_each_entry_rcu() support for lockdep conditions. Acked-by: David S. Miller Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- net/ipv4/fib_frontend.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index e8bc939b56dd..dde77f72e03e 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -124,7 +124,8 @@ struct fib_table *fib_get_table(struct net *net, u32 id) h = id & (FIB_TABLE_HASHSZ - 1); head = &net->ipv4.fib_table_hash[h]; - hlist_for_each_entry_rcu(tb, head, tb_hlist) { + hlist_for_each_entry_rcu(tb, head, tb_hlist, + lockdep_rtnl_is_held()) { if (tb->tb_id == id) return tb; } From e78a7614f3876ac649b3df608789cb6ef74d0480 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 5 Jun 2019 07:46:43 -0700 Subject: [PATCH 27/86] idle: Prevent late-arriving interrupts from disrupting offline Scheduling-clock interrupts can arrive late in the CPU-offline process, after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaints when the interrupt handler uses RCU: ------------------------------------------------------------------------ ============================= WARNING: suspicious RCU usage 5.2.0-rc1+ #681 Not tainted ----------------------------- kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/5/0. stack backtrace: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x5e/0x8b trigger_load_balance+0xa8/0x390 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x3b/0x50 tick_sched_handle+0x2f/0x40 tick_sched_timer+0x32/0x70 __hrtimer_run_queues+0xd3/0x3b0 hrtimer_interrupt+0x11d/0x270 ? 
sched_clock_local+0xc/0x74 smp_apic_timer_interrupt+0x79/0x200 apic_timer_interrupt+0xf/0x20 RIP: 0010:delay_tsc+0x22/0x50 Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29 RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64 RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15 RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000 cpuhp_report_idle_dead+0x31/0x60 do_idle+0x1d5/0x200 ? _raw_spin_unlock_irqrestore+0x2d/0x40 cpu_startup_entry+0x14/0x20 start_secondary+0x151/0x170 secondary_startup_64+0xa4/0xb0 ------------------------------------------------------------------------ This happens rarely, but can be forced by happen more often by placing delays in cpuhp_report_idle_dead() following the call to rcu_report_dead(). With this in place, the following rcutorture scenario reproduces the problem within a few minutes: tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04" This commit uses the crude but effective expedient of moving the disabling of interrupts within the idle loop to precede the cpu_is_offline() check. It also invokes tick_nohz_idle_stop_tick() instead of tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock interrupt. Signed-off-by: Peter Zijlstra Cc: Frederic Weisbecker Cc: Thomas Gleixner Cc: Ingo Molnar [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ] Signed-off-by: Paul E. McKenney --- kernel/sched/idle.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 80940939b733..e4bc4aa739b8 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -241,13 +241,14 @@ static void do_idle(void) check_pgt_cache(); rmb(); + local_irq_disable(); + if (cpu_is_offline(cpu)) { - tick_nohz_idle_stop_tick_protected(); + tick_nohz_idle_stop_tick(); cpuhp_report_idle_dead(); arch_cpu_idle_dead(); } - local_irq_disable(); arch_cpu_idle_enter(); /* From b823cafa7501f946a37dce5aa1e576a0b2f31ed9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 5 Jul 2019 08:05:10 -0700 Subject: [PATCH 28/86] rcu: Remove redundant "if" condition from rcu_gp_is_expedited() Because rcu_expedited_nesting is initialized to 1 and not decremented until just before init is spawned, rcu_expedited_nesting is guaranteed to be non-zero whenever rcu_scheduler_active == RCU_SCHEDULER_INIT. This commit therefore removes this redundant "if" equality test. Signed-off-by: Paul E. 
McKenney Reviewed-by: Joel Fernandes (Google) --- kernel/rcu/update.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 249517058b13..64e9cc8609e7 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -136,8 +136,7 @@ static atomic_t rcu_expedited_nesting = ATOMIC_INIT(1); */ bool rcu_gp_is_expedited(void) { - return rcu_expedited || atomic_read(&rcu_expedited_nesting) || - rcu_scheduler_active == RCU_SCHEDULER_INIT; + return rcu_expedited || atomic_read(&rcu_expedited_nesting); } EXPORT_SYMBOL_GPL(rcu_gp_is_expedited); From 1d5087ab964d84e5a0cfe5059cf5e929127d573f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 12 May 2015 14:50:06 -0700 Subject: [PATCH 29/86] arm: Use common outgoing-CPU-notification code This commit replaces the open-coded CPU-offline notification with new common code. In particular, this change avoids calling scheduler code using RCU from an offline CPU that RCU is ignoring. This is a minimal change. A more intrusive change might invoke the cpu_check_up_prepare() and cpu_set_state_online() functions at CPU-online time, which would allow onlining to throw an error if the CPU did not go offline properly. Signed-off-by: Paul E. McKenney Cc: linux-arm-kernel@lists.infradead.org Cc: Russell King Cc: Mark Rutland Cc: Dietmar Eggemann --- arch/arm/kernel/smp.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c index aab8ba40ce38..4b0bab2607e4 100644 --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -264,15 +264,13 @@ int __cpu_disable(void) return 0; } -static DECLARE_COMPLETION(cpu_died); - /* * called on the thread which is asking for a CPU to be shutdown - * waits until shutdown has completed, or it is timed out. */ void __cpu_die(unsigned int cpu) { - if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) { + if (!cpu_wait_death(cpu, 5)) { pr_err("CPU%u: cpu didn't die\n", cpu); return; } @@ -319,7 +317,7 @@ void arch_cpu_idle_dead(void) * this returns, power and/or clocks can be removed at any point * from this CPU and its cache by platform_cpu_kill(). */ - complete(&cpu_died); + (void)cpu_report_death(); /* * Ensure that the cache lines associated with that completion are From 511b44f7598ce602f9efce687ca9eec013967d9b Mon Sep 17 00:00:00 2001 From: Mukesh Ojha Date: Mon, 29 Jul 2019 13:25:57 +0530 Subject: [PATCH 30/86] rcu: Fix spelling mistake "greate"->"great" This commit fixes a spelling mistake in file tree_exp.h. Signed-off-by: Mukesh Ojha Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 513b403b683b..d632cd019597 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -781,7 +781,7 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp) * other hand, if the CPU is not in an RCU read-side critical section, * the IPI handler reports the quiescent state immediately. * - * Although this is a greate improvement over previous expedited + * Although this is a great improvement over previous expedited * implementations, it is still unfriendly to real-time workloads, so is * thus not recommended for any sort of common-case code.
In fact, if * you are using synchronize_rcu_expedited() in a loop, please restructure From ba31ebfa7b749906e0befcc1e0c0db5e7463d55e Mon Sep 17 00:00:00 2001 From: Andrea Parri Date: Mon, 5 Aug 2019 14:15:17 +0200 Subject: [PATCH 31/86] MAINTAINERS: Update e-mail address for Andrea Parri My @amarulasolutions.com address stopped working this July, so update to my @gmail.com address where you'll still be able to reach me. Signed-off-by: Andrea Parri Cc: Alan Stern Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Nicholas Piggin Cc: David Howells Cc: Jade Alglave Cc: Luc Maranget Cc: "Paul E. McKenney" Cc: Akira Yokosawa Cc: Daniel Lustig Signed-off-by: Paul E. McKenney --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 6426db5198f0..527317026492 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9326,7 +9326,7 @@ F: drivers/misc/lkdtm/* LINUX KERNEL MEMORY CONSISTENCY MODEL (LKMM) M: Alan Stern -M: Andrea Parri +M: Andrea Parri M: Will Deacon M: Peter Zijlstra M: Boqun Feng From c2fa1e1bfa5b74558854a70b8afd797d43eb2743 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:25 -0400 Subject: [PATCH 32/86] driver/core: Convert to use built-in RCU list checking This commit applies the consolidated hlist_for_each_entry_rcu() support for lockdep conditions. Acked-by: Greg Kroah-Hartman Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- drivers/base/base.h | 1 + drivers/base/core.c | 12 ++++++++++++ drivers/base/power/runtime.c | 15 ++++++++++----- 3 files changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/base/base.h b/drivers/base/base.h index b405436ee28e..0d32544b6f91 100644 --- a/drivers/base/base.h +++ b/drivers/base/base.h @@ -165,6 +165,7 @@ static inline int devtmpfs_init(void) { return 0; } /* Device links support */ extern int device_links_read_lock(void); extern void device_links_read_unlock(int idx); +extern int device_links_read_lock_held(void); extern int device_links_check_suppliers(struct device *dev); extern void device_links_driver_bound(struct device *dev); extern void device_links_driver_cleanup(struct device *dev); diff --git a/drivers/base/core.c b/drivers/base/core.c index 636058bbf48a..eede79630ceb 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -68,6 +68,11 @@ void device_links_read_unlock(int idx) { srcu_read_unlock(&device_links_srcu, idx); } + +int device_links_read_lock_held(void) +{ + return srcu_read_lock_held(&device_links_srcu); +} #else /* !CONFIG_SRCU */ static DECLARE_RWSEM(device_links_lock); @@ -91,6 +96,13 @@ void device_links_read_unlock(int not_used) { up_read(&device_links_lock); } + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +int device_links_read_lock_held(void) +{ + return lockdep_is_held(&device_links_lock); +} +#endif #endif /* !CONFIG_SRCU */ /** diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c index b75335508d2c..50def99df970 100644 --- a/drivers/base/power/runtime.c +++ b/drivers/base/power/runtime.c @@ -287,7 +287,8 @@ static int rpm_get_suppliers(struct device *dev) { struct device_link *link; - list_for_each_entry_rcu(link, &dev->links.suppliers, c_node) { + list_for_each_entry_rcu(link, &dev->links.suppliers, c_node, + device_links_read_lock_held()) { int retval; if (!(link->flags & DL_FLAG_PM_RUNTIME) || @@ -309,7 +310,8 @@ static void rpm_put_suppliers(struct device *dev) { struct device_link *link; - list_for_each_entry_rcu(link, &dev->links.suppliers, c_node) { + 
list_for_each_entry_rcu(link, &dev->links.suppliers, c_node, + device_links_read_lock_held()) { if (READ_ONCE(link->status) == DL_STATE_SUPPLIER_UNBIND) continue; @@ -1640,7 +1642,8 @@ void pm_runtime_clean_up_links(struct device *dev) idx = device_links_read_lock(); - list_for_each_entry_rcu(link, &dev->links.consumers, s_node) { + list_for_each_entry_rcu(link, &dev->links.consumers, s_node, + device_links_read_lock_held()) { if (link->flags & DL_FLAG_STATELESS) continue; @@ -1662,7 +1665,8 @@ void pm_runtime_get_suppliers(struct device *dev) idx = device_links_read_lock(); - list_for_each_entry_rcu(link, &dev->links.suppliers, c_node) + list_for_each_entry_rcu(link, &dev->links.suppliers, c_node, + device_links_read_lock_held()) if (link->flags & DL_FLAG_PM_RUNTIME) { link->supplier_preactivated = true; refcount_inc(&link->rpm_active); @@ -1683,7 +1687,8 @@ void pm_runtime_put_suppliers(struct device *dev) idx = device_links_read_lock(); - list_for_each_entry_rcu(link, &dev->links.suppliers, c_node) + list_for_each_entry_rcu(link, &dev->links.suppliers, c_node, + device_links_read_lock_held()) if (link->supplier_preactivated) { link->supplier_preactivated = false; if (refcount_dec_not_one(&link->rpm_active)) From 842a56cf3eb00f717f9522766c0e7b71bafd5fc1 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:27 -0400 Subject: [PATCH 33/86] x86/pci: Pass lockdep condition to pci_mmcfg_list iterator The pci_mmcfg_list is traversed by list_for_each_entry_rcu() outside of an RCU read-side critical section, which is safe because the pci_mmcfg_lock is held. This commit therefore adds a lockdep expression to list_for_each_entry_rcu() in order to avoid lockdep warnings. Acked-by: Bjorn Helgaas Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- arch/x86/pci/mmconfig-shared.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c index 7389db538c30..6fa42e9c4e6f 100644 --- a/arch/x86/pci/mmconfig-shared.c +++ b/arch/x86/pci/mmconfig-shared.c @@ -29,6 +29,7 @@ static bool pci_mmcfg_running_state; static bool pci_mmcfg_arch_init_failed; static DEFINE_MUTEX(pci_mmcfg_lock); +#define pci_mmcfg_lock_held() lock_is_held(&(pci_mmcfg_lock).dep_map) LIST_HEAD(pci_mmcfg_list); @@ -54,7 +55,7 @@ static void list_add_sorted(struct pci_mmcfg_region *new) struct pci_mmcfg_region *cfg; /* keep list sorted by segment and starting bus number */ - list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list) { + list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held()) { if (cfg->segment > new->segment || (cfg->segment == new->segment && cfg->start_bus >= new->start_bus)) { @@ -118,7 +119,7 @@ struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus) { struct pci_mmcfg_region *cfg; - list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list) + list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held()) if (cfg->segment == segment && cfg->start_bus <= bus && bus <= cfg->end_bus) return cfg; From bee6f87166e9c6b8d81a7570995bd637e8da485a Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Tue, 16 Jul 2019 18:12:28 -0400 Subject: [PATCH 34/86] acpi: Use built-in RCU list checking for acpi_ioremaps list This commit applies the consolidated list_for_each_entry_rcu() support for lockdep conditions. Acked-by: Rafael J. Wysocki Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E.
McKenney --- drivers/acpi/osl.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 9c0edf2fc0dd..2f9d0d20b836 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -80,6 +81,7 @@ struct acpi_ioremap { static LIST_HEAD(acpi_ioremaps); static DEFINE_MUTEX(acpi_ioremap_lock); +#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map) static void __init acpi_request_region (struct acpi_generic_address *gas, unsigned int length, char *desc) @@ -206,7 +208,7 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size size) { struct acpi_ioremap *map; - list_for_each_entry_rcu(map, &acpi_ioremaps, list) + list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held()) if (map->phys <= phys && phys + size <= map->phys + map->size) return map; @@ -249,7 +251,7 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size) { struct acpi_ioremap *map; - list_for_each_entry_rcu(map, &acpi_ioremaps, list) + list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held()) if (map->virt <= virt && virt + size <= map->virt + map->size) return map; From 58bf6f77c6fb0abe8e1330d8375dddd52711ef4c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 Mar 2019 15:33:59 -0700 Subject: [PATCH 35/86] rcu/nocb: Rename rcu_data fields to prepare for forward-progress work This commit simply renames rcu_data fields to prepare for leader nocb kthreads doing only grace-period work and callback shuffling. This will mean the addition of replacement kthreads to invoke callbacks. The "leader" and "follower" thus become less meaningful, so the commit changes no-CB fields with these strings to "gp" and "cb", respectively. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 14 ++++---- kernel/rcu/tree_plugin.h | 78 ++++++++++++++++++++-------------------- 2 files changed, 46 insertions(+), 46 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 7acaf3a62d39..e4e59b627c5a 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -198,10 +198,10 @@ struct rcu_data { struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for nocb */ atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ - struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ - struct rcu_head **nocb_follower_tail; + struct rcu_head *nocb_cb_head; /* CBs ready to invoke. */ + struct rcu_head **nocb_cb_tail; struct swait_queue_head nocb_wq; /* For nocb kthreads to sleep on. */ - struct task_struct *nocb_kthread; + struct task_struct *nocb_cb_kthread; raw_spinlock_t nocb_lock; /* Guard following pair of fields. */ int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ struct timer_list nocb_timer; /* Enforce finite deferral. */ @@ -210,12 +210,12 @@ struct rcu_data { struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp; /* CBs waiting for GP. */ struct rcu_head **nocb_gp_tail; - bool nocb_leader_sleep; /* Is the nocb leader thread asleep? */ - struct rcu_data *nocb_next_follower; - /* Next follower in wakeup chain. */ + bool nocb_gp_sleep; /* Is the nocb leader thread asleep? */ + struct rcu_data *nocb_next_cb_rdp; + /* Next rcu_data in wakeup chain. */ /* The following fields are used by the follower, hence new cachline. 
*/ - struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp; + struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp; /* Leader CPU takes GP-end wakeups. */ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 99e9d952827b..5ce1edd1c87f 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1528,19 +1528,19 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, unsigned long flags) __releases(rdp->nocb_lock) { - struct rcu_data *rdp_leader = rdp->nocb_leader; + struct rcu_data *rdp_leader = rdp->nocb_gp_rdp; lockdep_assert_held(&rdp->nocb_lock); - if (!READ_ONCE(rdp_leader->nocb_kthread)) { + if (!READ_ONCE(rdp_leader->nocb_cb_kthread)) { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); return; } - if (rdp_leader->nocb_leader_sleep || force) { + if (rdp_leader->nocb_gp_sleep || force) { /* Prior smp_mb__after_atomic() orders against prior enqueue. */ - WRITE_ONCE(rdp_leader->nocb_leader_sleep, false); + WRITE_ONCE(rdp_leader->nocb_gp_sleep, false); del_timer(&rdp->nocb_timer); raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - smp_mb(); /* ->nocb_leader_sleep before swake_up_one(). */ + smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). */ swake_up_one(&rdp_leader->nocb_wq); } else { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); @@ -1604,10 +1604,10 @@ static bool rcu_nocb_cpu_needs_barrier(int cpu) if (!rhp) rhp = READ_ONCE(rdp->nocb_gp_head); if (!rhp) - rhp = READ_ONCE(rdp->nocb_follower_head); + rhp = READ_ONCE(rdp->nocb_cb_head); /* Having no rcuo kthread but CBs after scheduler starts is bad! */ - if (!READ_ONCE(rdp->nocb_kthread) && rhp && + if (!READ_ONCE(rdp->nocb_cb_kthread) && rhp && rcu_scheduler_fully_active) { /* RCU callback enqueued before CPU first came online??? */ pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n", @@ -1646,7 +1646,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, smp_mb__after_atomic(); /* Store *old_rhpp before _wake test. */ /* If we are not being polled and there is a kthread, awaken it ... */ - t = READ_ONCE(rdp->nocb_kthread); + t = READ_ONCE(rdp->nocb_cb_kthread); if (rcu_nocb_poll || !t) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); @@ -1800,9 +1800,9 @@ wait_again: if (!rcu_nocb_poll) { trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Sleep")); swait_event_interruptible_exclusive(my_rdp->nocb_wq, - !READ_ONCE(my_rdp->nocb_leader_sleep)); + !READ_ONCE(my_rdp->nocb_gp_sleep)); raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); - my_rdp->nocb_leader_sleep = true; + my_rdp->nocb_gp_sleep = true; WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); del_timer(&my_rdp->nocb_timer); raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); @@ -1818,7 +1818,7 @@ wait_again: */ gotcbs = false; smp_mb(); /* wakeup and _sleep before ->nocb_head reads. */ - for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) { + for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head); if (!rdp->nocb_gp_head) continue; /* No CBs here, try next follower. */ @@ -1845,12 +1845,12 @@ wait_again: rcu_nocb_wait_gp(my_rdp); /* Each pass through the following loop wakes a follower, if needed. 
*/ - for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) { + for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { if (!rcu_nocb_poll && READ_ONCE(rdp->nocb_head) && - READ_ONCE(my_rdp->nocb_leader_sleep)) { + READ_ONCE(my_rdp->nocb_gp_sleep)) { raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); - my_rdp->nocb_leader_sleep = false;/* No need to sleep.*/ + my_rdp->nocb_gp_sleep = false;/* No need to sleep.*/ raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); } if (!rdp->nocb_gp_head) @@ -1858,18 +1858,18 @@ wait_again: /* Append callbacks to follower's "done" list. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - tail = rdp->nocb_follower_tail; - rdp->nocb_follower_tail = rdp->nocb_gp_tail; + tail = rdp->nocb_cb_tail; + rdp->nocb_cb_tail = rdp->nocb_gp_tail; *tail = rdp->nocb_gp_head; raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - if (rdp != my_rdp && tail == &rdp->nocb_follower_head) { + if (rdp != my_rdp && tail == &rdp->nocb_cb_head) { /* List was empty, so wake up the follower. */ swake_up_one(&rdp->nocb_wq); } } /* If we (the leader) don't have CBs, go wait some more. */ - if (!my_rdp->nocb_follower_head) + if (!my_rdp->nocb_cb_head) goto wait_again; } @@ -1882,8 +1882,8 @@ static void nocb_follower_wait(struct rcu_data *rdp) for (;;) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); swait_event_interruptible_exclusive(rdp->nocb_wq, - READ_ONCE(rdp->nocb_follower_head)); - if (smp_load_acquire(&rdp->nocb_follower_head)) { + READ_ONCE(rdp->nocb_cb_head)); + if (smp_load_acquire(&rdp->nocb_cb_head)) { /* ^^^ Ensure CB invocation follows _head test. */ return; } @@ -1910,17 +1910,17 @@ static int rcu_nocb_kthread(void *arg) /* Each pass through this loop invokes one batch of callbacks */ for (;;) { /* Wait for callbacks. */ - if (rdp->nocb_leader == rdp) + if (rdp->nocb_gp_rdp == rdp) nocb_leader_wait(rdp); else nocb_follower_wait(rdp); /* Pull the ready-to-invoke callbacks onto local list. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - list = rdp->nocb_follower_head; - rdp->nocb_follower_head = NULL; - tail = rdp->nocb_follower_tail; - rdp->nocb_follower_tail = &rdp->nocb_follower_head; + list = rdp->nocb_cb_head; + rdp->nocb_cb_head = NULL; + tail = rdp->nocb_cb_tail; + rdp->nocb_cb_tail = &rdp->nocb_cb_head; raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); if (WARN_ON_ONCE(!list)) continue; @@ -2048,7 +2048,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) { rdp->nocb_tail = &rdp->nocb_head; init_swait_queue_head(&rdp->nocb_wq); - rdp->nocb_follower_tail = &rdp->nocb_follower_head; + rdp->nocb_cb_tail = &rdp->nocb_cb_head; raw_spin_lock_init(&rdp->nocb_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); } @@ -2070,27 +2070,27 @@ static void rcu_spawn_one_nocb_kthread(int cpu) * If this isn't a no-CBs CPU or if it already has an rcuo kthread, * then nothing to do. */ - if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_kthread) + if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_cb_kthread) return; /* If we didn't spawn the leader first, reorganize! 
*/ - rdp_old_leader = rdp_spawn->nocb_leader; - if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_kthread) { + rdp_old_leader = rdp_spawn->nocb_gp_rdp; + if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_cb_kthread) { rdp_last = NULL; rdp = rdp_old_leader; do { - rdp->nocb_leader = rdp_spawn; + rdp->nocb_gp_rdp = rdp_spawn; if (rdp_last && rdp != rdp_spawn) - rdp_last->nocb_next_follower = rdp; + rdp_last->nocb_next_cb_rdp = rdp; if (rdp == rdp_spawn) { - rdp = rdp->nocb_next_follower; + rdp = rdp->nocb_next_cb_rdp; } else { rdp_last = rdp; - rdp = rdp->nocb_next_follower; - rdp_last->nocb_next_follower = NULL; + rdp = rdp->nocb_next_cb_rdp; + rdp_last->nocb_next_cb_rdp = NULL; } } while (rdp); - rdp_spawn->nocb_next_follower = rdp_old_leader; + rdp_spawn->nocb_next_cb_rdp = rdp_old_leader; } /* Spawn the kthread for this CPU. */ @@ -2098,7 +2098,7 @@ static void rcu_spawn_one_nocb_kthread(int cpu) "rcuo%c/%d", rcu_state.abbr, cpu); if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo kthread, OOM is now expected behavior\n", __func__)) return; - WRITE_ONCE(rdp_spawn->nocb_kthread, t); + WRITE_ONCE(rdp_spawn->nocb_cb_kthread, t); } /* @@ -2158,12 +2158,12 @@ static void __init rcu_organize_nocb_kthreads(void) if (rdp->cpu >= nl) { /* New leader, set up for followers & next leader. */ nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls; - rdp->nocb_leader = rdp; + rdp->nocb_gp_rdp = rdp; rdp_leader = rdp; } else { /* Another follower, link to previous leader. */ - rdp->nocb_leader = rdp_leader; - rdp_prev->nocb_next_follower = rdp; + rdp->nocb_gp_rdp = rdp_leader; + rdp_prev->nocb_next_cb_rdp = rdp; } rdp_prev = rdp; } From 6484fe54b5c64e9a388f369001508ab8df85a646 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 Mar 2019 15:44:18 -0700 Subject: [PATCH 36/86] rcu/nocb: Update comments to prepare for forward-progress work This commit simply rewords comments to prepare for leader nocb kthreads doing only grace-period work and callback shuffling. This will mean the addition of replacement kthreads to invoke callbacks. The "leader" and "follower" thus become less meaningful, so the commit changes no-CB comments with these strings to "GP" and "CB", respectively. (Give or take the usual grammatical transformations.) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 8 +++--- kernel/rcu/tree_plugin.h | 57 ++++++++++++++++++++-------------------- 2 files changed, 33 insertions(+), 32 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index e4e59b627c5a..32b3348d3a4d 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -206,17 +206,17 @@ struct rcu_data { int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ struct timer_list nocb_timer; /* Enforce finite deferral. */ - /* The following fields are used by the leader, hence own cacheline. */ + /* The following fields are used by GP kthread, hence own cacheline. */ struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp; /* CBs waiting for GP. */ struct rcu_head **nocb_gp_tail; - bool nocb_gp_sleep; /* Is the nocb leader thread asleep? */ + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ struct rcu_data *nocb_next_cb_rdp; /* Next rcu_data in wakeup chain. */ - /* The following fields are used by the follower, hence new cachline. */ + /* The following fields are used by CB kthread, hence new cachline. */ struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp; - /* Leader CPU takes GP-end wakeups. */ + /* GP rdp takes GP-end wakeups. 
*/ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ /* 6) RCU priority boosting. */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5ce1edd1c87f..5a72700c3a32 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1447,10 +1447,10 @@ static void rcu_cleanup_after_idle(void) * specified by rcu_nocb_mask. For the CPUs in the set, there are kthreads * created that pull the callbacks from the corresponding CPU, wait for * a grace period to elapse, and invoke the callbacks. These kthreads - * are organized into leaders, which manage incoming callbacks, wait for - * grace periods, and awaken followers, and the followers, which only - * invoke callbacks. Each leader is its own follower. The no-CBs CPUs - * do a wake_up() on their kthread when they insert a callback into any + * are organized into GP kthreads, which manage incoming callbacks, wait for + * grace periods, and awaken CB kthreads, and the CB kthreads, which only + * invoke callbacks. Each GP kthread invokes its own CBs. The no-CBs CPUs + * do a wake_up() on their GP kthread when they insert a callback into any * empty list, unless the rcu_nocb_poll boot parameter has been specified, * in which case each kthread actively polls its CPU. (Which isn't so great * for energy efficiency, but which does reduce RCU's overhead on that CPU.) @@ -1521,7 +1521,7 @@ bool rcu_is_nocb_cpu(int cpu) } /* - * Kick the leader kthread for this NOCB group. Caller holds ->nocb_lock + * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock * and this function releases it. */ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, @@ -1548,7 +1548,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, } /* - * Kick the leader kthread for this NOCB group, but caller has not + * Kick the GP kthread for this NOCB group, but caller has not * acquired locks. */ static void wake_nocb_leader(struct rcu_data *rdp, bool force) @@ -1560,8 +1560,8 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force) } /* - * Arrange to wake the leader kthread for this NOCB group at some - * future time when it is safe to do so. + * Arrange to wake the GP kthread for this NOCB group at some future + * time when it is safe to do so. */ static void wake_nocb_leader_defer(struct rcu_data *rdp, int waketype, const char *reason) @@ -1783,7 +1783,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp) } /* - * Leaders come here to wait for additional callbacks to show up. + * No-CBs GP kthreads come here to wait for additional callbacks to show up. * This function does not return until callbacks appear. */ static void nocb_leader_wait(struct rcu_data *my_rdp) @@ -1812,8 +1812,8 @@ wait_again: } /* - * Each pass through the following loop checks a follower for CBs. - * We are our own first follower. Any CBs found are moved to + * Each pass through the following loop checks for CBs. + * We are our own first CB kthread. Any CBs found are moved to * nocb_gp_head, where they await a grace period. */ gotcbs = false; @@ -1821,7 +1821,7 @@ wait_again: for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head); if (!rdp->nocb_gp_head) - continue; /* No CBs here, try next follower. */ + continue; /* No CBs here, try next. */ /* Move callbacks to wait-for-GP list, which is empty. */ WRITE_ONCE(rdp->nocb_head, NULL); @@ -1844,7 +1844,7 @@ wait_again: /* Wait for one grace period. */ rcu_nocb_wait_gp(my_rdp); - /* Each pass through the following loop wakes a follower, if needed. 
*/ + /* Each pass through this loop wakes a CB kthread, if needed. */ for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { if (!rcu_nocb_poll && READ_ONCE(rdp->nocb_head) && @@ -1854,27 +1854,27 @@ wait_again: raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); } if (!rdp->nocb_gp_head) - continue; /* No CBs, so no need to wake follower. */ + continue; /* No CBs, so no need to wake kthread. */ - /* Append callbacks to follower's "done" list. */ + /* Append callbacks to CB kthread's "done" list. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags); tail = rdp->nocb_cb_tail; rdp->nocb_cb_tail = rdp->nocb_gp_tail; *tail = rdp->nocb_gp_head; raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); if (rdp != my_rdp && tail == &rdp->nocb_cb_head) { - /* List was empty, so wake up the follower. */ + /* List was empty, so wake up the kthread. */ swake_up_one(&rdp->nocb_wq); } } - /* If we (the leader) don't have CBs, go wait some more. */ + /* If we (the GP kthreads) don't have CBs, go wait some more. */ if (!my_rdp->nocb_cb_head) goto wait_again; } /* - * Followers come here to wait for additional callbacks to show up. + * No-CBs CB kthreads come here to wait for additional callbacks to show up. * This function does not return until callbacks appear. */ static void nocb_follower_wait(struct rcu_data *rdp) @@ -1894,9 +1894,10 @@ static void nocb_follower_wait(struct rcu_data *rdp) /* * Per-rcu_data kthread, but only for no-CBs CPUs. Each kthread invokes - * callbacks queued by the corresponding no-CBs CPU, however, there is - * an optional leader-follower relationship so that the grace-period - * kthreads don't have to do quite so many wakeups. + * callbacks queued by the corresponding no-CBs CPU, however, there is an + * optional GP-CB relationship so that the grace-period kthreads don't + * have to do quite so many wakeups (as in they only need to wake the + * no-CBs GP kthreads, not the CB kthreads). */ static int rcu_nocb_kthread(void *arg) { @@ -2056,7 +2057,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) /* * If the specified CPU is a no-CBs CPU that does not already have its * rcuo kthread, spawn it. If the CPUs are brought online out of order, - * this can require re-organizing the leader-follower relationships. + * this can require re-organizing the GP-CB relationships. */ static void rcu_spawn_one_nocb_kthread(int cpu) { @@ -2073,7 +2074,7 @@ static void rcu_spawn_one_nocb_kthread(int cpu) if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_cb_kthread) return; - /* If we didn't spawn the leader first, reorganize! */ + /* If we didn't spawn the GP kthread first, reorganize! */ rdp_old_leader = rdp_spawn->nocb_gp_rdp; if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_cb_kthread) { rdp_last = NULL; @@ -2125,18 +2126,18 @@ static void __init rcu_spawn_nocb_kthreads(void) rcu_spawn_cpu_nocb_kthread(cpu); } -/* How many follower CPU IDs per leader? Default of -1 for sqrt(nr_cpu_ids). */ +/* How many CB CPU IDs per GP kthread? Default of -1 for sqrt(nr_cpu_ids). */ static int rcu_nocb_leader_stride = -1; module_param(rcu_nocb_leader_stride, int, 0444); /* - * Initialize leader-follower relationships for all no-CBs CPU. + * Initialize GP-CB relationships for all no-CBs CPU. */ static void __init rcu_organize_nocb_kthreads(void) { int cpu; int ls = rcu_nocb_leader_stride; - int nl = 0; /* Next leader. */ + int nl = 0; /* Next GP kthread. */ struct rcu_data *rdp; struct rcu_data *rdp_leader = NULL; /* Suppress misguided gcc warn. 
*/ struct rcu_data *rdp_prev = NULL; @@ -2156,12 +2157,12 @@ static void __init rcu_organize_nocb_kthreads(void) for_each_cpu(cpu, rcu_nocb_mask) { rdp = per_cpu_ptr(&rcu_data, cpu); if (rdp->cpu >= nl) { - /* New leader, set up for followers & next leader. */ + /* New GP kthread, set up for CBs & next GP. */ nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls; rdp->nocb_gp_rdp = rdp; rdp_leader = rdp; } else { - /* Another follower, link to previous leader. */ + /* Another CB kthread, link to previous GP kthread. */ rdp->nocb_gp_rdp = rdp_leader; rdp_prev->nocb_next_cb_rdp = rdp; } From 12f54c3a8410102afb96ed437aebe7f1d87f399f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 29 Mar 2019 16:43:51 -0700 Subject: [PATCH 37/86] rcu/nocb: Provide separate no-CBs grace-period kthreads Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads are divided into groups. The first rcuo kthread to come online in a given group is that group's leader, and the leader both waits for grace periods and invokes its CPU's callbacks. The non-leader rcuo kthreads only invoke callbacks. This works well in the real-time/embedded environments for which it was intended because such environments tend not to generate all that many callbacks. However, given huge floods of callbacks, it is possible for the leader kthread to be stuck invoking callbacks while its followers wait helplessly while their callbacks pile up. This is a good recipe for an OOM, and rcutorture's new callback-flood capability does generate such OOMs. One strategy would be to wait until such OOMs start happening in production, but similar OOMs have in fact happened starting in 2018. It would therefore be wise to take a more proactive approach. This commit therefore features per-CPU rcuo kthreads that do nothing but invoke callbacks. Instead of having one of these kthreads act as leader, each group has a separate rcog kthread that handles grace periods for its group. Because these rcuog kthreads do not invoke callbacks, callback floods on one CPU no longer block callbacks from reaching the rcuc callback-invocation kthreads on other CPUs. This change does introduce additional kthreads, however: 1. The number of additional kthreads is about the square root of the number of CPUs, so that a 4096-CPU system would have only about 64 additional kthreads. Note that recent changes decreased the number of rcuo kthreads by a factor of two (CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so this still represents a significant improvement on most systems. 2. The leading "rcuo" of the rcuog kthreads should allow existing scripting to affinity these additional kthreads as needed, the same as for the rcuop and rcuos kthreads. (There are no longer any rcuob kthreads.) 3. A state-machine approach was considered and rejected. Although this would allow the rcuo kthreads to continue their dual leader/follower roles, it complicates callback invocation and makes it more difficult to consolidate rcuo callback invocation with existing softirq callback invocation. The introduction of rcuog kthreads should thus be acceptable. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 6 +- kernel/rcu/tree_plugin.h | 115 +++++++++++++++++++-------------------- 2 files changed, 61 insertions(+), 60 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 32b3348d3a4d..dc3c53cb9608 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -200,8 +200,8 @@ struct rcu_data { atomic_long_t nocb_q_count_lazy; /* invocation (all stages). 
*/ struct rcu_head *nocb_cb_head; /* CBs ready to invoke. */ struct rcu_head **nocb_cb_tail; - struct swait_queue_head nocb_wq; /* For nocb kthreads to sleep on. */ - struct task_struct *nocb_cb_kthread; + struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */ + struct task_struct *nocb_gp_kthread; raw_spinlock_t nocb_lock; /* Guard following pair of fields. */ int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ struct timer_list nocb_timer; /* Enforce finite deferral. */ @@ -211,6 +211,8 @@ struct rcu_data { /* CBs waiting for GP. */ struct rcu_head **nocb_gp_tail; bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ + struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ + struct task_struct *nocb_cb_kthread; struct rcu_data *nocb_next_cb_rdp; /* Next rcu_data in wakeup chain. */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5a72700c3a32..c3b6493313ab 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1531,7 +1531,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, struct rcu_data *rdp_leader = rdp->nocb_gp_rdp; lockdep_assert_held(&rdp->nocb_lock); - if (!READ_ONCE(rdp_leader->nocb_cb_kthread)) { + if (!READ_ONCE(rdp_leader->nocb_gp_kthread)) { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); return; } @@ -1541,7 +1541,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, del_timer(&rdp->nocb_timer); raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). */ - swake_up_one(&rdp_leader->nocb_wq); + swake_up_one(&rdp_leader->nocb_gp_wq); } else { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); } @@ -1646,7 +1646,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, smp_mb__after_atomic(); /* Store *old_rhpp before _wake test. */ /* If we are not being polled and there is a kthread, awaken it ... */ - t = READ_ONCE(rdp->nocb_cb_kthread); + t = READ_ONCE(rdp->nocb_gp_kthread); if (rcu_nocb_poll || !t) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); @@ -1786,7 +1786,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp) * No-CBs GP kthreads come here to wait for additional callbacks to show up. * This function does not return until callbacks appear. */ -static void nocb_leader_wait(struct rcu_data *my_rdp) +static void nocb_gp_wait(struct rcu_data *my_rdp) { bool firsttime = true; unsigned long flags; @@ -1794,12 +1794,10 @@ static void nocb_leader_wait(struct rcu_data *my_rdp) struct rcu_data *rdp; struct rcu_head **tail; -wait_again: - /* Wait for callbacks to appear. */ if (!rcu_nocb_poll) { trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Sleep")); - swait_event_interruptible_exclusive(my_rdp->nocb_wq, + swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq, !READ_ONCE(my_rdp->nocb_gp_sleep)); raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); my_rdp->nocb_gp_sleep = true; @@ -1838,7 +1836,7 @@ wait_again: trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("WokeEmpty")); } - goto wait_again; + return; } /* Wait for one grace period. */ @@ -1862,34 +1860,47 @@ wait_again: rdp->nocb_cb_tail = rdp->nocb_gp_tail; *tail = rdp->nocb_gp_head; raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - if (rdp != my_rdp && tail == &rdp->nocb_cb_head) { + if (tail == &rdp->nocb_cb_head) { /* List was empty, so wake up the kthread. */ - swake_up_one(&rdp->nocb_wq); + swake_up_one(&rdp->nocb_cb_wq); } } +} - /* If we (the GP kthreads) don't have CBs, go wait some more. 
*/ - if (!my_rdp->nocb_cb_head) - goto wait_again; +/* + * No-CBs grace-period-wait kthread. There is one of these per group + * of CPUs, but only once at least one CPU in that group has come online + * at least once since boot. This kthread checks for newly posted + * callbacks from any of the CPUs it is responsible for, waits for a + * grace period, then awakens all of the rcu_nocb_cb_kthread() instances + * that then have callback-invocation work to do. + */ +static int rcu_nocb_gp_kthread(void *arg) +{ + struct rcu_data *rdp = arg; + + for (;;) + nocb_gp_wait(rdp); + return 0; } /* * No-CBs CB kthreads come here to wait for additional callbacks to show up. - * This function does not return until callbacks appear. + * This function returns true ("keep waiting") until callbacks appear and + * then false ("stop waiting") when callbacks finally do appear. */ -static void nocb_follower_wait(struct rcu_data *rdp) +static bool nocb_follower_wait(struct rcu_data *rdp) { - for (;;) { - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); - swait_event_interruptible_exclusive(rdp->nocb_wq, - READ_ONCE(rdp->nocb_cb_head)); - if (smp_load_acquire(&rdp->nocb_cb_head)) { - /* ^^^ Ensure CB invocation follows _head test. */ - return; - } - WARN_ON(signal_pending(current)); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty")); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); + swait_event_interruptible_exclusive(rdp->nocb_cb_wq, + READ_ONCE(rdp->nocb_cb_head)); + if (smp_load_acquire(&rdp->nocb_cb_head)) { /* VVV */ + /* ^^^ Ensure CB invocation follows _head test. */ + return false; } + WARN_ON(signal_pending(current)); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty")); + return true; } /* @@ -1899,7 +1910,7 @@ static void nocb_follower_wait(struct rcu_data *rdp) * have to do quite so many wakeups (as in they only need to wake the * no-CBs GP kthreads, not the CB kthreads). */ -static int rcu_nocb_kthread(void *arg) +static int rcu_nocb_cb_kthread(void *arg) { int c, cl; unsigned long flags; @@ -1911,10 +1922,8 @@ static int rcu_nocb_kthread(void *arg) /* Each pass through this loop invokes one batch of callbacks */ for (;;) { /* Wait for callbacks. */ - if (rdp->nocb_gp_rdp == rdp) - nocb_leader_wait(rdp); - else - nocb_follower_wait(rdp); + while (nocb_follower_wait(rdp)) + continue; /* Pull the ready-to-invoke callbacks onto local list. */ raw_spin_lock_irqsave(&rdp->nocb_lock, flags); @@ -2048,7 +2057,8 @@ void __init rcu_init_nohz(void) static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) { rdp->nocb_tail = &rdp->nocb_head; - init_swait_queue_head(&rdp->nocb_wq); + init_swait_queue_head(&rdp->nocb_cb_wq); + init_swait_queue_head(&rdp->nocb_gp_wq); rdp->nocb_cb_tail = &rdp->nocb_cb_head; raw_spin_lock_init(&rdp->nocb_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); @@ -2056,50 +2066,39 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) /* * If the specified CPU is a no-CBs CPU that does not already have its - * rcuo kthread, spawn it. If the CPUs are brought online out of order, - * this can require re-organizing the GP-CB relationships. + * rcuo CB kthread, spawn it. Additionally, if the rcuo GP kthread + * for this CPU's group has not yet been created, spawn it as well. 
*/ static void rcu_spawn_one_nocb_kthread(int cpu) { - struct rcu_data *rdp; - struct rcu_data *rdp_last; - struct rcu_data *rdp_old_leader; - struct rcu_data *rdp_spawn = per_cpu_ptr(&rcu_data, cpu); + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); + struct rcu_data *rdp_gp; struct task_struct *t; /* * If this isn't a no-CBs CPU or if it already has an rcuo kthread, * then nothing to do. */ - if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_cb_kthread) + if (!rcu_is_nocb_cpu(cpu) || rdp->nocb_cb_kthread) return; /* If we didn't spawn the GP kthread first, reorganize! */ - rdp_old_leader = rdp_spawn->nocb_gp_rdp; - if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_cb_kthread) { - rdp_last = NULL; - rdp = rdp_old_leader; - do { - rdp->nocb_gp_rdp = rdp_spawn; - if (rdp_last && rdp != rdp_spawn) - rdp_last->nocb_next_cb_rdp = rdp; - if (rdp == rdp_spawn) { - rdp = rdp->nocb_next_cb_rdp; - } else { - rdp_last = rdp; - rdp = rdp->nocb_next_cb_rdp; - rdp_last->nocb_next_cb_rdp = NULL; - } - } while (rdp); - rdp_spawn->nocb_next_cb_rdp = rdp_old_leader; + rdp_gp = rdp->nocb_gp_rdp; + if (!rdp_gp->nocb_gp_kthread) { + t = kthread_run(rcu_nocb_gp_kthread, rdp_gp, + "rcuog/%d", rdp_gp->cpu); + if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo GP kthread, OOM is now expected behavior\n", __func__)) + return; + WRITE_ONCE(rdp_gp->nocb_gp_kthread, t); } /* Spawn the kthread for this CPU. */ - t = kthread_run(rcu_nocb_kthread, rdp_spawn, + t = kthread_run(rcu_nocb_cb_kthread, rdp, "rcuo%c/%d", rcu_state.abbr, cpu); - if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo kthread, OOM is now expected behavior\n", __func__)) + if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__)) return; - WRITE_ONCE(rdp_spawn->nocb_cb_kthread, t); + WRITE_ONCE(rdp->nocb_cb_kthread, t); + WRITE_ONCE(rdp->nocb_gp_kthread, rdp_gp->nocb_gp_kthread); } /* From 9fa471a881df9ba84e0e0844d918ed1ec55fc567 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 31 Mar 2019 16:07:43 -0700 Subject: [PATCH 38/86] rcu/nocb: Rename nocb_follower_wait() to nocb_cb_wait() This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c3b6493313ab..9d5448217bbc 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1889,7 +1889,7 @@ static int rcu_nocb_gp_kthread(void *arg) * This function returns true ("keep waiting") until callbacks appear and * then false ("stop waiting") when callbacks finally do appear. */ -static bool nocb_follower_wait(struct rcu_data *rdp) +static bool nocb_cb_wait(struct rcu_data *rdp) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); swait_event_interruptible_exclusive(rdp->nocb_cb_wq, @@ -1922,7 +1922,7 @@ static int rcu_nocb_cb_kthread(void *arg) /* Each pass through this loop invokes one batch of callbacks */ for (;;) { /* Wait for callbacks. */ - while (nocb_follower_wait(rdp)) + while (nocb_cb_wait(rdp)) continue; /* Pull the ready-to-invoke callbacks onto local list. */ From 5d62c08c5fe54492899b978d0ddc0bf7fd071317 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Sun, 31 Mar 2019 16:10:17 -0700 Subject: [PATCH 39/86] rcu/nocb: Rename wake_nocb_leader() to wake_nocb_gp() This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 9d5448217bbc..632c2cfb9856 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1551,7 +1551,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, * Kick the GP kthread for this NOCB group, but caller has not * acquired locks. */ -static void wake_nocb_leader(struct rcu_data *rdp, bool force) +static void wake_nocb_gp(struct rcu_data *rdp, bool force) { unsigned long flags; @@ -1656,7 +1656,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, if (old_rhpp == &rdp->nocb_head) { if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ - wake_nocb_leader(rdp, false); + wake_nocb_gp(rdp, false); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty")); } else { @@ -1667,7 +1667,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ if (!irqs_disabled_flags(flags)) { - wake_nocb_leader(rdp, true); + wake_nocb_gp(rdp, true); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeOvf")); } else { From 5f675ba6eb5d1e4be56fc7c28881728373d880c4 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 31 Mar 2019 16:11:57 -0700 Subject: [PATCH 40/86] rcu/nocb: Rename __wake_nocb_leader() to __wake_nocb_gp() This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. While in the area, it also updates local variables. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 632c2cfb9856..7c7870da234a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1524,24 +1524,24 @@ bool rcu_is_nocb_cpu(int cpu) * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock * and this function releases it. */ -static void __wake_nocb_leader(struct rcu_data *rdp, bool force, - unsigned long flags) +static void __wake_nocb_gp(struct rcu_data *rdp, bool force, + unsigned long flags) __releases(rdp->nocb_lock) { - struct rcu_data *rdp_leader = rdp->nocb_gp_rdp; + struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; lockdep_assert_held(&rdp->nocb_lock); - if (!READ_ONCE(rdp_leader->nocb_gp_kthread)) { + if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); return; } - if (rdp_leader->nocb_gp_sleep || force) { + if (rdp_gp->nocb_gp_sleep || force) { /* Prior smp_mb__after_atomic() orders against prior enqueue. */ - WRITE_ONCE(rdp_leader->nocb_gp_sleep, false); + WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); del_timer(&rdp->nocb_timer); raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). 
*/ - swake_up_one(&rdp_leader->nocb_gp_wq); + swake_up_one(&rdp_gp->nocb_gp_wq); } else { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); } @@ -1556,7 +1556,7 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force) unsigned long flags; raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - __wake_nocb_leader(rdp, force, flags); + __wake_nocb_gp(rdp, force, flags); } /* @@ -1988,7 +1988,7 @@ static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp) } ndw = READ_ONCE(rdp->nocb_defer_wakeup); WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); - __wake_nocb_leader(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); + __wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake")); } From 0d52a6652f15f6b1155297326de85c07ca421d64 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 31 Mar 2019 16:19:02 -0700 Subject: [PATCH 41/86] rcu/nocb: Rename wake_nocb_leader_defer() to wake_nocb_gp_defer() This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 7c7870da234a..e6581a51ff9a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1563,8 +1563,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force) * Arrange to wake the GP kthread for this NOCB group at some future * time when it is safe to do so. */ -static void wake_nocb_leader_defer(struct rcu_data *rdp, int waketype, - const char *reason) +static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, + const char *reason) { unsigned long flags; @@ -1660,8 +1660,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty")); } else { - wake_nocb_leader_defer(rdp, RCU_NOCB_WAKE, - TPS("WakeEmptyIsDeferred")); + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE, + TPS("WakeEmptyIsDeferred")); } rdp->qlen_last_fqs_check = 0; } else if (len > rdp->qlen_last_fqs_check + qhimark) { @@ -1671,8 +1671,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeOvf")); } else { - wake_nocb_leader_defer(rdp, RCU_NOCB_WAKE_FORCE, - TPS("WakeOvfIsDeferred")); + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, + TPS("WakeOvfIsDeferred")); } rdp->qlen_last_fqs_check = LONG_MAX / 2; } else { From 0bdc33daef96a54f9e5799d84f2fbc05d9e5cae3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 31 Mar 2019 16:20:52 -0700 Subject: [PATCH 42/86] rcu/nocb: Rename rcu_organize_nocb_kthreads() local variable This commit renames rdp_leader to rdp_gp in order to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e6581a51ff9a..0af36e98e70f 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2138,7 +2138,7 @@ static void __init rcu_organize_nocb_kthreads(void) int ls = rcu_nocb_leader_stride; int nl = 0; /* Next GP kthread. */ struct rcu_data *rdp; - struct rcu_data *rdp_leader = NULL; /* Suppress misguided gcc warn. */ + struct rcu_data *rdp_gp = NULL; /* Suppress misguided gcc warn. 
*/ struct rcu_data *rdp_prev = NULL; if (!cpumask_available(rcu_nocb_mask)) @@ -2159,10 +2159,10 @@ static void __init rcu_organize_nocb_kthreads(void) /* New GP kthread, set up for CBs & next GP. */ nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls; rdp->nocb_gp_rdp = rdp; - rdp_leader = rdp; + rdp_gp = rdp; } else { /* Another CB kthread, link to previous GP kthread. */ - rdp->nocb_gp_rdp = rdp_leader; + rdp->nocb_gp_rdp = rdp_gp; rdp_prev->nocb_next_cb_rdp = rdp; } rdp_prev = rdp; From f7c9a9b664fb32a127e8e9a987b52023b92c3a0b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 1 Apr 2019 09:57:01 -0700 Subject: [PATCH 43/86] rcu/nocb: Rename and document no-CB CB kthread sleep trace event The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake() event, which never was documented and is now misleading. This commit therefore changes "FollowerSleep" to "CBSleep", documents this, and updates the documentation for "Sleep" as well. Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 3 ++- kernel/rcu/tree_plugin.h | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 02a3f78f7cd8..313324d1b135 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -267,7 +267,8 @@ TRACE_EVENT_RCU(rcu_exp_funnel_lock, * "WakeNotPoll": Don't wake rcuo kthread because it is polling. * "DeferredWake": Carried out the "IsDeferred" wakeup. * "Poll": Start of new polling cycle for rcu_nocb_poll. - * "Sleep": Sleep waiting for CBs for !rcu_nocb_poll. + * "Sleep": Sleep waiting for GP for !rcu_nocb_poll. + * "CBSleep": Sleep waiting for CBs for !rcu_nocb_poll. * "WokeEmpty": rcuo kthread woke to find empty list. * "WokeNonEmpty": rcuo kthread woke to find non-empty list. * "WaitQueue": Enqueue partially done, timed wait for it to complete. diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 0af36e98e70f..be065aacd63b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1891,7 +1891,7 @@ static int rcu_nocb_gp_kthread(void *arg) */ static bool nocb_cb_wait(struct rcu_data *rdp) { - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep")); swait_event_interruptible_exclusive(rdp->nocb_cb_wq, READ_ONCE(rdp->nocb_cb_head)); if (smp_load_acquire(&rdp->nocb_cb_head)) { /* VVV */ From f7c612b000d7e974826089b5a6f6eecd6805862a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 2 Apr 2019 08:05:55 -0700 Subject: [PATCH 44/86] rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameter This commit changes the name of the rcu_nocb_leader_stride kernel boot parameter to rcu_nocb_gp_stride in order to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 13 +++++++------ kernel/rcu/tree_plugin.h | 8 ++++---- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f3fcd6140ee1..79b983bedcaa 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3837,12 +3837,13 @@ RCU_BOOST is not set, valid values are 0-99 and the default is zero (non-realtime operation). - rcutree.rcu_nocb_leader_stride= [KNL] - Set the number of NOCB kthread groups, which - defaults to the square root of the number of - CPUs. 
Larger numbers reduces the wakeup overhead - on the per-CPU grace-period kthreads, but increases - that same overhead on each group's leader. + rcutree.rcu_nocb_gp_stride= [KNL] + Set the number of NOCB callback kthreads in + each group, which defaults to the square root + of the number of CPUs. Larger numbers reduce + the wakeup overhead on the global grace-period + kthread, but increases that same overhead on + each group's NOCB grace-period kthread. rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index be065aacd63b..80b27a9f306d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2126,8 +2126,8 @@ static void __init rcu_spawn_nocb_kthreads(void) } /* How many CB CPU IDs per GP kthread? Default of -1 for sqrt(nr_cpu_ids). */ -static int rcu_nocb_leader_stride = -1; -module_param(rcu_nocb_leader_stride, int, 0444); +static int rcu_nocb_gp_stride = -1; +module_param(rcu_nocb_gp_stride, int, 0444); /* * Initialize GP-CB relationships for all no-CBs CPU. @@ -2135,7 +2135,7 @@ module_param(rcu_nocb_leader_stride, int, 0444); static void __init rcu_organize_nocb_kthreads(void) { int cpu; - int ls = rcu_nocb_leader_stride; + int ls = rcu_nocb_gp_stride; int nl = 0; /* Next GP kthread. */ struct rcu_data *rdp; struct rcu_data *rdp_gp = NULL; /* Suppress misguided gcc warn. */ @@ -2145,7 +2145,7 @@ static void __init rcu_organize_nocb_kthreads(void) return; if (ls == -1) { ls = int_sqrt(nr_cpu_ids); - rcu_nocb_leader_stride = ls; + rcu_nocb_gp_stride = ls; } /* From 18cd8c93e69e3853eb408980089fb3c58813f922 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 1 Jun 2019 05:12:36 -0700 Subject: [PATCH 45/86] rcu/nocb: Print gp/cb kthread hierarchy if dump_tree This commit causes the no-CBs grace-period/callback hierarchy to be printed to the console when the dump_tree kernel boot parameter is set. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 80b27a9f306d..0a3f8680b450 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2135,6 +2135,7 @@ module_param(rcu_nocb_gp_stride, int, 0444); static void __init rcu_organize_nocb_kthreads(void) { int cpu; + bool firsttime = true; int ls = rcu_nocb_gp_stride; int nl = 0; /* Next GP kthread. */ struct rcu_data *rdp; @@ -2160,10 +2161,15 @@ static void __init rcu_organize_nocb_kthreads(void) nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls; rdp->nocb_gp_rdp = rdp; rdp_gp = rdp; + if (!firsttime && dump_tree) + pr_cont("\n"); + firsttime = false; + pr_alert("%s: No-CB GP kthread CPU %d:", __func__, cpu); } else { /* Another CB kthread, link to previous GP kthread. */ rdp->nocb_gp_rdp = rdp_gp; rdp_prev->nocb_next_cb_rdp = rdp; + pr_alert(" %d", cpu); } rdp_prev = rdp; } From 1bb5f9b95afe5d9d6b586389ce5e8f461a5b671c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 12 Apr 2019 12:34:41 -0700 Subject: [PATCH 46/86] rcu/nocb: Use separate flag to indicate disabled ->cblist NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but forward-progress considerations would require that this pointer be both NULL and non-NULL, which, absent a quantum-computer port of the Linux kernel, simply won't happen. This commit therefore creates a separate ->enabled flag to replace the current NULL checks. [ paulmck: Add include files per 0day test robot and -next. ] Signed-off-by: Paul E. 
McKenney --- include/linux/rcu_segcblist.h | 4 ++++ kernel/rcu/rcu_segcblist.c | 3 ++- kernel/rcu/rcu_segcblist.h | 2 +- kernel/rcu/tree_plugin.h | 2 +- 4 files changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index 87404cb015f1..ed2cfd3c0743 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -14,6 +14,9 @@ #ifndef __INCLUDE_LINUX_RCU_SEGCBLIST_H #define __INCLUDE_LINUX_RCU_SEGCBLIST_H +#include +#include + /* Simple unsegmented callback lists. */ struct rcu_cblist { struct rcu_head *head; @@ -67,6 +70,7 @@ struct rcu_segcblist { unsigned long gp_seq[RCU_CBLIST_NSEGS]; long len; long len_lazy; + u8 enabled; }; #define RCU_SEGCBLIST_INITIALIZER(n) \ diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 9bd5f6023c21..b305dcac34c9 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -58,6 +58,7 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp) rsclp->tails[i] = &rsclp->head; rsclp->len = 0; rsclp->len_lazy = 0; + rsclp->enabled = 1; } /* @@ -69,7 +70,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) WARN_ON_ONCE(!rcu_segcblist_empty(rsclp)); WARN_ON_ONCE(rcu_segcblist_n_cbs(rsclp)); WARN_ON_ONCE(rcu_segcblist_n_lazy_cbs(rsclp)); - rsclp->tails[RCU_NEXT_TAIL] = NULL; + rsclp->enabled = 0; } /* diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 822a39da0533..b2de7b32da29 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -63,7 +63,7 @@ static inline long rcu_segcblist_n_nonlazy_cbs(struct rcu_segcblist *rsclp) */ static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp) { - return !!rsclp->tails[RCU_NEXT_TAIL]; + return rsclp->enabled; } /* diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 0a3f8680b450..b8a43cf9bb4e 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2189,8 +2189,8 @@ static bool init_nocb_callback_list(struct rcu_data *rdp) rcu_segcblist_n_cbs(&rdp->cblist)); atomic_long_set(&rdp->nocb_q_count_lazy, rcu_segcblist_n_lazy_cbs(&rdp->cblist)); - rcu_segcblist_init(&rdp->cblist); } + rcu_segcblist_init(&rdp->cblist); rcu_segcblist_disable(&rdp->cblist); return true; } From ce5215c1342c6c89b3c3c45fea82cddf0b013787 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 12 Apr 2019 15:58:34 -0700 Subject: [PATCH 47/86] rcu/nocb: Use separate flag to indicate offloaded ->cblist RCU callback processing currently uses rcu_is_nocb_cpu() to determine whether or not the current CPU's callbacks are to be offloaded. This works, but it is not so good for cache locality. Plus use of ->cblist for offloaded callbacks will greatly increase the frequency of these checks. This commit therefore adds a ->offloaded flag to the rcu_segcblist structure to provide a more flexible and cache-friendly means of checking for callback offloading. Signed-off-by: Paul E. 
McKenney --- include/linux/rcu_segcblist.h | 1 + kernel/rcu/rcu_segcblist.c | 12 ++++++++++++ kernel/rcu/rcu_segcblist.h | 7 +++++++ kernel/rcu/tree.c | 10 ++++++---- kernel/rcu/tree_plugin.h | 11 +++++++---- 5 files changed, 33 insertions(+), 8 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index ed2cfd3c0743..8b684888f71d 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -71,6 +71,7 @@ struct rcu_segcblist { long len; long len_lazy; u8 enabled; + u8 offloaded; }; #define RCU_SEGCBLIST_INITIALIZER(n) \ diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index b305dcac34c9..700779f4c0cb 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -73,6 +73,18 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) rsclp->enabled = 0; } +/* + * Mark the specified rcu_segcblist structure as offloaded. This + * structure must be empty. + */ +void rcu_segcblist_offload(struct rcu_segcblist *rsclp) +{ + WARN_ON_ONCE(!rcu_segcblist_empty(rsclp)); + WARN_ON_ONCE(rcu_segcblist_n_cbs(rsclp)); + WARN_ON_ONCE(rcu_segcblist_n_lazy_cbs(rsclp)); + rsclp->offloaded = 1; +} + /* * Does the specified rcu_segcblist structure contain callbacks that * are ready to be invoked? diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index b2de7b32da29..8f3783391075 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -66,6 +66,12 @@ static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp) return rsclp->enabled; } +/* Is the specified rcu_segcblist offloaded? */ +static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp) +{ + return rsclp->offloaded; +} + /* * Are all segments following the specified segment of the specified * rcu_segcblist structure empty of callbacks? (The specified @@ -78,6 +84,7 @@ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg) void rcu_segcblist_init(struct rcu_segcblist *rsclp); void rcu_segcblist_disable(struct rcu_segcblist *rsclp); +void rcu_segcblist_offload(struct rcu_segcblist *rsclp); bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp); bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp); struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a14e5fbbea46..6f5c96c4f9a3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2858,10 +2858,11 @@ void rcu_barrier(void) * corresponding CPU's preceding callbacks have been invoked. */ for_each_possible_cpu(cpu) { - if (!cpu_online(cpu) && !rcu_is_nocb_cpu(cpu)) - continue; rdp = per_cpu_ptr(&rcu_data, cpu); - if (rcu_is_nocb_cpu(cpu)) { + if (!cpu_online(cpu) && + !rcu_segcblist_is_offloaded(&rdp->cblist)) + continue; + if (rcu_segcblist_is_offloaded(&rdp->cblist)) { if (!rcu_nocb_cpu_needs_barrier(cpu)) { rcu_barrier_trace(TPS("OfflineNoCB"), cpu, rcu_state.barrier_sequence); @@ -3155,7 +3156,8 @@ void rcutree_migrate_callbacks(int cpu) struct rcu_node *rnp_root = rcu_get_root(); bool needwake; - if (rcu_is_nocb_cpu(cpu) || rcu_segcblist_empty(&rdp->cblist)) + if (rcu_segcblist_is_offloaded(&rdp->cblist) || + rcu_segcblist_empty(&rdp->cblist)) return; /* No callbacks to migrate. 
*/ local_irq_save(flags); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b8a43cf9bb4e..fc6133eed50a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1382,7 +1382,7 @@ static void rcu_prepare_for_idle(void) int tne; lockdep_assert_irqs_disabled(); - if (rcu_is_nocb_cpu(smp_processor_id())) + if (rcu_segcblist_is_offloaded(&rdp->cblist)) return; /* Handle nohz enablement switches conservatively. */ @@ -1431,8 +1431,10 @@ static void rcu_prepare_for_idle(void) */ static void rcu_cleanup_after_idle(void) { + struct rcu_data *rdp = this_cpu_ptr(&rcu_data); + lockdep_assert_irqs_disabled(); - if (rcu_is_nocb_cpu(smp_processor_id())) + if (rcu_segcblist_is_offloaded(&rdp->cblist)) return; if (rcu_try_advance_all_cbs()) invoke_rcu_core(); @@ -1694,7 +1696,7 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, bool lazy, unsigned long flags) { - if (!rcu_is_nocb_cpu(rdp->cpu)) + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) return false; __call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy, flags); if (__is_kfree_rcu_offset((unsigned long)rhp->func)) @@ -1729,7 +1731,7 @@ static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp, unsigned long flags) { lockdep_assert_irqs_disabled(); - if (!rcu_is_nocb_cpu(smp_processor_id())) + if (!rcu_segcblist_is_offloaded(&my_rdp->cblist)) return false; /* Not NOCBs CPU, caller must migrate CBs. */ __call_rcu_nocb_enqueue(my_rdp, rcu_segcblist_head(&rdp->cblist), rcu_segcblist_tail(&rdp->cblist), @@ -2192,6 +2194,7 @@ static bool init_nocb_callback_list(struct rcu_data *rdp) } rcu_segcblist_init(&rdp->cblist); rcu_segcblist_disable(&rdp->cblist); + rcu_segcblist_offload(&rdp->cblist); return true; } From 750d7f6a434ff4640fa825dfb1eccb44e79fb6af Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Apr 2019 08:19:43 -0700 Subject: [PATCH 48/86] rcu/nocb: Add checks for offloaded callback processing This commit is a preparatory patch for offloaded callbacks using the same ->cblist structure used by non-offloaded callbacks. It therefore adds rcu_segcblist_is_offloaded() calls where they will be needed when !rcu_segcblist_is_enabled() no longer flags the offloaded case. It also adds checks in rcu_do_batch() to ensure that there are no missed checks: Currently, it should not be possible for offloaded execution to reach rcu_do_batch(), though this will change later in this series. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6f5c96c4f9a3..969ba292a669 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -210,7 +210,8 @@ static long rcu_get_n_cbs_cpu(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); - if (rcu_segcblist_is_enabled(&rdp->cblist)) /* Online normal CPU? */ + if (rcu_segcblist_is_enabled(&rdp->cblist) && + !rcu_segcblist_is_offloaded(&rdp->cblist)) /* Online normal CPU? */ return rcu_segcblist_n_cbs(&rdp->cblist); return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */ } @@ -2081,6 +2082,7 @@ static void rcu_do_batch(struct rcu_data *rdp) struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count; + WARN_ON_ONCE(rdp->cblist.offloaded); /* If no callbacks are ready, just return. */ if (!rcu_segcblist_ready_cbs(&rdp->cblist)) { trace_rcu_batch_start(rcu_state.name, @@ -2299,7 +2301,8 @@ static __latent_entropy void rcu_core(void) /* No grace period and unregistered callbacks? 
*/ if (!rcu_gp_in_progress() && - rcu_segcblist_is_enabled(&rdp->cblist)) { + rcu_segcblist_is_enabled(&rdp->cblist) && + !rcu_segcblist_is_offloaded(&rdp->cblist)) { local_irq_save(flags); if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) rcu_accelerate_cbs_unlocked(rnp, rdp); @@ -2514,7 +2517,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) rdp = this_cpu_ptr(&rcu_data); /* Add the callback to our list. */ - if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist)) || cpu != -1) { + if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist)) || + rcu_segcblist_is_offloaded(&rdp->cblist) || cpu != -1) { int offline; if (cpu != -1) @@ -2750,6 +2754,7 @@ static int rcu_pending(void) /* Has RCU gone idle with this CPU needing another grace period? */ if (!rcu_gp_in_progress() && rcu_segcblist_is_enabled(&rdp->cblist) && + !rcu_segcblist_is_offloaded(&rdp->cblist) && !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) return 1; From c00045be32fe13333ba8c62748ba04747c182838 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Apr 2019 14:09:15 -0700 Subject: [PATCH 49/86] rcu/nocb: Make rcutree_migrate_callbacks() start at leaf rcu_node structure Because rcutree_migrate_callbacks() is invoked infrequently and because an exact snapshot of the grace-period state might save some callbacks a second trip through a grace period, this function has used the root rcu_node structure. However, this safe-second-trip optimization happens only if rcutree_migrate_callbacks() races with grace-period initialization, so it is not worth the added mental load. This commit therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node structures, as is done elsewhere. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 969ba292a669..ea479d81da7f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3157,8 +3157,8 @@ void rcutree_migrate_callbacks(int cpu) { unsigned long flags; struct rcu_data *my_rdp; + struct rcu_node *my_rnp; struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); - struct rcu_node *rnp_root = rcu_get_root(); bool needwake; if (rcu_segcblist_is_offloaded(&rdp->cblist) || @@ -3167,18 +3167,19 @@ void rcutree_migrate_callbacks(int cpu) local_irq_save(flags); my_rdp = this_cpu_ptr(&rcu_data); + my_rnp = my_rdp->mynode; if (rcu_nocb_adopt_orphan_cbs(my_rdp, rdp, flags)) { local_irq_restore(flags); return; } - raw_spin_lock_rcu_node(rnp_root); /* irqs already disabled. */ + raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ /* Leverage recent GPs and set GP for new callbacks. */ - needwake = rcu_advance_cbs(rnp_root, rdp) || - rcu_advance_cbs(rnp_root, my_rdp); + needwake = rcu_advance_cbs(my_rnp, rdp) || + rcu_advance_cbs(my_rnp, my_rdp); rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist); WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) != !rcu_segcblist_n_cbs(&my_rdp->cblist)); - raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags); + raw_spin_unlock_irqrestore_rcu_node(my_rnp, flags); if (needwake) rcu_gp_kthread_wake(); WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 || From 85f69b32126dcf798f2c8d69a7957ba990bc2e02 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Tue, 16 Apr 2019 14:48:28 -0700 Subject: [PATCH 50/86] rcu/nocb: Check for deferred nocb wakeups before nohz_full early exit In theory, a timer is used to defer wakeups of no-CBs grace-period kthreads when the wakeup cannot be done safely directly from the call_rcu(). In practice, the one-jiffy delay is not always consistent with timely callback invocation under heavy call_rcu() loads. Therefore, there are a number of checks for a pending deferred wakeup, including from the scheduling-clock interrupt. Unfortunately, this check follows the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs. This commit therefore moves the check for the pending deferred no-CB wakeup to precede the rcu_nohz_full_cpu() early exit. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ea479d81da7f..f1a25d17e3a0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2739,6 +2739,10 @@ static int rcu_pending(void) /* Check for CPU stalls, if enabled. */ check_cpu_stall(rdp); + /* Does this CPU need a deferred NOCB wakeup? */ + if (rcu_nocb_need_deferred_wakeup(rdp)) + return 1; + /* Is this CPU a NO_HZ_FULL CPU that should ignore RCU? */ if (rcu_nohz_full_cpu()) return 0; @@ -2763,10 +2767,6 @@ static int rcu_pending(void) unlikely(READ_ONCE(rdp->gpwrap))) /* outside lock */ return 1; - /* Does this CPU need a deferred NOCB wakeup? */ - if (rcu_nocb_need_deferred_wakeup(rdp)) - return 1; - /* nothing to do */ return 0; } From ca5c8258081178c60b27e3532d9ea95b6eaa7040 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 16 Apr 2019 15:15:24 -0700 Subject: [PATCH 51/86] rcu/nocb: Remove deferred wakeup checks for extended quiescent states The idea behind the checks for extended quiescent states at the end of __call_rcu_nocb() is to handle cases where call_rcu() is invoked directly from within an extended quiescent state, for example, from the idle loop. However, this will result in a timer-mediated deferred wakeup, which will cause the needed wakeup to happen within a jiffy or thereabouts. There should be no forward-progress concerns, and if there are, the proper response is to exit the extended quiescent state while executing the endless blast of call_rcu() invocations, for example, using RCU_NONIDLE(). Given the more realistic case of an isolated call_rcu() invocation, there should be no problem. This commit therefore removes the checks for invoking call_rcu() within an extended quiescent state for on no-CBs CPUs. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index fc6133eed50a..9936a66b80bb 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1709,16 +1709,6 @@ static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, -atomic_long_read(&rdp->nocb_q_count_lazy), -rcu_get_n_cbs_nocb_cpu(rdp)); - /* - * If called from an extended quiescent state with interrupts - * disabled, invoke the RCU core in order to allow the idle-entry - * deferred-wakeup check to function. - */ - if (irqs_disabled_flags(flags) && - !rcu_is_watching() && - cpu_online(smp_processor_id())) - invoke_rcu_core(); - return true; } From 76c6927c3ee443e756f2c0c9f992cb04b26c65f2 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Mon, 13 May 2019 14:36:11 -0700 Subject: [PATCH 52/86] rcu/nocb: Allow lockless use of rcu_segcblist_restempty() Currently, rcu_segcblist_restempty() assumes that the callback list is not being changed by other CPUs, but upcoming changes will require it to operate locklessly. This commit therefore adds the needed READ_ONCE() calls, along with the WRITE_ONCE() calls when updating the callback list. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 30 +++++++++++++++--------------- kernel/rcu/rcu_segcblist.h | 2 +- 2 files changed, 16 insertions(+), 16 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 700779f4c0cb..0e7fe678b6ac 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -147,8 +147,8 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp, rsclp->len_lazy++; smp_mb(); /* Ensure counts are updated before callback is enqueued. */ rhp->next = NULL; - *rsclp->tails[RCU_NEXT_TAIL] = rhp; - rsclp->tails[RCU_NEXT_TAIL] = &rhp->next; + WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp); + WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next); } /* @@ -176,9 +176,9 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp, for (i = RCU_NEXT_TAIL; i > RCU_DONE_TAIL; i--) if (rsclp->tails[i] != rsclp->tails[i - 1]) break; - *rsclp->tails[i] = rhp; + WRITE_ONCE(*rsclp->tails[i], rhp); for (; i <= RCU_NEXT_TAIL; i++) - rsclp->tails[i] = &rhp->next; + WRITE_ONCE(rsclp->tails[i], &rhp->next); return true; } @@ -214,11 +214,11 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp, return; /* Nothing to do. */ *rclp->tail = rsclp->head; rsclp->head = *rsclp->tails[RCU_DONE_TAIL]; - *rsclp->tails[RCU_DONE_TAIL] = NULL; + WRITE_ONCE(*rsclp->tails[RCU_DONE_TAIL], NULL); rclp->tail = rsclp->tails[RCU_DONE_TAIL]; for (i = RCU_CBLIST_NSEGS - 1; i >= RCU_DONE_TAIL; i--) if (rsclp->tails[i] == rsclp->tails[RCU_DONE_TAIL]) - rsclp->tails[i] = &rsclp->head; + WRITE_ONCE(rsclp->tails[i], &rsclp->head); } /* @@ -237,9 +237,9 @@ void rcu_segcblist_extract_pend_cbs(struct rcu_segcblist *rsclp, return; /* Nothing to do. */ *rclp->tail = *rsclp->tails[RCU_DONE_TAIL]; rclp->tail = rsclp->tails[RCU_NEXT_TAIL]; - *rsclp->tails[RCU_DONE_TAIL] = NULL; + WRITE_ONCE(*rsclp->tails[RCU_DONE_TAIL], NULL); for (i = RCU_DONE_TAIL + 1; i < RCU_CBLIST_NSEGS; i++) - rsclp->tails[i] = rsclp->tails[RCU_DONE_TAIL]; + WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_DONE_TAIL]); } /* @@ -271,7 +271,7 @@ void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp, rsclp->head = rclp->head; for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++) if (&rsclp->head == rsclp->tails[i]) - rsclp->tails[i] = rclp->tail; + WRITE_ONCE(rsclp->tails[i], rclp->tail); else break; rclp->head = NULL; @@ -287,8 +287,8 @@ void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp, { if (!rclp->head) return; /* Nothing to do. */ - *rsclp->tails[RCU_NEXT_TAIL] = rclp->head; - rsclp->tails[RCU_NEXT_TAIL] = rclp->tail; + WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rclp->head); + WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], rclp->tail); rclp->head = NULL; rclp->tail = &rclp->head; } @@ -312,7 +312,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq) for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) { if (ULONG_CMP_LT(seq, rsclp->gp_seq[i])) break; - rsclp->tails[RCU_DONE_TAIL] = rsclp->tails[i]; + WRITE_ONCE(rsclp->tails[RCU_DONE_TAIL], rsclp->tails[i]); } /* If no callbacks moved, nothing more need be done. 
*/ @@ -321,7 +321,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq) /* Clean up tail pointers that might have been misordered above. */ for (j = RCU_WAIT_TAIL; j < i; j++) - rsclp->tails[j] = rsclp->tails[RCU_DONE_TAIL]; + WRITE_ONCE(rsclp->tails[j], rsclp->tails[RCU_DONE_TAIL]); /* * Callbacks moved, so clean up the misordered ->tails[] pointers @@ -332,7 +332,7 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq) for (j = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++, j++) { if (rsclp->tails[j] == rsclp->tails[RCU_NEXT_TAIL]) break; /* No more callbacks. */ - rsclp->tails[j] = rsclp->tails[i]; + WRITE_ONCE(rsclp->tails[j], rsclp->tails[i]); rsclp->gp_seq[j] = rsclp->gp_seq[i]; } } @@ -397,7 +397,7 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq) * structure other than in the RCU_NEXT_TAIL segment. */ for (; i < RCU_NEXT_TAIL; i++) { - rsclp->tails[i] = rsclp->tails[RCU_NEXT_TAIL]; + WRITE_ONCE(rsclp->tails[i], rsclp->tails[RCU_NEXT_TAIL]); rsclp->gp_seq[i] = seq; } return true; diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 8f3783391075..f74960f0305c 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -79,7 +79,7 @@ static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp) */ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg) { - return !*rsclp->tails[seg]; + return !READ_ONCE(*READ_ONCE(rsclp->tails[seg])); } void rcu_segcblist_init(struct rcu_segcblist *rsclp); From e6060b41c9955374079926a7612b857a8458ed1f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 13 May 2019 15:57:50 -0700 Subject: [PATCH 53/86] rcu/nocb: Allow lockless use of rcu_segcblist_empty() Currently, rcu_segcblist_empty() assumes that the callback list is not being changed by other CPUs, but upcoming changes will require it to operate locklessly. This commit therefore adds the needed READ_ONCE() call, along with the WRITE_ONCE() calls when updating the callback list's ->head field. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 4 ++-- kernel/rcu/rcu_segcblist.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 0e7fe678b6ac..06435a368be5 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -213,7 +213,7 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp, if (!rcu_segcblist_ready_cbs(rsclp)) return; /* Nothing to do. */ *rclp->tail = rsclp->head; - rsclp->head = *rsclp->tails[RCU_DONE_TAIL]; + WRITE_ONCE(rsclp->head, *rsclp->tails[RCU_DONE_TAIL]); WRITE_ONCE(*rsclp->tails[RCU_DONE_TAIL], NULL); rclp->tail = rsclp->tails[RCU_DONE_TAIL]; for (i = RCU_CBLIST_NSEGS - 1; i >= RCU_DONE_TAIL; i--) @@ -268,7 +268,7 @@ void rcu_segcblist_insert_done_cbs(struct rcu_segcblist *rsclp, if (!rclp->head) return; /* No callbacks to move. 
*/ *rclp->tail = rsclp->head; - rsclp->head = rclp->head; + WRITE_ONCE(rsclp->head, rclp->head); for (i = RCU_DONE_TAIL; i < RCU_CBLIST_NSEGS; i++) if (&rsclp->head == rsclp->tails[i]) WRITE_ONCE(rsclp->tails[i], rclp->tail); diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index f74960f0305c..d9142b3590a8 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -36,7 +36,7 @@ struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp); */ static inline bool rcu_segcblist_empty(struct rcu_segcblist *rsclp) { - return !rsclp->head; + return !READ_ONCE(rsclp->head); } /* Return number of callbacks in segmented callback list. */ From e83e73f5b0f8de6a8978ba64185e80fdf48a2a63 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 14 May 2019 09:50:49 -0700 Subject: [PATCH 54/86] rcu/nocb: Leave ->cblist enabled for no-CBs CPUs As a first step towards making no-CBs CPUs use the ->cblist, this commit leaves the ->cblist enabled for these CPUs. The main reason to make no-CBs CPUs use ->cblist is to take advantage of callback numbering, which will reduce the effects of missed grace periods which in turn will reduce forward-progress problems for no-CBs CPUs. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 3 --- kernel/rcu/rcu_segcblist.h | 2 +- kernel/rcu/tree.c | 5 +++-- kernel/rcu/tree.h | 1 - kernel/rcu/tree_plugin.h | 35 +++++++---------------------------- 5 files changed, 11 insertions(+), 35 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 06435a368be5..9ac28f175627 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -79,9 +79,6 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) */ void rcu_segcblist_offload(struct rcu_segcblist *rsclp) { - WARN_ON_ONCE(!rcu_segcblist_empty(rsclp)); - WARN_ON_ONCE(rcu_segcblist_n_cbs(rsclp)); - WARN_ON_ONCE(rcu_segcblist_n_lazy_cbs(rsclp)); rsclp->offloaded = 1; } diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index d9142b3590a8..ed3fcece39a9 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -59,7 +59,7 @@ static inline long rcu_segcblist_n_nonlazy_cbs(struct rcu_segcblist *rsclp) /* * Is the specified rcu_segcblist enabled, for example, not corresponding - * to an offline or callback-offloaded CPU? + * to an offline CPU? */ static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp) { diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index f1a25d17e3a0..2917ce379b23 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2964,7 +2964,8 @@ rcu_boot_init_percpu_data(int cpu) * Initializes a CPU's per-CPU RCU data. Note that only one online or * offline event can be happening at a given time. Note also that we can * accept some slop in the rsp->gp_seq access due to the fact that this - * CPU cannot possibly have any RCU callbacks in flight yet. + * CPU cannot possibly have any non-offloaded RCU callbacks in flight yet. + * And any offloaded callbacks are being numbered elsewhere. */ int rcutree_prepare_cpu(unsigned int cpu) { @@ -2978,7 +2979,7 @@ int rcutree_prepare_cpu(unsigned int cpu) rdp->n_force_qs_snap = rcu_state.n_force_qs; rdp->blimit = blimit; if (rcu_segcblist_empty(&rdp->cblist) && /* No early-boot CBs? */ - !init_nocb_callback_list(rdp)) + !rcu_segcblist_is_offloaded(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); /* Re-enable callbacks. */ rdp->dynticks_nesting = 1; /* CPU not up, no tearing. 
*/ rcu_dynticks_eqs_online(); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index dc3c53cb9608..8d9cfcac6757 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -438,7 +438,6 @@ static void __init rcu_spawn_nocb_kthreads(void); #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ -static bool init_nocb_callback_list(struct rcu_data *rdp); static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp); static void rcu_bind_gp_kthread(void); static bool rcu_nohz_full_cpu(void); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 9936a66b80bb..2d37fd3fa0d4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2007,6 +2007,7 @@ void __init rcu_init_nohz(void) { int cpu; bool need_rcu_nocb_mask = false; + struct rcu_data *rdp; #if defined(CONFIG_NO_HZ_FULL) if (tick_nohz_full_running && cpumask_weight(tick_nohz_full_mask)) @@ -2040,8 +2041,12 @@ void __init rcu_init_nohz(void) if (rcu_nocb_poll) pr_info("\tPoll for callbacks from no-CBs CPUs.\n"); - for_each_cpu(cpu, rcu_nocb_mask) - init_nocb_callback_list(per_cpu_ptr(&rcu_data, cpu)); + for_each_cpu(cpu, rcu_nocb_mask) { + rdp = per_cpu_ptr(&rcu_data, cpu); + if (rcu_segcblist_empty(&rdp->cblist)) + rcu_segcblist_init(&rdp->cblist); + rcu_segcblist_offload(&rdp->cblist); + } rcu_organize_nocb_kthreads(); } @@ -2167,27 +2172,6 @@ static void __init rcu_organize_nocb_kthreads(void) } } -/* Prevent __call_rcu() from enqueuing callbacks on no-CBs CPUs */ -static bool init_nocb_callback_list(struct rcu_data *rdp) -{ - if (!rcu_is_nocb_cpu(rdp->cpu)) - return false; - - /* If there are early-boot callbacks, move them to nocb lists. */ - if (!rcu_segcblist_empty(&rdp->cblist)) { - rdp->nocb_head = rcu_segcblist_head(&rdp->cblist); - rdp->nocb_tail = rcu_segcblist_tail(&rdp->cblist); - atomic_long_set(&rdp->nocb_q_count, - rcu_segcblist_n_cbs(&rdp->cblist)); - atomic_long_set(&rdp->nocb_q_count_lazy, - rcu_segcblist_n_lazy_cbs(&rdp->cblist)); - } - rcu_segcblist_init(&rdp->cblist); - rcu_segcblist_disable(&rdp->cblist); - rcu_segcblist_offload(&rdp->cblist); - return true; -} - /* * Bind the current task to the offloaded CPUs. If there are no offloaded * CPUs, leave the task unbound. Splat if the bind attempt fails. @@ -2263,11 +2247,6 @@ static void __init rcu_spawn_nocb_kthreads(void) { } -static bool init_nocb_callback_list(struct rcu_data *rdp) -{ - return false; -} - static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) { return 0; From 5d6742b37727e111f4755155e59c5319cf5caa7b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 15 May 2019 09:56:40 -0700 Subject: [PATCH 55/86] rcu/nocb: Use rcu_segcblist for no-CBs CPUs Currently the RCU callbacks for no-CBs CPUs are queued on a series of ad-hoc linked lists, which means that these callbacks cannot benefit from "drive-by" grace periods, thus suffering needless delays prior to invocation. In addition, the no-CBs grace-period kthreads first wait for callbacks to appear and later wait for a new grace period, which means that callbacks appearing during a grace-period wait can be delayed. These delays increase memory footprint, and could even result in an out-of-memory condition. This commit therefore enqueues RCU callbacks from no-CBs CPUs on the rcu_segcblist structure that is already used by non-no-CBs CPUs. It also restructures the no-CBs grace-period kthread to be checking for incoming callbacks while waiting for grace periods. 
Also, instead of waiting for a new grace period, it waits for the closest grace period that will cause some of the callbacks to be safe to invoke. All of these changes reduce callback latency and thus the number of outstanding callbacks, in turn reducing the probability of an out-of-memory condition. Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 1 - kernel/rcu/rcu_segcblist.c | 12 + kernel/rcu/rcu_segcblist.h | 1 + kernel/rcu/tree.c | 116 +++++---- kernel/rcu/tree.h | 14 +- kernel/rcu/tree_plugin.h | 516 ++++++++++++++----------------------- 6 files changed, 273 insertions(+), 387 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 313324d1b135..694bd040cf51 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -100,7 +100,6 @@ TRACE_EVENT_RCU(rcu_grace_period, * "Startedroot": Requested a nocb grace period based on root-node data. * "NoGPkthread": The RCU grace-period kthread has not yet started. * "StartWait": Start waiting for the requested grace period. - * "ResumeWait": Resume waiting after signal. * "EndWait": Complete wait. * "Cleanup": Clean up rcu_node structure after previous GP. * "CleanupMore": Clean up, and another GP is needed. diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 9ac28f175627..92968b856593 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -127,6 +127,18 @@ struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp) return NULL; } +/* + * Return false if there are no CBs awaiting grace periods, otherwise, + * return true and store the nearest waited-upon grace period into *lp. + */ +bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp) +{ + if (!rcu_segcblist_pend_cbs(rsclp)) + return false; + *lp = rsclp->gp_seq[RCU_WAIT_TAIL]; + return true; +} + /* * Enqueue the specified callback onto the specified rcu_segcblist * structure, updating accounting as needed. Note that the ->len diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index ed3fcece39a9..db38f0a512c4 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -89,6 +89,7 @@ bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp); bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp); struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp); struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp); +bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp); void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp, struct rcu_head *rhp, bool lazy); bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp, diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 2917ce379b23..054418d2d960 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1343,8 +1343,10 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp) */ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) { - bool ret; + bool ret = false; bool need_gp; + const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + rcu_segcblist_is_offloaded(&rdp->cblist); raw_lockdep_assert_held_rcu_node(rnp); @@ -1354,10 +1356,12 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) /* Handle the ends of any preceding grace periods first. */ if (rcu_seq_completed_gp(rdp->gp_seq, rnp->gp_seq) || unlikely(READ_ONCE(rdp->gpwrap))) { - ret = rcu_advance_cbs(rnp, rdp); /* Advance callbacks. */ + if (!offloaded) + ret = rcu_advance_cbs(rnp, rdp); /* Advance CBs. 
*/ trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuend")); } else { - ret = rcu_accelerate_cbs(rnp, rdp); /* Recent callbacks. */ + if (!offloaded) + ret = rcu_accelerate_cbs(rnp, rdp); /* Recent CBs. */ } /* Now handle the beginnings of any new-to-this-CPU grace periods. */ @@ -1658,6 +1662,7 @@ static void rcu_gp_cleanup(void) unsigned long gp_duration; bool needgp = false; unsigned long new_gp_seq; + bool offloaded; struct rcu_data *rdp; struct rcu_node *rnp = rcu_get_root(); struct swait_queue_head *sq; @@ -1723,7 +1728,9 @@ static void rcu_gp_cleanup(void) needgp = true; } /* Advance CBs to reduce false positives below. */ - if (!rcu_accelerate_cbs(rnp, rdp) && needgp) { + offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + rcu_segcblist_is_offloaded(&rdp->cblist); + if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); rcu_state.gp_req_activity = jiffies; trace_rcu_grace_period(rcu_state.name, @@ -1917,7 +1924,9 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) { unsigned long flags; unsigned long mask; - bool needwake; + bool needwake = false; + const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + rcu_segcblist_is_offloaded(&rdp->cblist); struct rcu_node *rnp; rnp = rdp->mynode; @@ -1944,7 +1953,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) * This GP can't end until cpu checks in, so all of our * callbacks can be processed during the next GP. */ - needwake = rcu_accelerate_cbs(rnp, rdp); + if (!offloaded) + needwake = rcu_accelerate_cbs(rnp, rdp); rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); /* ^^^ Released rnp->lock */ @@ -2082,7 +2092,6 @@ static void rcu_do_batch(struct rcu_data *rdp) struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count; - WARN_ON_ONCE(rdp->cblist.offloaded); /* If no callbacks are ready, just return. */ if (!rcu_segcblist_ready_cbs(&rdp->cblist)) { trace_rcu_batch_start(rcu_state.name, @@ -2101,13 +2110,14 @@ static void rcu_do_batch(struct rcu_data *rdp) * callback counts, as rcu_barrier() needs to be conservative. */ local_irq_save(flags); + rcu_nocb_lock(rdp); WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); bl = rdp->blimit; trace_rcu_batch_start(rcu_state.name, rcu_segcblist_n_lazy_cbs(&rdp->cblist), rcu_segcblist_n_cbs(&rdp->cblist), bl); rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl); - local_irq_restore(flags); + rcu_nocb_unlock_irqrestore(rdp, flags); /* Invoke callbacks. */ rhp = rcu_cblist_dequeue(&rcl); @@ -2120,12 +2130,22 @@ static void rcu_do_batch(struct rcu_data *rdp) * Note: The rcl structure counts down from zero. */ if (-rcl.len >= bl && + !rcu_segcblist_is_offloaded(&rdp->cblist) && (need_resched() || (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) break; + if (rcu_segcblist_is_offloaded(&rdp->cblist)) { + WARN_ON_ONCE(in_serving_softirq()); + local_bh_enable(); + lockdep_assert_irqs_enabled(); + cond_resched_tasks_rcu_qs(); + lockdep_assert_irqs_enabled(); + local_bh_disable(); + } } local_irq_save(flags); + rcu_nocb_lock(rdp); count = -rcl.len; trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(), is_idle_task(current), rcu_is_callbacks_kthread()); @@ -2153,10 +2173,11 @@ static void rcu_do_batch(struct rcu_data *rdp) */ WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) != (count == 0)); - local_irq_restore(flags); + rcu_nocb_unlock_irqrestore(rdp, flags); /* Re-invoke RCU core processing if there are callbacks remaining. 
*/ - if (rcu_segcblist_ready_cbs(&rdp->cblist)) + if (!rcu_segcblist_is_offloaded(&rdp->cblist) && + rcu_segcblist_ready_cbs(&rdp->cblist)) invoke_rcu_core(); } @@ -2312,7 +2333,8 @@ static __latent_entropy void rcu_core(void) rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check()); /* If there are callbacks ready, invoke them. */ - if (rcu_segcblist_ready_cbs(&rdp->cblist) && + if (!rcu_segcblist_is_offloaded(&rdp->cblist) && + rcu_segcblist_ready_cbs(&rdp->cblist) && likely(READ_ONCE(rcu_scheduler_fully_active))) rcu_do_batch(rdp); @@ -2492,10 +2514,11 @@ static void rcu_leak_callback(struct rcu_head *rhp) * is expected to specify a CPU. */ static void -__call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) +__call_rcu(struct rcu_head *head, rcu_callback_t func, bool lazy) { unsigned long flags; struct rcu_data *rdp; + bool was_alldone; /* Misaligned rcu_head! */ WARN_ON_ONCE((unsigned long)head & (sizeof(void *) - 1)); @@ -2517,29 +2540,17 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) rdp = this_cpu_ptr(&rcu_data); /* Add the callback to our list. */ - if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist)) || - rcu_segcblist_is_offloaded(&rdp->cblist) || cpu != -1) { - int offline; - - if (cpu != -1) - rdp = per_cpu_ptr(&rcu_data, cpu); - if (likely(rdp->mynode)) { - /* Post-boot, so this should be for a no-CBs CPU. */ - offline = !__call_rcu_nocb(rdp, head, lazy, flags); - WARN_ON_ONCE(offline); - /* Offline CPU, _call_rcu() illegal, leak callback. */ - local_irq_restore(flags); - return; - } - /* - * Very early boot, before rcu_init(). Initialize if needed - * and then drop through to queue the callback. - */ - WARN_ON_ONCE(cpu != -1); + if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist))) { + // This can trigger due to call_rcu() from offline CPU: + WARN_ON_ONCE(rcu_scheduler_active != RCU_SCHEDULER_INACTIVE); WARN_ON_ONCE(!rcu_is_watching()); + // Very early boot, before rcu_init(). Initialize if needed + // and then drop through to queue the callback. if (rcu_segcblist_empty(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); } + rcu_nocb_lock(rdp); + was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); rcu_segcblist_enqueue(&rdp->cblist, head, lazy); if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rcu_state.name, head, @@ -2552,8 +2563,13 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) rcu_segcblist_n_cbs(&rdp->cblist)); /* Go handle any RCU core processing required. 
*/ - __call_rcu_core(rdp, head, flags); - local_irq_restore(flags); + if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { + __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */ + } else { + __call_rcu_core(rdp, head, flags); + local_irq_restore(flags); + } } /** @@ -2593,7 +2609,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, int cpu, bool lazy) */ void call_rcu(struct rcu_head *head, rcu_callback_t func) { - __call_rcu(head, func, -1, 0); + __call_rcu(head, func, 0); } EXPORT_SYMBOL_GPL(call_rcu); @@ -2606,7 +2622,7 @@ EXPORT_SYMBOL_GPL(call_rcu); */ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { - __call_rcu(head, func, -1, 1); + __call_rcu(head, func, 1); } EXPORT_SYMBOL_GPL(kfree_call_rcu); @@ -2806,6 +2822,7 @@ static void rcu_barrier_func(void *unused) rcu_barrier_trace(TPS("IRQ"), -1, rcu_state.barrier_sequence); rdp->barrier_head.func = rcu_barrier_callback; debug_rcu_head_queue(&rdp->barrier_head); + rcu_nocb_lock(rdp); if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head, 0)) { atomic_inc(&rcu_state.barrier_cpu_count); } else { @@ -2813,6 +2830,7 @@ static void rcu_barrier_func(void *unused) rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); } + rcu_nocb_unlock(rdp); } /** @@ -2867,19 +2885,7 @@ void rcu_barrier(void) if (!cpu_online(cpu) && !rcu_segcblist_is_offloaded(&rdp->cblist)) continue; - if (rcu_segcblist_is_offloaded(&rdp->cblist)) { - if (!rcu_nocb_cpu_needs_barrier(cpu)) { - rcu_barrier_trace(TPS("OfflineNoCB"), cpu, - rcu_state.barrier_sequence); - } else { - rcu_barrier_trace(TPS("OnlineNoCB"), cpu, - rcu_state.barrier_sequence); - smp_mb__before_atomic(); - atomic_inc(&rcu_state.barrier_cpu_count); - __call_rcu(&rdp->barrier_head, - rcu_barrier_callback, cpu, 0); - } - } else if (rcu_segcblist_n_cbs(&rdp->cblist)) { + if (rcu_segcblist_n_cbs(&rdp->cblist)) { rcu_barrier_trace(TPS("OnlineQ"), cpu, rcu_state.barrier_sequence); smp_call_function_single(cpu, rcu_barrier_func, NULL, 1); @@ -3169,10 +3175,7 @@ void rcutree_migrate_callbacks(int cpu) local_irq_save(flags); my_rdp = this_cpu_ptr(&rcu_data); my_rnp = my_rdp->mynode; - if (rcu_nocb_adopt_orphan_cbs(my_rdp, rdp, flags)) { - local_irq_restore(flags); - return; - } + rcu_nocb_lock(my_rdp); /* irqs already disabled. */ raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ /* Leverage recent GPs and set GP for new callbacks. */ needwake = rcu_advance_cbs(my_rnp, rdp) || @@ -3180,9 +3183,16 @@ void rcutree_migrate_callbacks(int cpu) rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist); WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) != !rcu_segcblist_n_cbs(&my_rdp->cblist)); - raw_spin_unlock_irqrestore_rcu_node(my_rnp, flags); + if (rcu_segcblist_is_offloaded(&my_rdp->cblist)) { + raw_spin_unlock_rcu_node(my_rnp); /* irqs remain disabled. */ + __call_rcu_nocb_wake(my_rdp, true, flags); + } else { + rcu_nocb_unlock(my_rdp); /* irqs remain disabled. */ + raw_spin_unlock_irqrestore_rcu_node(my_rnp, flags); + } if (needwake) rcu_gp_kthread_wake(); + lockdep_assert_irqs_enabled(); WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 || !rcu_segcblist_empty(&rdp->cblist), "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n", diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 8d9cfcac6757..529eec2aa74d 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -211,7 +211,9 @@ struct rcu_data { /* CBs waiting for GP. 
*/ struct rcu_head **nocb_gp_tail; bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ + bool nocb_gp_forced; /* Forced nocb GP thread wakeup? */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ + bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ struct task_struct *nocb_cb_kthread; struct rcu_data *nocb_next_cb_rdp; /* Next rcu_data in wakeup chain. */ @@ -421,20 +423,20 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp); static bool rcu_preempt_need_deferred_qs(struct task_struct *t); static void rcu_preempt_deferred_qs(struct task_struct *t); static void zero_cpu_stall_ticks(struct rcu_data *rdp); -static bool rcu_nocb_cpu_needs_barrier(int cpu); static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); static void rcu_init_one_nocb(struct rcu_node *rnp); -static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, - bool lazy, unsigned long flags); -static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp, - struct rcu_data *rdp, - unsigned long flags); +static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, + unsigned long flags); static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp); static void do_nocb_deferred_wakeup(struct rcu_data *rdp); static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp); static void rcu_spawn_cpu_nocb_kthread(int cpu); static void __init rcu_spawn_nocb_kthreads(void); +static void rcu_nocb_lock(struct rcu_data *rdp); +static void rcu_nocb_unlock(struct rcu_data *rdp); +static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, + unsigned long flags); #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d37fd3fa0d4..feffc46cccb0 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1494,6 +1494,45 @@ static int __init parse_rcu_nocb_poll(char *arg) } early_param("rcu_nocb_poll", parse_rcu_nocb_poll); +/* + * Acquire the specified rcu_data structure's ->nocb_lock, but only + * if it corresponds to a no-CBs CPU. + */ +static void rcu_nocb_lock(struct rcu_data *rdp) +{ + if (rcu_segcblist_is_offloaded(&rdp->cblist)) { + lockdep_assert_irqs_disabled(); + raw_spin_lock(&rdp->nocb_lock); + } +} + +/* + * Release the specified rcu_data structure's ->nocb_lock, but only + * if it corresponds to a no-CBs CPU. + */ +static void rcu_nocb_unlock(struct rcu_data *rdp) +{ + if (rcu_segcblist_is_offloaded(&rdp->cblist)) { + lockdep_assert_irqs_disabled(); + raw_spin_unlock(&rdp->nocb_lock); + } +} + +/* + * Release the specified rcu_data structure's ->nocb_lock and restore + * interrupts, but only if it corresponds to a no-CBs CPU. + */ +static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, + unsigned long flags) +{ + if (rcu_segcblist_is_offloaded(&rdp->cblist)) { + lockdep_assert_irqs_disabled(); + raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + } else { + local_irq_restore(flags); + } +} + /* * Wake up any no-CBs CPUs' kthreads that were waiting on the just-ended * grace period. @@ -1526,7 +1565,7 @@ bool rcu_is_nocb_cpu(int cpu) * Kick the GP kthread for this NOCB group. Caller holds ->nocb_lock * and this function releases it. 
*/ -static void __wake_nocb_gp(struct rcu_data *rdp, bool force, +static void wake_nocb_gp(struct rcu_data *rdp, bool force, unsigned long flags) __releases(rdp->nocb_lock) { @@ -1537,30 +1576,19 @@ static void __wake_nocb_gp(struct rcu_data *rdp, bool force, raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); return; } - if (rdp_gp->nocb_gp_sleep || force) { - /* Prior smp_mb__after_atomic() orders against prior enqueue. */ - WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); + if (READ_ONCE(rdp_gp->nocb_gp_sleep) || force) { del_timer(&rdp->nocb_timer); raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). */ - swake_up_one(&rdp_gp->nocb_gp_wq); + smp_mb(); /* enqueue before ->nocb_gp_sleep. */ + raw_spin_lock_irqsave(&rdp_gp->nocb_lock, flags); + WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); + raw_spin_unlock_irqrestore(&rdp_gp->nocb_lock, flags); + wake_up_process(rdp_gp->nocb_gp_kthread); } else { raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); } } -/* - * Kick the GP kthread for this NOCB group, but caller has not - * acquired locks. - */ -static void wake_nocb_gp(struct rcu_data *rdp, bool force) -{ - unsigned long flags; - - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - __wake_nocb_gp(rdp, force, flags); -} - /* * Arrange to wake the GP kthread for this NOCB group at some future * time when it is safe to do so. @@ -1568,295 +1596,148 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force) static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, const char *reason) { - unsigned long flags; - - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) mod_timer(&rdp->nocb_timer, jiffies + 1); WRITE_ONCE(rdp->nocb_defer_wakeup, waketype); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason); - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); -} - -/* Does rcu_barrier need to queue an RCU callback on the specified CPU? */ -static bool rcu_nocb_cpu_needs_barrier(int cpu) -{ - struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); - unsigned long ret; -#ifdef CONFIG_PROVE_RCU - struct rcu_head *rhp; -#endif /* #ifdef CONFIG_PROVE_RCU */ - - /* - * Check count of all no-CBs callbacks awaiting invocation. - * There needs to be a barrier before this function is called, - * but associated with a prior determination that no more - * callbacks would be posted. In the worst case, the first - * barrier in rcu_barrier() suffices (but the caller cannot - * necessarily rely on this, not a substitute for the caller - * getting the concurrency design right!). There must also be a - * barrier between the following load and posting of a callback - * (if a callback is in fact needed). This is associated with an - * atomic_inc() in the caller. - */ - ret = rcu_get_n_cbs_nocb_cpu(rdp); - -#ifdef CONFIG_PROVE_RCU - rhp = READ_ONCE(rdp->nocb_head); - if (!rhp) - rhp = READ_ONCE(rdp->nocb_gp_head); - if (!rhp) - rhp = READ_ONCE(rdp->nocb_cb_head); - - /* Having no rcuo kthread but CBs after scheduler starts is bad! */ - if (!READ_ONCE(rdp->nocb_cb_kthread) && rhp && - rcu_scheduler_fully_active) { - /* RCU callback enqueued before CPU first came online??? */ - pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n", - cpu, rhp->func); - WARN_ON_ONCE(1); - } -#endif /* #ifdef CONFIG_PROVE_RCU */ - - return !!ret; } /* - * Enqueue the specified string of rcu_head structures onto the specified - * CPU's no-CBs lists. 
The CPU is specified by rdp, the head of the - * string by rhp, and the tail of the string by rhtp. The non-lazy/lazy - * counts are supplied by rhcount and rhcount_lazy. + * Awaken the no-CBs grace-period kthead if needed, either due to it + * legitimately being asleep or due to overload conditions. * * If warranted, also wake up the kthread servicing this CPUs queues. */ -static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, - struct rcu_head *rhp, - struct rcu_head **rhtp, - int rhcount, int rhcount_lazy, - unsigned long flags) +static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, + unsigned long flags) + __releases(rdp->nocb_lock) { int len; - struct rcu_head **old_rhpp; struct task_struct *t; - /* Enqueue the callback on the nocb list and update counts. */ - atomic_long_add(rhcount, &rdp->nocb_q_count); - /* rcu_barrier() relies on ->nocb_q_count add before xchg. */ - old_rhpp = xchg(&rdp->nocb_tail, rhtp); - WRITE_ONCE(*old_rhpp, rhp); - atomic_long_add(rhcount_lazy, &rdp->nocb_q_count_lazy); - smp_mb__after_atomic(); /* Store *old_rhpp before _wake test. */ - - /* If we are not being polled and there is a kthread, awaken it ... */ + // If we are being polled or there is no kthread, just leave. t = READ_ONCE(rdp->nocb_gp_kthread); if (rcu_nocb_poll || !t) { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); + rcu_nocb_unlock_irqrestore(rdp, flags); return; } - len = rcu_get_n_cbs_nocb_cpu(rdp); - if (old_rhpp == &rdp->nocb_head) { + // Need to actually to a wakeup. + len = rcu_segcblist_n_cbs(&rdp->cblist); + if (was_alldone) { if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ - wake_nocb_gp(rdp, false); + wake_nocb_gp(rdp, false, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeEmpty")); } else { wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE, TPS("WakeEmptyIsDeferred")); + rcu_nocb_unlock_irqrestore(rdp, flags); } rdp->qlen_last_fqs_check = 0; } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ if (!irqs_disabled_flags(flags)) { - wake_nocb_gp(rdp, true); + wake_nocb_gp(rdp, true, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeOvf")); } else { wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); + rcu_nocb_unlock_irqrestore(rdp, flags); } rdp->qlen_last_fqs_check = LONG_MAX / 2; } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); + rcu_nocb_unlock_irqrestore(rdp, flags); } + if (!irqs_disabled_flags(flags)) + lockdep_assert_irqs_enabled(); return; } /* - * This is a helper for __call_rcu(), which invokes this when the normal - * callback queue is inoperable. If this is not a no-CBs CPU, this - * function returns failure back to __call_rcu(), which can complain - * appropriately. - * - * Otherwise, this function queues the callback where the corresponding - * "rcuo" kthread can find it. 
- */ -static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, - bool lazy, unsigned long flags) -{ - - if (!rcu_segcblist_is_offloaded(&rdp->cblist)) - return false; - __call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy, flags); - if (__is_kfree_rcu_offset((unsigned long)rhp->func)) - trace_rcu_kfree_callback(rcu_state.name, rhp, - (unsigned long)rhp->func, - -atomic_long_read(&rdp->nocb_q_count_lazy), - -rcu_get_n_cbs_nocb_cpu(rdp)); - else - trace_rcu_callback(rcu_state.name, rhp, - -atomic_long_read(&rdp->nocb_q_count_lazy), - -rcu_get_n_cbs_nocb_cpu(rdp)); - - return true; -} - -/* - * Adopt orphaned callbacks on a no-CBs CPU, or return 0 if this is - * not a no-CBs CPU. - */ -static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp, - struct rcu_data *rdp, - unsigned long flags) -{ - lockdep_assert_irqs_disabled(); - if (!rcu_segcblist_is_offloaded(&my_rdp->cblist)) - return false; /* Not NOCBs CPU, caller must migrate CBs. */ - __call_rcu_nocb_enqueue(my_rdp, rcu_segcblist_head(&rdp->cblist), - rcu_segcblist_tail(&rdp->cblist), - rcu_segcblist_n_cbs(&rdp->cblist), - rcu_segcblist_n_lazy_cbs(&rdp->cblist), flags); - rcu_segcblist_init(&rdp->cblist); - rcu_segcblist_disable(&rdp->cblist); - return true; -} - -/* - * If necessary, kick off a new grace period, and either way wait - * for a subsequent grace period to complete. - */ -static void rcu_nocb_wait_gp(struct rcu_data *rdp) -{ - unsigned long c; - bool d; - unsigned long flags; - bool needwake; - struct rcu_node *rnp = rdp->mynode; - - local_irq_save(flags); - c = rcu_seq_snap(&rcu_state.gp_seq); - if (!rdp->gpwrap && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { - local_irq_restore(flags); - } else { - raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ - needwake = rcu_start_this_gp(rnp, rdp, c); - raw_spin_unlock_irqrestore_rcu_node(rnp, flags); - if (needwake) - rcu_gp_kthread_wake(); - } - - /* - * Wait for the grace period. Do so interruptibly to avoid messing - * up the load average. - */ - trace_rcu_this_gp(rnp, rdp, c, TPS("StartWait")); - for (;;) { - swait_event_interruptible_exclusive( - rnp->nocb_gp_wq[rcu_seq_ctr(c) & 0x1], - (d = rcu_seq_done(&rnp->gp_seq, c))); - if (likely(d)) - break; - WARN_ON(signal_pending(current)); - trace_rcu_this_gp(rnp, rdp, c, TPS("ResumeWait")); - } - trace_rcu_this_gp(rnp, rdp, c, TPS("EndWait")); - smp_mb(); /* Ensure that CB invocation happens after GP end. */ -} - -/* - * No-CBs GP kthreads come here to wait for additional callbacks to show up. - * This function does not return until callbacks appear. + * No-CBs GP kthreads come here to wait for additional callbacks to show up + * or for grace periods to end. */ static void nocb_gp_wait(struct rcu_data *my_rdp) { - bool firsttime = true; + int __maybe_unused cpu = my_rdp->cpu; + unsigned long cur_gp_seq; unsigned long flags; bool gotcbs; + bool needwait_gp = false; + bool needwake; + bool needwake_gp; struct rcu_data *rdp; - struct rcu_head **tail; - - /* Wait for callbacks to appear. */ - if (!rcu_nocb_poll) { - trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Sleep")); - swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq, - !READ_ONCE(my_rdp->nocb_gp_sleep)); - raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); - my_rdp->nocb_gp_sleep = true; - WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); - del_timer(&my_rdp->nocb_timer); - raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); - } else if (firsttime) { - firsttime = false; /* Don't drown trace log with "Poll"! 
*/ - trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Poll")); - } + struct rcu_node *rnp; + unsigned long wait_gp_seq; /* - * Each pass through the following loop checks for CBs. - * We are our own first CB kthread. Any CBs found are moved to - * nocb_gp_head, where they await a grace period. + * Each pass through the following loop checks for CBs and for the + * nearest grace period (if any) to wait for next. The CB kthreads + * and the global grace-period kthread are awakened if needed. */ - gotcbs = false; - smp_mb(); /* wakeup and _sleep before ->nocb_head reads. */ for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { - rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head); - if (!rdp->nocb_gp_head) - continue; /* No CBs here, try next. */ - - /* Move callbacks to wait-for-GP list, which is empty. */ - WRITE_ONCE(rdp->nocb_head, NULL); - rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head); - gotcbs = true; - } - - /* No callbacks? Sleep a bit if polling, and go retry. */ - if (unlikely(!gotcbs)) { - WARN_ON(signal_pending(current)); - if (rcu_nocb_poll) { - schedule_timeout_interruptible(1); - } else { - trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, - TPS("WokeEmpty")); - } - return; - } - - /* Wait for one grace period. */ - rcu_nocb_wait_gp(my_rdp); - - /* Each pass through this loop wakes a CB kthread, if needed. */ - for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { - if (!rcu_nocb_poll && - READ_ONCE(rdp->nocb_head) && - READ_ONCE(my_rdp->nocb_gp_sleep)) { - raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); - my_rdp->nocb_gp_sleep = false;/* No need to sleep.*/ - raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); - } - if (!rdp->nocb_gp_head) - continue; /* No CBs, so no need to wake kthread. */ - - /* Append callbacks to CB kthread's "done" list. */ + if (rcu_segcblist_empty(&rdp->cblist)) + continue; /* No callbacks here, try next. */ + rnp = rdp->mynode; raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - tail = rdp->nocb_cb_tail; - rdp->nocb_cb_tail = rdp->nocb_gp_tail; - *tail = rdp->nocb_gp_head; - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - if (tail == &rdp->nocb_cb_head) { - /* List was empty, so wake up the kthread. */ - swake_up_one(&rdp->nocb_cb_wq); + WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(&my_rdp->nocb_timer); + raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ + needwake_gp = rcu_advance_cbs(rnp, rdp); + raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + // Need to wait on some grace period? + if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) { + if (!needwait_gp || + ULONG_CMP_LT(cur_gp_seq, wait_gp_seq)) + wait_gp_seq = cur_gp_seq; + needwait_gp = true; } + if (rcu_segcblist_ready_cbs(&rdp->cblist)) { + needwake = rdp->nocb_cb_sleep; + WRITE_ONCE(rdp->nocb_cb_sleep, false); + smp_mb(); /* CB invocation -after- GP end. */ + } else { + needwake = false; + } + raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + if (needwake) { + swake_up_one(&rdp->nocb_cb_wq); + gotcbs = true; + } + if (needwake_gp) + rcu_gp_kthread_wake(); } + + if (rcu_nocb_poll) { + /* Polling, so trace if first poll in the series. */ + if (gotcbs) + trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Poll")); + schedule_timeout_interruptible(1); + } else if (!needwait_gp) { + /* Wait for callbacks to appear. 
*/ + trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep")); + swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq, + !READ_ONCE(my_rdp->nocb_gp_sleep)); + } else { + rnp = my_rdp->mynode; + trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("StartWait")); + swait_event_interruptible_exclusive( + rnp->nocb_gp_wq[rcu_seq_ctr(wait_gp_seq) & 0x1], + rcu_seq_done(&rnp->gp_seq, wait_gp_seq) || + !READ_ONCE(my_rdp->nocb_gp_sleep)); + trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("EndWait")); + } + if (!rcu_nocb_poll) { + raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); + WRITE_ONCE(my_rdp->nocb_gp_sleep, true); + raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); + } + WARN_ON(signal_pending(current)); } /* @@ -1871,92 +1752,69 @@ static int rcu_nocb_gp_kthread(void *arg) { struct rcu_data *rdp = arg; - for (;;) + for (;;) { nocb_gp_wait(rdp); + cond_resched_tasks_rcu_qs(); + } return 0; } /* - * No-CBs CB kthreads come here to wait for additional callbacks to show up. - * This function returns true ("keep waiting") until callbacks appear and - * then false ("stop waiting") when callbacks finally do appear. + * Invoke any ready callbacks from the corresponding no-CBs CPU, + * then, if there are no more, wait for more to appear. */ -static bool nocb_cb_wait(struct rcu_data *rdp) +static void nocb_cb_wait(struct rcu_data *rdp) { + unsigned long flags; + bool needwake_gp = false; + struct rcu_node *rnp = rdp->mynode; + + local_irq_save(flags); + rcu_momentary_dyntick_idle(); + local_irq_restore(flags); + local_bh_disable(); + rcu_do_batch(rdp); + local_bh_enable(); + lockdep_assert_irqs_enabled(); + raw_spin_lock_irqsave(&rdp->nocb_lock, flags); + raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ + needwake_gp = rcu_advance_cbs(rdp->mynode, rdp); + raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + if (rcu_segcblist_ready_cbs(&rdp->cblist)) { + raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + if (needwake_gp) + rcu_gp_kthread_wake(); + return; + } + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep")); + WRITE_ONCE(rdp->nocb_cb_sleep, true); + raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + if (needwake_gp) + rcu_gp_kthread_wake(); swait_event_interruptible_exclusive(rdp->nocb_cb_wq, - READ_ONCE(rdp->nocb_cb_head)); - if (smp_load_acquire(&rdp->nocb_cb_head)) { /* VVV */ - /* ^^^ Ensure CB invocation follows _head test. */ - return false; + !READ_ONCE(rdp->nocb_cb_sleep)); + if (!smp_load_acquire(&rdp->nocb_cb_sleep)) { /* VVV */ + /* ^^^ Ensure CB invocation follows _sleep test. */ + return; } WARN_ON(signal_pending(current)); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty")); - return true; } /* - * Per-rcu_data kthread, but only for no-CBs CPUs. Each kthread invokes - * callbacks queued by the corresponding no-CBs CPU, however, there is an - * optional GP-CB relationship so that the grace-period kthreads don't - * have to do quite so many wakeups (as in they only need to wake the - * no-CBs GP kthreads, not the CB kthreads). + * Per-rcu_data kthread, but only for no-CBs CPUs. Repeatedly invoke + * nocb_cb_wait() to do the dirty work. */ static int rcu_nocb_cb_kthread(void *arg) { - int c, cl; - unsigned long flags; - struct rcu_head *list; - struct rcu_head *next; - struct rcu_head **tail; struct rcu_data *rdp = arg; - /* Each pass through this loop invokes one batch of callbacks */ + // Each pass through this loop does one callback batch, and, + // if there are no more ready callbacks, waits for them. for (;;) { - /* Wait for callbacks. 
*/ - while (nocb_cb_wait(rdp)) - continue; - - /* Pull the ready-to-invoke callbacks onto local list. */ - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); - list = rdp->nocb_cb_head; - rdp->nocb_cb_head = NULL; - tail = rdp->nocb_cb_tail; - rdp->nocb_cb_tail = &rdp->nocb_cb_head; - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); - if (WARN_ON_ONCE(!list)) - continue; - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeNonEmpty")); - - /* Each pass through the following loop invokes a callback. */ - trace_rcu_batch_start(rcu_state.name, - atomic_long_read(&rdp->nocb_q_count_lazy), - rcu_get_n_cbs_nocb_cpu(rdp), -1); - c = cl = 0; - while (list) { - next = list->next; - /* Wait for enqueuing to complete, if needed. */ - while (next == NULL && &list->next != tail) { - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, - TPS("WaitQueue")); - schedule_timeout_interruptible(1); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, - TPS("WokeQueue")); - next = list->next; - } - debug_rcu_head_unqueue(list); - local_bh_disable(); - if (__rcu_reclaim(rcu_state.name, list)) - cl++; - c++; - local_bh_enable(); - cond_resched_tasks_rcu_qs(); - list = next; - } - trace_rcu_batch_end(rcu_state.name, c, !!list, 0, 0, 1); - smp_mb__before_atomic(); /* _add after CB invocation. */ - atomic_long_add(-c, &rdp->nocb_q_count); - atomic_long_add(-cl, &rdp->nocb_q_count_lazy); + nocb_cb_wait(rdp); + cond_resched_tasks_rcu_qs(); } return 0; } @@ -1980,7 +1838,7 @@ static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp) } ndw = READ_ONCE(rdp->nocb_defer_wakeup); WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); - __wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); + wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake")); } @@ -2194,10 +2052,21 @@ static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) #else /* #ifdef CONFIG_RCU_NOCB_CPU */ -static bool rcu_nocb_cpu_needs_barrier(int cpu) +/* No ->nocb_lock to acquire. */ +static void rcu_nocb_lock(struct rcu_data *rdp) { - WARN_ON_ONCE(1); /* Should be dead code. */ - return false; +} + +/* No ->nocb_lock to release. */ +static void rcu_nocb_unlock(struct rcu_data *rdp) +{ +} + +/* No ->nocb_lock to release. */ +static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, + unsigned long flags) +{ + local_irq_restore(flags); } static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq) @@ -2213,17 +2082,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) { } -static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, - bool lazy, unsigned long flags) +static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, + unsigned long flags) { - return false; -} - -static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp, - struct rcu_data *rdp, - unsigned long flags) -{ - return false; + WARN_ON_ONCE(1); /* Should be dead code! */ } static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) From e7f4c5b3998a3cf1bd8dbf110948075b47ac9b78 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 07:18:00 -0700 Subject: [PATCH 56/86] rcu/nocb: Remove obsolete nocb_head and nocb_tail fields Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree.h | 3 --- kernel/rcu/tree_plugin.h | 1 - 2 files changed, 4 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 529eec2aa74d..74e3a4ab8095 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -194,8 +194,6 @@ struct rcu_data { /* 5) Callback offloading. */ #ifdef CONFIG_RCU_NOCB_CPU - struct rcu_head *nocb_head; /* CBs waiting for kthread. */ - struct rcu_head **nocb_tail; atomic_long_t nocb_q_count; /* # CBs waiting for nocb */ atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ struct rcu_head *nocb_cb_head; /* CBs ready to invoke. */ @@ -211,7 +209,6 @@ struct rcu_data { /* CBs waiting for GP. */ struct rcu_head **nocb_gp_tail; bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ - bool nocb_gp_forced; /* Forced nocb GP thread wakeup? */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ struct task_struct *nocb_cb_kthread; diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index feffc46cccb0..838e0caaf53a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1911,7 +1911,6 @@ void __init rcu_init_nohz(void) /* Initialize per-rcu_data variables for no-CBs CPUs. */ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) { - rdp->nocb_tail = &rdp->nocb_head; init_swait_queue_head(&rdp->nocb_cb_wq); init_swait_queue_head(&rdp->nocb_gp_wq); rdp->nocb_cb_tail = &rdp->nocb_cb_head; From c035280f1761b3336f4dad336906c19735d7ba5f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 08:28:41 -0700 Subject: [PATCH 57/86] rcu/nocb: Remove obsolete nocb_q_count and nocb_q_count_lazy fields This commit removes the obsolete nocb_q_count and nocb_q_count_lazy fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again disable the ->cblist fields of offline CPUs. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 6 +++--- kernel/rcu/tree.h | 3 --- kernel/rcu/tree_plugin.h | 14 -------------- 3 files changed, 3 insertions(+), 20 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 054418d2d960..e5f30b364276 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -210,10 +210,9 @@ static long rcu_get_n_cbs_cpu(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); - if (rcu_segcblist_is_enabled(&rdp->cblist) && - !rcu_segcblist_is_offloaded(&rdp->cblist)) /* Online normal CPU? */ + if (rcu_segcblist_is_enabled(&rdp->cblist)) return rcu_segcblist_n_cbs(&rdp->cblist); - return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */ + return 0; } void rcu_softirq_qs(void) @@ -3181,6 +3180,7 @@ void rcutree_migrate_callbacks(int cpu) needwake = rcu_advance_cbs(my_rnp, rdp) || rcu_advance_cbs(my_rnp, my_rdp); rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist); + rcu_segcblist_disable(&rdp->cblist); WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) != !rcu_segcblist_n_cbs(&my_rdp->cblist)); if (rcu_segcblist_is_offloaded(&my_rdp->cblist)) { diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 74e3a4ab8095..d1df192272fb 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -194,8 +194,6 @@ struct rcu_data { /* 5) Callback offloading. */ #ifdef CONFIG_RCU_NOCB_CPU - atomic_long_t nocb_q_count; /* # CBs waiting for nocb */ - atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ struct rcu_head *nocb_cb_head; /* CBs ready to invoke. 
*/ struct rcu_head **nocb_cb_tail; struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */ @@ -437,7 +435,6 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ -static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp); static void rcu_bind_gp_kthread(void); static bool rcu_nohz_full_cpu(void); static void rcu_dynticks_task_enter(void); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 838e0caaf53a..458838c63a6c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2040,15 +2040,6 @@ void rcu_bind_current_to_nocb(void) } EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb); -/* - * Return the number of RCU callbacks still queued from the specified - * CPU, which must be a nocbs CPU. - */ -static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) -{ - return atomic_long_read(&rdp->nocb_q_count); -} - #else /* #ifdef CONFIG_RCU_NOCB_CPU */ /* No ->nocb_lock to acquire. */ @@ -2108,11 +2099,6 @@ static void __init rcu_spawn_nocb_kthreads(void) { } -static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp) -{ - return 0; -} - #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ /* From 2a777de757f4c7050997c6232a585eff59c5ea36 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 09:10:24 -0700 Subject: [PATCH 58/86] rcu/nocb: Remove obsolete nocb_cb_tail and nocb_cb_head fields Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 2 -- kernel/rcu/tree_plugin.h | 1 - 2 files changed, 3 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index d1df192272fb..6e4cf7de303f 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -194,8 +194,6 @@ struct rcu_data { /* 5) Callback offloading. */ #ifdef CONFIG_RCU_NOCB_CPU - struct rcu_head *nocb_cb_head; /* CBs ready to invoke. */ - struct rcu_head **nocb_cb_tail; struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_gp_kthread; raw_spinlock_t nocb_lock; /* Guard following pair of fields. */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 458838c63a6c..1847fffdfa0a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1913,7 +1913,6 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) { init_swait_queue_head(&rdp->nocb_cb_wq); init_swait_queue_head(&rdp->nocb_gp_wq); - rdp->nocb_cb_tail = &rdp->nocb_cb_head; raw_spin_lock_init(&rdp->nocb_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); } From 4f9c1bc727f917c8c32ee1decc88e89057e0dffc Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 09:20:10 -0700 Subject: [PATCH 59/86] rcu/nocb: Remove obsolete nocb_gp_head and nocb_gp_tail fields Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 6e4cf7de303f..c12e85c12310 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -201,10 +201,8 @@ struct rcu_data { struct timer_list nocb_timer; /* Enforce finite deferral. */ /* The following fields are used by GP kthread, hence own cacheline. */ - struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp; - /* CBs waiting for GP. */ - struct rcu_head **nocb_gp_tail; - bool nocb_gp_sleep; /* Is the nocb GP thread asleep? 
*/ + bool nocb_gp_sleep ____cacheline_internodealigned_in_smp; + /* Is the nocb GP thread asleep? */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ struct task_struct *nocb_cb_kthread; From ec5ef87bac820be8ae9cc0a95108cded039ed8ef Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 13:03:49 -0700 Subject: [PATCH 60/86] rcu/nocb: Use build-time no-CBs check in rcu_do_batch() Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time it needs to know whether the current CPU is a no-CBs CPU. Given that it is not possible to change the no-CBs status of a CPU after boot, and given that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n kernels, this per-callback invocation wastes CPU. This commit therefore created a const on-stack variable to allow this check to be done only once per rcu_do_batch() invocation. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e5f30b364276..16dabd6b36d7 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2087,6 +2087,8 @@ int rcutree_dead_cpu(unsigned int cpu) static void rcu_do_batch(struct rcu_data *rdp) { unsigned long flags; + const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + rcu_segcblist_is_offloaded(&rdp->cblist); struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count; @@ -2128,12 +2130,11 @@ static void rcu_do_batch(struct rcu_data *rdp) * Stop only if limit reached and CPU has something to do. * Note: The rcl structure counts down from zero. */ - if (-rcl.len >= bl && - !rcu_segcblist_is_offloaded(&rdp->cblist) && + if (-rcl.len >= bl && !offloaded && (need_resched() || (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) break; - if (rcu_segcblist_is_offloaded(&rdp->cblist)) { + if (offloaded) { WARN_ON_ONCE(in_serving_softirq()); local_bh_enable(); lockdep_assert_irqs_enabled(); @@ -2175,8 +2176,7 @@ static void rcu_do_batch(struct rcu_data *rdp) rcu_nocb_unlock_irqrestore(rdp, flags); /* Re-invoke RCU core processing if there are callbacks remaining. */ - if (!rcu_segcblist_is_offloaded(&rdp->cblist) && - rcu_segcblist_ready_cbs(&rdp->cblist)) + if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist)) invoke_rcu_core(); } From c1ab99d66ebcebedd9d416a840c488eaf079f3e9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 13:39:15 -0700 Subject: [PATCH 61/86] rcu/nocb: Use build-time no-CBs check in rcu_core() Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it needs to know whether the current CPU is a no-CBs CPU. Given that it is not possible to change the no-CBs status of a CPU after boot, and given that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n kernels, this repeated runtime invocation wastes CPU. This commit therefore created a const on-stack variable to allow this check to be done only once per rcu_core() invocation. Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 16dabd6b36d7..14939273d120 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2302,6 +2302,8 @@ static __latent_entropy void rcu_core(void) unsigned long flags; struct rcu_data *rdp = raw_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; + const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + rcu_segcblist_is_offloaded(&rdp->cblist); if (cpu_is_offline(smp_processor_id())) return; @@ -2321,8 +2323,7 @@ static __latent_entropy void rcu_core(void) /* No grace period and unregistered callbacks? */ if (!rcu_gp_in_progress() && - rcu_segcblist_is_enabled(&rdp->cblist) && - !rcu_segcblist_is_offloaded(&rdp->cblist)) { + rcu_segcblist_is_enabled(&rdp->cblist) && !offloaded) { local_irq_save(flags); if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) rcu_accelerate_cbs_unlocked(rnp, rdp); @@ -2332,8 +2333,7 @@ static __latent_entropy void rcu_core(void) rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check()); /* If there are callbacks ready, invoke them. */ - if (!rcu_segcblist_is_offloaded(&rdp->cblist) && - rcu_segcblist_ready_cbs(&rdp->cblist) && + if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist) && likely(READ_ONCE(rcu_scheduler_fully_active))) rcu_do_batch(rdp); From 921bb5fad11c0e8ec5f7625547552b252281f4de Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 May 2019 13:53:28 -0700 Subject: [PATCH 62/86] rcu/nocb: Use build-time no-CBs check in rcu_pending() Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded. Given that rcu_pending() is on a fastpath, it makes sense to check for CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 14939273d120..fb6b80aa34f6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2773,7 +2773,8 @@ static int rcu_pending(void) /* Has RCU gone idle with this CPU needing another grace period? */ if (!rcu_gp_in_progress() && rcu_segcblist_is_enabled(&rdp->cblist) && - !rcu_segcblist_is_offloaded(&rdp->cblist) && + (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) || + !rcu_segcblist_is_offloaded(&rdp->cblist)) && !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) return 1; From 969974e5c51e717fc9070b00eb2f61ae589ed13d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 22 May 2019 09:35:11 -0700 Subject: [PATCH 63/86] rcu/nocb: Suppress uninitialized false-positive in nocb_gp_wait() Some compilers complain that wait_gp_seq might be used uninitialized in nocb_gp_wait(). This cannot actually happen because when wait_gp_seq is uninitialized, needwait_gp must be false, which prevents wait_gp_seq from being used. But this analysis is apparently beyond some compilers, so this commit adds a bogus initialization of wait_gp_seq for the sole purpose of suppressing the false-positive warning. Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree_plugin.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1847fffdfa0a..c1dfbac8cd39 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1668,12 +1668,12 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) unsigned long cur_gp_seq; unsigned long flags; bool gotcbs; - bool needwait_gp = false; + bool needwait_gp = false; // This prevents actual uninitialized use. bool needwake; bool needwake_gp; struct rcu_data *rdp; struct rcu_node *rnp; - unsigned long wait_gp_seq; + unsigned long wait_gp_seq = 0; // Suppress "use uninitialized" warning. /* * Each pass through the following loop checks for CBs and for the From 0bd55c693617cd2488378d011b66b92e1dd66ecf Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 12 Aug 2019 10:28:08 -0700 Subject: [PATCH 64/86] rcu/nohz: Turn off tick for offloaded CPUs Historically, no-CBs CPUs allowed the scheduler-clock tick to be unconditionally disabled on any transition to idle or nohz_full userspace execution (see the rcu_needs_cpu() implementations). Unfortunately, the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs use ->cblist, which might make users of battery-powered devices rather unhappy. This commit therefore adds explicit rcu_segcblist_is_offloaded() checks to return to the historical energy-efficient semantics. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c1dfbac8cd39..ae927710d670 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1226,10 +1226,10 @@ static void rcu_prepare_kthreads(int cpu) #if !defined(CONFIG_RCU_FAST_NO_HZ) /* - * Check to see if any future RCU-related work will need to be done - * by the current CPU, even if none need be done immediately, returning - * 1 if so. This function is part of the RCU implementation; it is -not- - * an exported member of the RCU API. + * Check to see if any future non-offloaded RCU-related work will need + * to be done by the current CPU, even if none need be done immediately, + * returning 1 if so. This function is part of the RCU implementation; + * it is -not- an exported member of the RCU API. * * Because we not have RCU_FAST_NO_HZ, just check whether or not this * CPU has RCU callbacks queued. @@ -1237,7 +1237,8 @@ static void rcu_prepare_kthreads(int cpu) int rcu_needs_cpu(u64 basemono, u64 *nextevt) { *nextevt = KTIME_MAX; - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist); + return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) && + !rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist); } /* @@ -1338,8 +1339,9 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt) lockdep_assert_irqs_disabled(); - /* If no callbacks, RCU doesn't need the CPU. */ - if (rcu_segcblist_empty(&rdp->cblist)) { + /* If no non-offloaded callbacks, RCU doesn't need the CPU. */ + if (rcu_segcblist_empty(&rdp->cblist) || + rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist)) { *nextevt = KTIME_MAX; return 0; } From aeeacd9d844b2219d47e9010298b635c68a2a4c9 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Thu, 23 May 2019 10:43:58 -0700 Subject: [PATCH 65/86] rcu/nocb: Enable re-awakening under high callback load The __call_rcu_nocb_wake() function and its predecessors set ->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2 for forced reawakenings. The former can result in a too-quick reawakening when there are many callbacks ready to invoke and the latter prevents a second reawakening. This commit therefore sets ->qlen_last_fqs_check to the current number of callbacks in both cases. While in the area, this commit also moves both assignments under ->nocb_lock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index ae927710d670..7077ef7bea96 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1628,6 +1628,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, // Need to actually to a wakeup. len = rcu_segcblist_n_cbs(&rdp->cblist); if (was_alldone) { + rdp->qlen_last_fqs_check = len; if (!irqs_disabled_flags(flags)) { /* ... if queue was empty ... */ wake_nocb_gp(rdp, false, flags); @@ -1638,9 +1639,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, TPS("WakeEmptyIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); } - rdp->qlen_last_fqs_check = 0; } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ + rdp->qlen_last_fqs_check = len; if (!irqs_disabled_flags(flags)) { wake_nocb_gp(rdp, true, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, @@ -1650,7 +1651,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, TPS("WakeOvfIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); } - rdp->qlen_last_fqs_check = LONG_MAX / 2; } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); From 383e13328373ae1e17119ff89c86ff5f9413f31c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 23 May 2019 13:49:26 -0700 Subject: [PATCH 66/86] rcu/nocb: Never downgrade ->nocb_defer_wakeup in wake_nocb_gp_defer() Currently, wake_nocb_gp_defer() simply stores whatever waketype was passed in, which can result in a RCU_NOCB_WAKE_FORCE being downgraded to RCU_NOCB_WAKE, which could in turn delay callback processing. This commit therefore adds a check so that wake_nocb_gp_defer() only updates ->nocb_defer_wakeup when the update increases the forcefulness, thus avoiding downgrades. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 7077ef7bea96..b9e00660af60 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1600,7 +1600,8 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, { if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) mod_timer(&rdp->nocb_timer, jiffies + 1); - WRITE_ONCE(rdp->nocb_defer_wakeup, waketype); + if (rdp->nocb_defer_wakeup < waketype) + WRITE_ONCE(rdp->nocb_defer_wakeup, waketype); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason); } From ce0a825e40606d6dbe6dfe90d4d4c0ccc9fa3bde Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Thu, 23 May 2019 13:56:12 -0700 Subject: [PATCH 67/86] rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks It might be hard to imagine having more than two billion callbacks queued on a single CPU's ->cblist, but someone will do it sometime. This commit therefore makes __call_rcu_nocb_wake() handle this situation by upgrading local variable "len" from "int" to "long". Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b9e00660af60..5cbc0848647c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1615,7 +1615,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, unsigned long flags) __releases(rdp->nocb_lock) { - int len; + long len; struct task_struct *t; // If we are being polled or there is no kthread, just leave. From 7f36ef82e5cf0b401c2676fb3e56ad0633ed6ad5 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 May 2019 05:54:26 -0700 Subject: [PATCH 68/86] rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread Currently, the code provides an extra wakeup for the no-CBs grace-period kthread if one of its CPUs is generating excessive numbers of callbacks. But satisfying though it is to wake something up when things are going south, unless the thing being awakened can actually help solve the problem, that extra wakeup does nothing but consume additional CPU time, which is exactly what you don't want during a call_rcu() flood. This commit therefore avoids doing anything if the corresponding no-CBs callback kthread is going full tilt. Otherwise, if advancing callbacks immediately might help and if the leaf rcu_node structure's lock is immediately available, this commit invokes a new variant of rcu_advance_cbs() that advances callbacks only if doing so won't require awakening the grace-period kthread (not to be confused with any of the no-CBs grace-period kthreads). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 15 +++++++++++++++ kernel/rcu/tree_plugin.h | 13 +++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index fb6b80aa34f6..a6ddfae6978d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1334,6 +1334,19 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp) return rcu_accelerate_cbs(rnp, rdp); } +/* + * Move and classify callbacks, but only if doing so won't require + * that the RCU grace-period kthread be awakened. + */ +static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, + struct rcu_data *rdp) +{ + raw_lockdep_assert_held_rcu_node(rnp); + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq))) + return; + WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp)); +} + /* * Update CPU-local rcu_data state to record the beginnings and ends of * grace periods. The caller must hold the ->lock of the leaf rcu_node @@ -2118,6 +2131,8 @@ static void rcu_do_batch(struct rcu_data *rdp) rcu_segcblist_n_lazy_cbs(&rdp->cblist), rcu_segcblist_n_cbs(&rdp->cblist), bl); rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl); + if (offloaded) + rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist); rcu_nocb_unlock_irqrestore(rdp, flags); /* Invoke callbacks. 
*/ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5cbc0848647c..c10afe778430 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1643,10 +1643,15 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ rdp->qlen_last_fqs_check = len; - if (!irqs_disabled_flags(flags)) { - wake_nocb_gp(rdp, true, flags); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, - TPS("WakeOvf")); + if (!rdp->nocb_cb_sleep && + rcu_segcblist_ready_cbs(&rdp->cblist)) { + // Already going full tilt, so don't try to rewake. + rcu_nocb_unlock_irqrestore(rdp, flags); + } else if (rcu_segcblist_pend_cbs(&rdp->cblist) && + raw_spin_trylock_rcu_node(rdp->mynode)) { + rcu_advance_cbs_nowake(rdp->mynode, rdp); + raw_spin_unlock_rcu_node(rdp->mynode); + rcu_nocb_unlock_irqrestore(rdp, flags); } else { wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); From 81c0b3d724f419c0524f432c1ac22b9f518c2899 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 28 May 2019 07:18:08 -0700 Subject: [PATCH 69/86] rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU A given rcu_data structure's ->nocb_lock can be acquired very frequently by the corresponding CPU and occasionally by the corresponding no-CBs grace-period and callbacks kthreads. In particular, these two kthreads will have frequent gaps between ->nocb_lock acquisitions that are roughly a grace period in duration. This means that any excessive ->nocb_lock contention will be due to the CPU's acquisitions, and this in turn enables a very naive contention-avoidance strategy to be quite effective. This commit therefore modifies rcu_nocb_lock() to first attempt a raw_spin_trylock(), and to atomically increment a separate ->nocb_lock_contended across a raw_spin_lock(). This new ->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when interrupts are enabled, with a spin-wait for contending acquisitions to complete, thus allowing the kthreads a chance to acquire the lock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 18 ++++++++++- kernel/rcu/tree_plugin.h | 68 ++++++++++++++++++++++++++-------------- 2 files changed, 62 insertions(+), 24 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index c12e85c12310..7062f9d9c053 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -197,6 +197,7 @@ struct rcu_data { struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_gp_kthread; raw_spinlock_t nocb_lock; /* Guard following pair of fields. */ + atomic_t nocb_lock_contended; /* Contention experienced. */ int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ struct timer_list nocb_timer; /* Enforce finite deferral. */ @@ -430,7 +431,22 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, unsigned long flags); #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); -#endif /* #ifdef CONFIG_RCU_NOCB_CPU */ +#define rcu_nocb_lock_irqsave(rdp, flags) \ +do { \ + if (!rcu_segcblist_is_offloaded(&(rdp)->cblist)) { \ + local_irq_save(flags); \ + } else if (!raw_spin_trylock_irqsave(&(rdp)->nocb_lock, (flags))) {\ + atomic_inc(&(rdp)->nocb_lock_contended); \ + smp_mb__after_atomic(); /* atomic_inc() before lock. */ \ + raw_spin_lock_irqsave(&(rdp)->nocb_lock, (flags)); \ + smp_mb__before_atomic(); /* atomic_dec() after lock. 
*/ \ + atomic_dec(&(rdp)->nocb_lock_contended); \ + } \ +} while (0) +#else /* #ifdef CONFIG_RCU_NOCB_CPU */ +#define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags) +#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ + static void rcu_bind_gp_kthread(void); static bool rcu_nohz_full_cpu(void); static void rcu_dynticks_task_enter(void); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c10afe778430..5f0894cec75d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1498,14 +1498,36 @@ early_param("rcu_nocb_poll", parse_rcu_nocb_poll); /* * Acquire the specified rcu_data structure's ->nocb_lock, but only - * if it corresponds to a no-CBs CPU. + * if it corresponds to a no-CBs CPU. If the lock isn't immediately + * available, increment ->nocb_lock_contended to flag the contention. */ static void rcu_nocb_lock(struct rcu_data *rdp) { - if (rcu_segcblist_is_offloaded(&rdp->cblist)) { - lockdep_assert_irqs_disabled(); - raw_spin_lock(&rdp->nocb_lock); - } + lockdep_assert_irqs_disabled(); + if (!rcu_segcblist_is_offloaded(&rdp->cblist) || + raw_spin_trylock(&rdp->nocb_lock)) + return; + atomic_inc(&rdp->nocb_lock_contended); + smp_mb__after_atomic(); /* atomic_inc() before lock. */ + raw_spin_lock(&rdp->nocb_lock); + smp_mb__before_atomic(); /* atomic_dec() after lock. */ + atomic_dec(&rdp->nocb_lock_contended); +} + +/* + * Spinwait until the specified rcu_data structure's ->nocb_lock is + * not contended. Please note that this is extremely special-purpose, + * relying on the fact that at most two kthreads and one CPU contend for + * this lock, and also that the two kthreads are guaranteed to have frequent + * grace-period-duration time intervals between successive acquisitions + * of the lock. This allows us to use an extremely simple throttling + * mechanism, and further to apply it only to the CPU doing floods of + * call_rcu() invocations. Don't try this at home! + */ +static void rcu_nocb_wait_contended(struct rcu_data *rdp) +{ + while (atomic_read(&rdp->nocb_lock_contended)) + cpu_relax(); } /* @@ -1575,19 +1597,19 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force, lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); return; } if (READ_ONCE(rdp_gp->nocb_gp_sleep) || force) { del_timer(&rdp->nocb_timer); - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); smp_mb(); /* enqueue before ->nocb_gp_sleep. */ - raw_spin_lock_irqsave(&rdp_gp->nocb_lock, flags); + rcu_nocb_lock_irqsave(rdp_gp, flags); WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); - raw_spin_unlock_irqrestore(&rdp_gp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp_gp, flags); wake_up_process(rdp_gp->nocb_gp_kthread); } else { - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); } } @@ -1646,23 +1668,23 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, if (!rdp->nocb_cb_sleep && rcu_segcblist_ready_cbs(&rdp->cblist)) { // Already going full tilt, so don't try to rewake. 
- rcu_nocb_unlock_irqrestore(rdp, flags); } else if (rcu_segcblist_pend_cbs(&rdp->cblist) && raw_spin_trylock_rcu_node(rdp->mynode)) { rcu_advance_cbs_nowake(rdp->mynode, rdp); raw_spin_unlock_rcu_node(rdp->mynode); - rcu_nocb_unlock_irqrestore(rdp, flags); } else { wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); - rcu_nocb_unlock_irqrestore(rdp, flags); } + rcu_nocb_unlock_irqrestore(rdp, flags); } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); } - if (!irqs_disabled_flags(flags)) + if (!irqs_disabled_flags(flags)) { lockdep_assert_irqs_enabled(); + rcu_nocb_wait_contended(rdp); + } return; } @@ -1692,7 +1714,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) if (rcu_segcblist_empty(&rdp->cblist)) continue; /* No callbacks here, try next. */ rnp = rdp->mynode; - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); + rcu_nocb_lock_irqsave(rdp, flags); WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); del_timer(&my_rdp->nocb_timer); raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ @@ -1712,7 +1734,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) } else { needwake = false; } - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake) { swake_up_one(&rdp->nocb_cb_wq); gotcbs = true; @@ -1741,9 +1763,9 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("EndWait")); } if (!rcu_nocb_poll) { - raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); + rcu_nocb_lock_irqsave(my_rdp, flags); WRITE_ONCE(my_rdp->nocb_gp_sleep, true); - raw_spin_unlock_irqrestore(&my_rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(my_rdp, flags); } WARN_ON(signal_pending(current)); } @@ -1784,12 +1806,12 @@ static void nocb_cb_wait(struct rcu_data *rdp) rcu_do_batch(rdp); local_bh_enable(); lockdep_assert_irqs_enabled(); - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); + rcu_nocb_lock_irqsave(rdp, flags); raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ needwake_gp = rcu_advance_cbs(rdp->mynode, rdp); raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ if (rcu_segcblist_ready_cbs(&rdp->cblist)) { - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake_gp) rcu_gp_kthread_wake(); return; @@ -1797,7 +1819,7 @@ static void nocb_cb_wait(struct rcu_data *rdp) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep")); WRITE_ONCE(rdp->nocb_cb_sleep, true); - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake_gp) rcu_gp_kthread_wake(); swait_event_interruptible_exclusive(rdp->nocb_cb_wq, @@ -1839,9 +1861,9 @@ static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp) unsigned long flags; int ndw; - raw_spin_lock_irqsave(&rdp->nocb_lock, flags); + rcu_nocb_lock_irqsave(rdp, flags); if (!rcu_nocb_need_deferred_wakeup(rdp)) { - raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); + rcu_nocb_unlock_irqrestore(rdp, flags); return; } ndw = READ_ONCE(rdp->nocb_defer_wakeup); From 9fcb09bddd56bae42319b606bae86e85c625f868 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 1 Jun 2019 05:14:47 -0700 Subject: [PATCH 70/86] rcu/nocb: Round down for number of no-CBs grace-period kthreads Currently, when the square root of the number of CPUs is rounded down by int_sqrt(), this round-down is applied to the number of callback kthreads per grace-period kthreads. 
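(For concreteness, and assuming one grace-period kthread per stride's worth of CPUs: the current computation below is ls = int_sqrt(nr_cpu_ids), so an eight-CPU system gets a stride of 2 and thus four grace-period kthreads; the new computation is ls = nr_cpu_ids / int_sqrt(nr_cpu_ids), giving a stride of 4 and thus two grace-period kthreads.)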
This makes almost no difference for large systems, but results in oddities such as three no-CBs grace-period kthreads for a five-CPU system, which is a bit excessive. This commit therefore causes the round-down to apply to the number of no-CBs grace-period kthreads, so that systems with four to eight CPUs have only two no-CBs grace-period kthreads. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5f0894cec75d..12212764ecd8 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2028,7 +2028,7 @@ static void __init rcu_organize_nocb_kthreads(void) if (!cpumask_available(rcu_nocb_mask)) return; if (ls == -1) { - ls = int_sqrt(nr_cpu_ids); + ls = nr_cpu_ids / int_sqrt(nr_cpu_ids); rcu_nocb_gp_stride = ls; } From 6608c3a027bcc0b34cc02bc764ea9f52b9dce46f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 1 Jun 2019 06:16:38 -0700 Subject: [PATCH 71/86] rcu/nocb: Reduce contention at no-CBs registry-time CB advancement Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node structure's ->lock, and only afterwards does rcu_advance_cbs_nowake() check to see if it is possible to advance callbacks without potentially needing to awaken the grace-period kthread. Given that the no-awaken check can be done locklessly, this commit reverses the order, so that rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node structure's ->lock and itself checks the grace-period state before conditionally acquiring that lock, thus reducing the number of needless acquisitions of the leaf rcu_node structure's ->lock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 5 +++-- kernel/rcu/tree_plugin.h | 4 +--- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a6ddfae6978d..ec320658aeef 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1341,10 +1341,11 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp) static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, struct rcu_data *rdp) { - raw_lockdep_assert_held_rcu_node(rnp); - if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq))) + if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || + !raw_spin_trylock_rcu_node(rnp)) return; WARN_ON_ONCE(rcu_advance_cbs(rnp, rdp)); + raw_spin_unlock_rcu_node(rnp); } /* diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 12212764ecd8..09c15f619e78 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1668,10 +1668,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, if (!rdp->nocb_cb_sleep && rcu_segcblist_ready_cbs(&rdp->cblist)) { // Already going full tilt, so don't try to rewake. - } else if (rcu_segcblist_pend_cbs(&rdp->cblist) && - raw_spin_trylock_rcu_node(rdp->mynode)) { + } else if (rcu_segcblist_pend_cbs(&rdp->cblist)) { rcu_advance_cbs_nowake(rdp->mynode, rdp); - raw_spin_unlock_rcu_node(rdp->mynode); } else { wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); From 523bddd553c09a2cf051eb724bffba680424f5ec Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 1 Jun 2019 13:33:55 -0700 Subject: [PATCH 72/86] rcu/nocb: Reduce contention at no-CBs invocation-done time Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node ->lock to advance callbacks when done invoking the previous batch.
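Schematically, the pre-change sequence in nocb_cb_wait() looks like this (simplified from the diff below; not new code):

	rcu_nocb_lock_irqsave(rdp, flags);	/* ->nocb_lock is acquired first... */
	raw_spin_lock_rcu_node(rnp);		/* ...then the leaf rcu_node ->lock, nested inside. */
	needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
	raw_spin_unlock_rcu_node(rnp);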
It does this while holding ->nocb_lock, which means that contention on the leaf rcu_node ->lock visits itself on the ->nocb_lock. This commit therefore makes this lock acquisition conditional, forgoing callback advancement when the leaf rcu_node ->lock is not immediately available. (In this case, the no-CBs grace-period kthread will eventually do any needed callback advancement.) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 09c15f619e78..c4cbfb5dc48d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1805,9 +1805,10 @@ static void nocb_cb_wait(struct rcu_data *rdp) local_bh_enable(); lockdep_assert_irqs_enabled(); rcu_nocb_lock_irqsave(rdp, flags); - raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ - needwake_gp = rcu_advance_cbs(rdp->mynode, rdp); - raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + if (raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */ + needwake_gp = rcu_advance_cbs(rdp->mynode, rdp); + raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + } if (rcu_segcblist_ready_cbs(&rdp->cblist)) { rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake_gp) From 4fd8c5f153bc41ae847b9ddb1539b34f70c18278 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 2 Jun 2019 13:41:08 -0700 Subject: [PATCH 73/86] rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock The sleep/wakeup of the no-CBs grace-period kthreads is synchronized using the ->nocb_lock of the first CPU corresponding to that kthread. This commit provides a separate ->nocb_gp_lock for this purpose, thus reducing contention on ->nocb_lock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.h | 3 ++- kernel/rcu/tree_plugin.h | 9 +++++---- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 7062f9d9c053..2c3e9068671c 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -202,7 +202,8 @@ struct rcu_data { struct timer_list nocb_timer; /* Enforce finite deferral. */ /* The following fields are used by GP kthread, hence own cacheline. */ - bool nocb_gp_sleep ____cacheline_internodealigned_in_smp; + raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp; + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c4cbfb5dc48d..e92bc39c4008 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1604,9 +1604,9 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force, del_timer(&rdp->nocb_timer); rcu_nocb_unlock_irqrestore(rdp, flags); smp_mb(); /* enqueue before ->nocb_gp_sleep. 
*/ - rcu_nocb_lock_irqsave(rdp_gp, flags); + raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); - rcu_nocb_unlock_irqrestore(rdp_gp, flags); + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); wake_up_process(rdp_gp->nocb_gp_kthread); } else { rcu_nocb_unlock_irqrestore(rdp, flags); @@ -1761,9 +1761,9 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("EndWait")); } if (!rcu_nocb_poll) { - rcu_nocb_lock_irqsave(my_rdp, flags); + raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags); WRITE_ONCE(my_rdp->nocb_gp_sleep, true); - rcu_nocb_unlock_irqrestore(my_rdp, flags); + raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags); } WARN_ON(signal_pending(current)); } @@ -1943,6 +1943,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) init_swait_queue_head(&rdp->nocb_cb_wq); init_swait_queue_head(&rdp->nocb_gp_wq); raw_spin_lock_init(&rdp->nocb_lock); + raw_spin_lock_init(&rdp->nocb_gp_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); } From faca5c250935262f026cac1bb951a0f7672474b8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 26 Jun 2019 09:50:38 -0700 Subject: [PATCH 74/86] rcu/nocb: Unconditionally advance and wake for excessive CBs When there are excessive numbers of callbacks, and when either the corresponding no-CBs callback kthread is asleep or there are no more ready-to-invoke callbacks, and when at least one callback is pending, __call_rcu_nocb_wake() will advance the callbacks, but refrain from awakening the corresponding no-CBs grace-period kthread. However, because rcu_advance_cbs_nowake() is used, it is possible (if a bit unlikely) that the needed advancement could not happen due to a grace period not being in progress. Plus there will always be at least one pending callback due to one having just now been enqueued. This commit therefore attempts to advance callbacks and awakens the no-CBs grace-period kthread when there are excessive numbers of callbacks posted and when the no-CBs callback kthread is not in a position to do anything helpful. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e92bc39c4008..4e49bb359464 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1668,13 +1668,19 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, if (!rdp->nocb_cb_sleep && rcu_segcblist_ready_cbs(&rdp->cblist)) { // Already going full tilt, so don't try to rewake. - } else if (rcu_segcblist_pend_cbs(&rdp->cblist)) { - rcu_advance_cbs_nowake(rdp->mynode, rdp); + rcu_nocb_unlock_irqrestore(rdp, flags); } else { - wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, - TPS("WakeOvfIsDeferred")); + rcu_advance_cbs_nowake(rdp->mynode, rdp); + if (!irqs_disabled_flags(flags)) { + wake_nocb_gp(rdp, false, flags); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("WakeOvf")); + } else { + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, + TPS("WakeOvfIsDeferred")); + rcu_nocb_unlock_irqrestore(rdp, flags); + } } - rcu_nocb_unlock_irqrestore(rdp, flags); } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); From eda669a6a2c517fd6db41d0fe3c95c1b749c60bd Mon Sep 17 00:00:00 2001 From: "Paul E.
McKenney" Date: Mon, 1 Jul 2019 17:36:53 -0700 Subject: [PATCH 75/86] rcu/nocb: Atomic ->len field in rcu_segcblist structure Upcoming ->nocb_lock contention-reduction work requires that the rcu_segcblist structure's ->len field be concurrently manipulated, but only if there are no-CBs CPUs in the kernel. This commit therefore makes this ->len field be an atomic_long_t, but only in CONFIG_RCU_NOCB_CPU=y kernels. Signed-off-by: Paul E. McKenney --- include/linux/rcu_segcblist.h | 4 ++ kernel/rcu/rcu_segcblist.c | 86 ++++++++++++++++++++++++++++++++--- kernel/rcu/rcu_segcblist.h | 12 ++++- 3 files changed, 94 insertions(+), 8 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index 8b684888f71d..646759042333 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -68,7 +68,11 @@ struct rcu_segcblist { struct rcu_head *head; struct rcu_head **tails[RCU_CBLIST_NSEGS]; unsigned long gp_seq[RCU_CBLIST_NSEGS]; +#ifdef CONFIG_RCU_NOCB_CPU + atomic_long_t len; +#else long len; +#endif long len_lazy; u8 enabled; u8 offloaded; diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 92968b856593..ff431cc83037 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -23,6 +23,19 @@ void rcu_cblist_init(struct rcu_cblist *rclp) rclp->len_lazy = 0; } +/* + * Enqueue an rcu_head structure onto the specified callback list. + * This function assumes that the callback is non-lazy because it + * is intended for use by no-CBs CPUs, which do not distinguish + * between lazy and non-lazy RCU callbacks. + */ +void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp) +{ + *rclp->tail = rhp; + rclp->tail = &rhp->next; + WRITE_ONCE(rclp->len, rclp->len + 1); +} + /* * Dequeue the oldest rcu_head structure from the specified callback * list. This function assumes that the callback is non-lazy, but @@ -44,6 +57,67 @@ struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp) return rhp; } +/* Set the length of an rcu_segcblist structure. */ +void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) +{ +#ifdef CONFIG_RCU_NOCB_CPU + atomic_long_set(&rsclp->len, v); +#else + WRITE_ONCE(rsclp->len, v); +#endif +} + +/* + * Increase the numeric length of an rcu_segcblist structure by the + * specified amount, which can be negative. This can cause the ->len + * field to disagree with the actual number of callbacks on the structure. + * This increase is fully ordered with respect to the callers accesses + * both before and after. + */ +void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v) +{ +#ifdef CONFIG_RCU_NOCB_CPU + smp_mb__before_atomic(); /* Up to the caller! */ + atomic_long_add(v, &rsclp->len); + smp_mb__after_atomic(); /* Up to the caller! */ +#else + smp_mb(); /* Up to the caller! */ + WRITE_ONCE(rsclp->len, rsclp->len + v); + smp_mb(); /* Up to the caller! */ +#endif +} + +/* + * Increase the numeric length of an rcu_segcblist structure by one. + * This can cause the ->len field to disagree with the actual number of + * callbacks on the structure. This increase is fully ordered with respect + * to the callers accesses both before and after. + */ +void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp) +{ + rcu_segcblist_add_len(rsclp, 1); +} + +/* + * Exchange the numeric length of the specified rcu_segcblist structure + * with the specified value. This can cause the ->len field to disagree + * with the actual number of callbacks on the structure. 
This exchange is + * fully ordered with respect to the callers accesses both before and after. + */ +long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v) +{ +#ifdef CONFIG_RCU_NOCB_CPU + return atomic_long_xchg(&rsclp->len, v); +#else + long ret = rsclp->len; + + smp_mb(); /* Up to the caller! */ + WRITE_ONCE(rsclp->len, v); + smp_mb(); /* Up to the caller! */ + return ret; +#endif +} + /* * Initialize an rcu_segcblist structure. */ @@ -56,7 +130,7 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp) rsclp->head = NULL; for (i = 0; i < RCU_CBLIST_NSEGS; i++) rsclp->tails[i] = &rsclp->head; - rsclp->len = 0; + rcu_segcblist_set_len(rsclp, 0); rsclp->len_lazy = 0; rsclp->enabled = 1; } @@ -151,7 +225,7 @@ bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp) void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp, struct rcu_head *rhp, bool lazy) { - WRITE_ONCE(rsclp->len, rsclp->len + 1); /* ->len sampled locklessly. */ + rcu_segcblist_inc_len(rsclp); if (lazy) rsclp->len_lazy++; smp_mb(); /* Ensure counts are updated before callback is enqueued. */ @@ -177,7 +251,7 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp, if (rcu_segcblist_n_cbs(rsclp) == 0) return false; - WRITE_ONCE(rsclp->len, rsclp->len + 1); + rcu_segcblist_inc_len(rsclp); if (lazy) rsclp->len_lazy++; smp_mb(); /* Ensure counts are updated before callback is entrained. */ @@ -204,9 +278,8 @@ void rcu_segcblist_extract_count(struct rcu_segcblist *rsclp, struct rcu_cblist *rclp) { rclp->len_lazy += rsclp->len_lazy; - rclp->len += rsclp->len; rsclp->len_lazy = 0; - WRITE_ONCE(rsclp->len, 0); /* ->len sampled locklessly. */ + rclp->len = rcu_segcblist_xchg_len(rsclp, 0); } /* @@ -259,8 +332,7 @@ void rcu_segcblist_insert_count(struct rcu_segcblist *rsclp, struct rcu_cblist *rclp) { rsclp->len_lazy += rclp->len_lazy; - /* ->len sampled locklessly. */ - WRITE_ONCE(rsclp->len, rsclp->len + rclp->len); + rcu_segcblist_add_len(rsclp, rclp->len); rclp->len_lazy = 0; rclp->len = 0; } diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index db38f0a512c4..1ff996647d3c 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -9,6 +9,12 @@ #include +/* Return number of callbacks in the specified callback list. */ +static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp) +{ + return READ_ONCE(rclp->len); +} + /* * Account for the fact that a previously dequeued callback turned out * to be marked as lazy. @@ -42,7 +48,11 @@ static inline bool rcu_segcblist_empty(struct rcu_segcblist *rsclp) /* Return number of callbacks in segmented callback list. */ static inline long rcu_segcblist_n_cbs(struct rcu_segcblist *rsclp) { +#ifdef CONFIG_RCU_NOCB_CPU + return atomic_long_read(&rsclp->len); +#else return READ_ONCE(rsclp->len); +#endif } /* Return number of lazy callbacks in segmented callback list. */ @@ -54,7 +64,7 @@ static inline long rcu_segcblist_n_lazy_cbs(struct rcu_segcblist *rsclp) /* Return number of lazy callbacks in segmented callback list. */ static inline long rcu_segcblist_n_nonlazy_cbs(struct rcu_segcblist *rsclp) { - return rsclp->len - rsclp->len_lazy; + return rcu_segcblist_n_cbs(rsclp) - rsclp->len_lazy; } /* From d1b222c6be1f8bfc77099e034219732ecaeaaf96 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Tue, 2 Jul 2019 16:03:33 -0700 Subject: [PATCH 76/86] rcu/nocb: Add bypass callback queueing Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs takes advantage of unrelated grace periods, thus reducing the memory footprint in the face of floods of call_rcu() invocations. However, the ->cblist field is a more-complex rcu_segcblist structure which must be protected via locking. Even though there are only three entities which can acquire this lock (the CPU invoking call_rcu(), the no-CBs grace-period kthread, and the no-CBs callbacks kthread), the contention on this lock is excessive under heavy stress. This commit therefore greatly reduces contention by provisioning an rcu_cblist structure field named ->nocb_bypass within the rcu_data structure. Each no-CBs CPU is permitted only a limited number of enqueues onto the ->cblist per jiffy, controlled by a new nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to about 16 enqueues per millisecond (16 * 1000 / HZ). When that limit is exceeded, the CPU instead enqueues onto the new ->nocb_bypass. The ->nocb_bypass is flushed into the ->cblist every jiffy or when the number of callbacks on ->nocb_bypass exceeds qhimark, whichever happens first. During call_rcu() floods, this flushing is carried out by the CPU during the course of its call_rcu() invocations. However, a CPU could simply stop invoking call_rcu() at any time. The no-CBs grace-period kthread therefore carries out less-aggressive flushing (every few jiffies or when the number of callbacks on ->nocb_bypass exceeds (2 * qhimark), whichever comes first). This means that the no-CBs grace-period kthread cannot be permitted to do unbounded waits while there are callbacks on ->nocb_bypass. A ->nocb_bypass_timer is used to provide the needed wakeups. [ paulmck: Apply Coverity feedback reported by Colin Ian King. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 30 ++++ kernel/rcu/rcu_segcblist.h | 5 + kernel/rcu/tree.c | 16 +- kernel/rcu/tree.h | 28 +-- kernel/rcu/tree_plugin.h | 359 ++++++++++++++++++++++++++++++++++--- 5 files changed, 396 insertions(+), 42 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index ff431cc83037..495c58ce1640 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -36,6 +36,36 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp) WRITE_ONCE(rclp->len, rclp->len + 1); } +/* + * Flush the second rcu_cblist structure onto the first one, obliterating + * any contents of the first. If rhp is non-NULL, enqueue it as the sole + * element of the second rcu_cblist structure, but ensuring that the second + * rcu_cblist structure, if initially non-empty, always appears non-empty + * throughout the process. If rdp is NULL, the second rcu_cblist structure + * is instead initialized to empty. + */ +void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp, + struct rcu_cblist *srclp, + struct rcu_head *rhp) +{ + drclp->head = srclp->head; + if (drclp->head) + drclp->tail = srclp->tail; + else + drclp->tail = &drclp->head; + drclp->len = srclp->len; + drclp->len_lazy = srclp->len_lazy; + if (!rhp) { + rcu_cblist_init(srclp); + } else { + rhp->next = NULL; + srclp->head = rhp; + srclp->tail = &rhp->next; + WRITE_ONCE(srclp->len, 1); + srclp->len_lazy = 0; + } +} + /* * Dequeue the oldest rcu_head structure from the specified callback * list. 
This function assumes that the callback is non-lazy, but diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 1ff996647d3c..815c2fdd3fcc 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -25,6 +25,10 @@ static inline void rcu_cblist_dequeued_lazy(struct rcu_cblist *rclp) } void rcu_cblist_init(struct rcu_cblist *rclp); +void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp); +void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp, + struct rcu_cblist *srclp, + struct rcu_head *rhp); struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp); /* @@ -92,6 +96,7 @@ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg) return !READ_ONCE(*READ_ONCE(rsclp->tails[seg])); } +void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp); void rcu_segcblist_init(struct rcu_segcblist *rsclp); void rcu_segcblist_disable(struct rcu_segcblist *rsclp); void rcu_segcblist_offload(struct rcu_segcblist *rsclp); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ec320658aeef..457623100d12 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1251,6 +1251,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) unsigned long gp_seq_req; bool ret = false; + rcu_lockdep_assert_cblist_protected(rdp); raw_lockdep_assert_held_rcu_node(rnp); /* If no pending (not yet ready to invoke) callbacks, nothing to do. */ @@ -1292,7 +1293,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, unsigned long c; bool needwake; - lockdep_assert_irqs_disabled(); + rcu_lockdep_assert_cblist_protected(rdp); c = rcu_seq_snap(&rcu_state.gp_seq); if (!rdp->gpwrap && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { /* Old request still live, so mark recent callbacks. */ @@ -1318,6 +1319,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, */ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp) { + rcu_lockdep_assert_cblist_protected(rdp); raw_lockdep_assert_held_rcu_node(rnp); /* If no pending (not yet ready to invoke) callbacks, nothing to do. */ @@ -1341,6 +1343,7 @@ static bool rcu_advance_cbs(struct rcu_node *rnp, struct rcu_data *rdp) static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, struct rcu_data *rdp) { + rcu_lockdep_assert_cblist_protected(rdp); if (!rcu_seq_state(rcu_seq_current(&rnp->gp_seq)) || !raw_spin_trylock_rcu_node(rnp)) return; @@ -2187,7 +2190,9 @@ static void rcu_do_batch(struct rcu_data *rdp) * The following usually indicates a double call_rcu(). To track * this down, try building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y. */ - WARN_ON_ONCE(rcu_segcblist_empty(&rdp->cblist) != (count == 0)); + WARN_ON_ONCE(count == 0 && !rcu_segcblist_empty(&rdp->cblist)); + WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) && + count != 0 && rcu_segcblist_empty(&rdp->cblist)); rcu_nocb_unlock_irqrestore(rdp, flags); @@ -2564,8 +2569,9 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func, bool lazy) if (rcu_segcblist_empty(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); } - rcu_nocb_lock(rdp); - was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) + return; // Enqueued onto ->nocb_bypass, so just leave. + /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. 
*/ rcu_segcblist_enqueue(&rdp->cblist, head, lazy); if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rcu_state.name, head, @@ -2839,6 +2845,7 @@ static void rcu_barrier_func(void *unused) rdp->barrier_head.func = rcu_barrier_callback; debug_rcu_head_queue(&rdp->barrier_head); rcu_nocb_lock(rdp); + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head, 0)) { atomic_inc(&rcu_state.barrier_cpu_count); } else { @@ -3192,6 +3199,7 @@ void rcutree_migrate_callbacks(int cpu) my_rdp = this_cpu_ptr(&rcu_data); my_rnp = my_rdp->mynode; rcu_nocb_lock(my_rdp); /* irqs already disabled. */ + WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies)); raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ /* Leverage recent GPs and set GP for new callbacks. */ needwake = rcu_advance_cbs(my_rnp, rdp) || diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 2c3e9068671c..e4df86db8137 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -200,18 +200,26 @@ struct rcu_data { atomic_t nocb_lock_contended; /* Contention experienced. */ int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ struct timer_list nocb_timer; /* Enforce finite deferral. */ + unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */ + + /* The following fields are used by call_rcu, hence own cacheline. */ + raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp; + struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */ + unsigned long nocb_bypass_first; /* Time (jiffies) of first enqueue. */ + unsigned long nocb_nobypass_last; /* Last ->cblist enqueue (jiffies). */ + int nocb_nobypass_count; /* # ->cblist enqueues at ^^^ time. */ /* The following fields are used by GP kthread, hence own cacheline. */ raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp; - bool nocb_gp_sleep; - /* Is the nocb GP thread asleep? */ + struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */ + bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ struct task_struct *nocb_cb_kthread; struct rcu_data *nocb_next_cb_rdp; /* Next rcu_data in wakeup chain. */ - /* The following fields are used by CB kthread, hence new cachline. */ + /* The following fields are used by CB kthread, hence new cacheline. */ struct rcu_data *nocb_gp_rdp ____cacheline_internodealigned_in_smp; /* GP rdp takes GP-end wakeups. 
*/ #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ @@ -419,6 +427,10 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); static void rcu_init_one_nocb(struct rcu_node *rnp); +static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + unsigned long j); +static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + bool *was_alldone, unsigned long flags); static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, unsigned long flags); static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp); @@ -430,19 +442,15 @@ static void rcu_nocb_lock(struct rcu_data *rdp); static void rcu_nocb_unlock(struct rcu_data *rdp); static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, unsigned long flags); +static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp); #ifdef CONFIG_RCU_NOCB_CPU static void __init rcu_organize_nocb_kthreads(void); #define rcu_nocb_lock_irqsave(rdp, flags) \ do { \ - if (!rcu_segcblist_is_offloaded(&(rdp)->cblist)) { \ + if (!rcu_segcblist_is_offloaded(&(rdp)->cblist)) \ local_irq_save(flags); \ - } else if (!raw_spin_trylock_irqsave(&(rdp)->nocb_lock, (flags))) {\ - atomic_inc(&(rdp)->nocb_lock_contended); \ - smp_mb__after_atomic(); /* atomic_inc() before lock. */ \ + else \ raw_spin_lock_irqsave(&(rdp)->nocb_lock, (flags)); \ - smp_mb__before_atomic(); /* atomic_dec() after lock. */ \ - atomic_dec(&(rdp)->nocb_lock_contended); \ - } \ } while (0) #else /* #ifdef CONFIG_RCU_NOCB_CPU */ #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 4e49bb359464..12b14d7a2cf2 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1497,19 +1497,26 @@ static int __init parse_rcu_nocb_poll(char *arg) early_param("rcu_nocb_poll", parse_rcu_nocb_poll); /* - * Acquire the specified rcu_data structure's ->nocb_lock, but only - * if it corresponds to a no-CBs CPU. If the lock isn't immediately - * available, increment ->nocb_lock_contended to flag the contention. + * Don't bother bypassing ->cblist if the call_rcu() rate is low. + * After all, the main point of bypassing is to avoid lock contention + * on ->nocb_lock, which only can happen at high call_rcu() rates. */ -static void rcu_nocb_lock(struct rcu_data *rdp) +int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ; +module_param(nocb_nobypass_lim_per_jiffy, int, 0); + +/* + * Acquire the specified rcu_data structure's ->nocb_bypass_lock. If the + * lock isn't immediately available, increment ->nocb_lock_contended to + * flag the contention. + */ +static void rcu_nocb_bypass_lock(struct rcu_data *rdp) { lockdep_assert_irqs_disabled(); - if (!rcu_segcblist_is_offloaded(&rdp->cblist) || - raw_spin_trylock(&rdp->nocb_lock)) + if (raw_spin_trylock(&rdp->nocb_bypass_lock)) return; atomic_inc(&rdp->nocb_lock_contended); smp_mb__after_atomic(); /* atomic_inc() before lock. */ - raw_spin_lock(&rdp->nocb_lock); + raw_spin_lock(&rdp->nocb_bypass_lock); smp_mb__before_atomic(); /* atomic_dec() after lock. */ atomic_dec(&rdp->nocb_lock_contended); } @@ -1530,6 +1537,37 @@ static void rcu_nocb_wait_contended(struct rcu_data *rdp) cpu_relax(); } +/* + * Conditionally acquire the specified rcu_data structure's + * ->nocb_bypass_lock. 
+ */ +static bool rcu_nocb_bypass_trylock(struct rcu_data *rdp) +{ + lockdep_assert_irqs_disabled(); + return raw_spin_trylock(&rdp->nocb_bypass_lock); +} + +/* + * Release the specified rcu_data structure's ->nocb_bypass_lock. + */ +static void rcu_nocb_bypass_unlock(struct rcu_data *rdp) +{ + lockdep_assert_irqs_disabled(); + raw_spin_unlock(&rdp->nocb_bypass_lock); +} + +/* + * Acquire the specified rcu_data structure's ->nocb_lock, but only + * if it corresponds to a no-CBs CPU. + */ +static void rcu_nocb_lock(struct rcu_data *rdp) +{ + lockdep_assert_irqs_disabled(); + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) + return; + raw_spin_lock(&rdp->nocb_lock); +} + /* * Release the specified rcu_data structure's ->nocb_lock, but only * if it corresponds to a no-CBs CPU. @@ -1557,6 +1595,15 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, } } +/* Lockdep check that ->cblist may be safely accessed. */ +static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp) +{ + lockdep_assert_irqs_disabled(); + if (rcu_segcblist_is_offloaded(&rdp->cblist) && + cpu_online(rdp->cpu)) + lockdep_assert_held(&rdp->nocb_lock); +} + /* * Wake up any no-CBs CPUs' kthreads that were waiting on the just-ended * grace period. @@ -1593,24 +1640,27 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force, unsigned long flags) __releases(rdp->nocb_lock) { + bool needwake = false; struct rcu_data *rdp_gp = rdp->nocb_gp_rdp; lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("AlreadyAwake")); rcu_nocb_unlock_irqrestore(rdp, flags); return; } - if (READ_ONCE(rdp_gp->nocb_gp_sleep) || force) { - del_timer(&rdp->nocb_timer); - rcu_nocb_unlock_irqrestore(rdp, flags); - smp_mb(); /* enqueue before ->nocb_gp_sleep. */ - raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + del_timer(&rdp->nocb_timer); + rcu_nocb_unlock_irqrestore(rdp, flags); + raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); + if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { WRITE_ONCE(rdp_gp->nocb_gp_sleep, false); - raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); - wake_up_process(rdp_gp->nocb_gp_kthread); - } else { - rcu_nocb_unlock_irqrestore(rdp, flags); + needwake = true; + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DoWake")); } + raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags); + if (needwake) + wake_up_process(rdp_gp->nocb_gp_kthread); } /* @@ -1627,6 +1677,189 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason); } +/* + * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL. + * However, if there is a callback to be enqueued and if ->nocb_bypass + * proves to be initially empty, just return false because the no-CB GP + * kthread may need to be awakened in this case. + * + * Note that this function always returns true if rhp is NULL. + */ +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + unsigned long j) +{ + struct rcu_cblist rcl; + + WARN_ON_ONCE(!rcu_segcblist_is_offloaded(&rdp->cblist)); + rcu_lockdep_assert_cblist_protected(rdp); + lockdep_assert_held(&rdp->nocb_bypass_lock); + if (rhp && !rcu_cblist_n_cbs(&rdp->nocb_bypass)) { + raw_spin_unlock(&rdp->nocb_bypass_lock); + return false; + } + /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */ + if (rhp) + rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. 
*/ + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); + rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); + WRITE_ONCE(rdp->nocb_bypass_first, j); + rcu_nocb_bypass_unlock(rdp); + return true; +} + +/* + * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL. + * However, if there is a callback to be enqueued and if ->nocb_bypass + * proves to be initially empty, just return false because the no-CB GP + * kthread may need to be awakened in this case. + * + * Note that this function always returns true if rhp is NULL. + */ +static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + unsigned long j) +{ + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) + return true; + rcu_lockdep_assert_cblist_protected(rdp); + rcu_nocb_bypass_lock(rdp); + return rcu_nocb_do_flush_bypass(rdp, rhp, j); +} + +/* + * If the ->nocb_bypass_lock is immediately available, flush the + * ->nocb_bypass queue into ->cblist. + */ +static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) +{ + rcu_lockdep_assert_cblist_protected(rdp); + if (!rcu_segcblist_is_offloaded(&rdp->cblist) || + !rcu_nocb_bypass_trylock(rdp)) + return; + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j)); +} + +/* + * See whether it is appropriate to use the ->nocb_bypass list in order + * to control contention on ->nocb_lock. A limited number of direct + * enqueues are permitted into ->cblist per jiffy. If ->nocb_bypass + * is non-empty, further callbacks must be placed into ->nocb_bypass, + * otherwise rcu_barrier() breaks. Use rcu_nocb_flush_bypass() to switch + * back to direct use of ->cblist. However, ->nocb_bypass should not be + * used if ->cblist is empty, because otherwise callbacks can be stranded + * on ->nocb_bypass because we cannot count on the current CPU ever again + * invoking call_rcu(). The general rule is that if ->nocb_bypass is + * non-empty, the corresponding no-CBs grace-period kthread must not be + * in an indefinite sleep state. + * + * Finally, it is not permitted to use the bypass during early boot, + * as doing so would confuse the auto-initialization code. Besides + * which, there is no point in worrying about lock contention while + * there is only one CPU in operation. + */ +static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + bool *was_alldone, unsigned long flags) +{ + unsigned long c; + unsigned long cur_gp_seq; + unsigned long j = jiffies; + long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + + if (!rcu_segcblist_is_offloaded(&rdp->cblist)) { + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + return false; /* Not offloaded, no bypassing. */ + } + lockdep_assert_irqs_disabled(); + + // Don't use ->nocb_bypass during early boot. + if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) { + rcu_nocb_lock(rdp); + WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + return false; + } + + // If we have advanced to a new jiffy, reset counts to allow + // moving back from ->nocb_bypass to ->cblist. 
+ if (j == rdp->nocb_nobypass_last) { + c = rdp->nocb_nobypass_count + 1; + } else { + WRITE_ONCE(rdp->nocb_nobypass_last, j); + c = rdp->nocb_nobypass_count - nocb_nobypass_lim_per_jiffy; + if (ULONG_CMP_LT(rdp->nocb_nobypass_count, + nocb_nobypass_lim_per_jiffy)) + c = 0; + else if (c > nocb_nobypass_lim_per_jiffy) + c = nocb_nobypass_lim_per_jiffy; + } + WRITE_ONCE(rdp->nocb_nobypass_count, c); + + // If there hasn't yet been all that many ->cblist enqueues + // this jiffy, tell the caller to enqueue onto ->cblist. But flush + // ->nocb_bypass first. + if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) { + rcu_nocb_lock(rdp); + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + if (*was_alldone) + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("FirstQ")); + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j)); + WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); + return false; // Caller must enqueue the callback. + } + + // If ->nocb_bypass has been used too long or is too full, + // flush ->nocb_bypass to ->cblist. + if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || + ncbs >= qhimark) { + rcu_nocb_lock(rdp); + if (!rcu_nocb_flush_bypass(rdp, rhp, j)) { + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + if (*was_alldone) + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("FirstQ")); + WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); + return false; // Caller must enqueue the callback. + } + if (j != rdp->nocb_gp_adv_time && + rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) && + rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) { + rcu_advance_cbs_nowake(rdp->mynode, rdp); + rdp->nocb_gp_adv_time = j; + } + rcu_nocb_unlock_irqrestore(rdp, flags); + return true; // Callback already enqueued. + } + + // We need to use the bypass. + rcu_nocb_wait_contended(rdp); + rcu_nocb_bypass_lock(rdp); + ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ + rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); + if (!ncbs) { + WRITE_ONCE(rdp->nocb_bypass_first, j); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ")); + } + rcu_nocb_bypass_unlock(rdp); + smp_mb(); /* Order enqueue before wake. */ + if (ncbs) { + local_irq_restore(flags); + } else { + // No-CBs GP kthread might be indefinitely asleep, if so, wake. + rcu_nocb_lock(rdp); // Rare during call_rcu() flood. + if (!rcu_segcblist_pend_cbs(&rdp->cblist)) { + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("FirstBQwake")); + __call_rcu_nocb_wake(rdp, true, flags); + } else { + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("FirstBQnoWake")); + rcu_nocb_unlock_irqrestore(rdp, flags); + } + } + return true; // Callback already enqueued. +} + /* * Awaken the no-CBs grace-period kthead if needed, either due to it * legitimately being asleep or due to overload conditions. @@ -1685,23 +1918,33 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); } - if (!irqs_disabled_flags(flags)) { - lockdep_assert_irqs_enabled(); - rcu_nocb_wait_contended(rdp); - } return; } +/* Wake up the no-CBs GP kthread to flush ->nocb_bypass. 
*/ +static void do_nocb_bypass_wakeup_timer(struct timer_list *t) +{ + unsigned long flags; + struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer); + + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer")); + rcu_nocb_lock_irqsave(rdp, flags); + __call_rcu_nocb_wake(rdp, true, flags); +} + /* * No-CBs GP kthreads come here to wait for additional callbacks to show up * or for grace periods to end. */ static void nocb_gp_wait(struct rcu_data *my_rdp) { + bool bypass = false; + long bypass_ncbs; int __maybe_unused cpu = my_rdp->cpu; unsigned long cur_gp_seq; unsigned long flags; bool gotcbs; + unsigned long j = jiffies; bool needwait_gp = false; // This prevents actual uninitialized use. bool needwake; bool needwake_gp; @@ -1715,21 +1958,50 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) * and the global grace-period kthread are awakened if needed. */ for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) { - if (rcu_segcblist_empty(&rdp->cblist)) - continue; /* No callbacks here, try next. */ - rnp = rdp->mynode; + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); rcu_nocb_lock_irqsave(rdp, flags); - WRITE_ONCE(my_rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); - del_timer(&my_rdp->nocb_timer); - raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ - needwake_gp = rcu_advance_cbs(rnp, rdp); - raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + if (bypass_ncbs && + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) || + bypass_ncbs > 2 * qhimark)) { + // Bypass full or old, so flush it. + (void)rcu_nocb_try_flush_bypass(rdp, j); + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { + rcu_nocb_unlock_irqrestore(rdp, flags); + continue; /* No callbacks here, try next. */ + } + if (bypass_ncbs) { + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("Bypass")); + bypass = true; + } + rnp = rdp->mynode; + if (bypass) { // Avoid race with first bypass CB. + WRITE_ONCE(my_rdp->nocb_defer_wakeup, + RCU_NOCB_WAKE_NOT); + del_timer(&my_rdp->nocb_timer); + } + // Advance callbacks if helpful and low contention. + needwake_gp = false; + if (!rcu_segcblist_restempty(&rdp->cblist, + RCU_NEXT_READY_TAIL) || + (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) && + rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) { + raw_spin_lock_rcu_node(rnp); /* irqs disabled. */ + needwake_gp = rcu_advance_cbs(rnp, rdp); + raw_spin_unlock_rcu_node(rnp); /* irqs disabled. */ + } // Need to wait on some grace period? + WARN_ON_ONCE(!rcu_segcblist_restempty(&rdp->cblist, + RCU_NEXT_READY_TAIL)); if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) { if (!needwait_gp || ULONG_CMP_LT(cur_gp_seq, wait_gp_seq)) wait_gp_seq = cur_gp_seq; needwait_gp = true; + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, + TPS("NeedWaitGP")); } if (rcu_segcblist_ready_cbs(&rdp->cblist)) { needwake = rdp->nocb_cb_sleep; @@ -1747,6 +2019,13 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) rcu_gp_kthread_wake(); } + if (bypass && !rcu_nocb_poll) { + // At least one child with non-empty ->nocb_bypass, so set + // timer in order to avoid stranding its callbacks. + raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags); + mod_timer(&my_rdp->nocb_bypass_timer, j + 2); + raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags); + } if (rcu_nocb_poll) { /* Polling, so trace if first poll in the series. 
*/ if (gotcbs) @@ -1757,6 +2036,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep")); swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq, !READ_ONCE(my_rdp->nocb_gp_sleep)); + trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("EndSleep")); } else { rnp = my_rdp->mynode; trace_rcu_this_gp(rnp, my_rdp, wait_gp_seq, TPS("StartWait")); @@ -1768,6 +2048,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) } if (!rcu_nocb_poll) { raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags); + if (bypass) + del_timer(&my_rdp->nocb_bypass_timer); WRITE_ONCE(my_rdp->nocb_gp_sleep, true); raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags); } @@ -1949,8 +2231,11 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) init_swait_queue_head(&rdp->nocb_cb_wq); init_swait_queue_head(&rdp->nocb_gp_wq); raw_spin_lock_init(&rdp->nocb_lock); + raw_spin_lock_init(&rdp->nocb_bypass_lock); raw_spin_lock_init(&rdp->nocb_gp_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); + timer_setup(&rdp->nocb_bypass_timer, do_nocb_bypass_wakeup_timer, 0); + rcu_cblist_init(&rdp->nocb_bypass); } /* @@ -2094,6 +2379,12 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, local_irq_restore(flags); } +/* Lockdep check that ->cblist may be safely accessed. */ +static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp) +{ + lockdep_assert_irqs_disabled(); +} + static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq) { } @@ -2107,6 +2398,18 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) { } +static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + unsigned long j) +{ + return true; +} + +static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, + bool *was_alldone, unsigned long flags) +{ + return false; +} + static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, unsigned long flags) { From 6aacd88d1721e12b013ae4ccf4f17609bd5091f3 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 13 Jul 2019 12:27:03 -0700 Subject: [PATCH 77/86] rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 12b14d7a2cf2..97c730753a6d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1515,6 +1515,7 @@ static void rcu_nocb_bypass_lock(struct rcu_data *rdp) if (raw_spin_trylock(&rdp->nocb_bypass_lock)) return; atomic_inc(&rdp->nocb_lock_contended); + WARN_ON_ONCE(smp_processor_id() != rdp->cpu); smp_mb__after_atomic(); /* atomic_inc() before lock. */ raw_spin_lock(&rdp->nocb_bypass_lock); smp_mb__before_atomic(); /* atomic_dec() after lock. */ @@ -1533,7 +1534,8 @@ static void rcu_nocb_bypass_lock(struct rcu_data *rdp) */ static void rcu_nocb_wait_contended(struct rcu_data *rdp) { - while (atomic_read(&rdp->nocb_lock_contended)) + WARN_ON_ONCE(smp_processor_id() != rdp->cpu); + while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))) cpu_relax(); } From f7a81b12d6af42a9d09be1e5f041169f04b0b67a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 25 Jun 2019 13:32:51 -0700 Subject: [PATCH 78/86] rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed This commit causes locking, sleeping, and callback state to be printed for no-CBs CPUs when the rcutorture writer is delayed sufficiently for rcutorture to complain. 
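The state dumps added below compress each flag into a single character using the "xX"[condition] idiom, in which index 0 selects the lower-case (inactive) character and index 1 the upper-case (active) one. A minimal standalone illustration of the idiom (ordinary userspace C, not kernel code):

	#include <stdio.h>

	int main(void)
	{
		int locked = 1;
		int sleeping = 0;

		/* Index a two-character string literal with 0 or 1. */
		printf("%c%c\n", "lL"[!!locked], "sS"[!!sleeping]);	/* prints "Ls" */
		return 0;
	}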
Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 1 + kernel/rcu/tree.h | 7 +++- kernel/rcu/tree_plugin.h | 82 ++++++++++++++++++++++++++++++++++++++++ kernel/rcu/tree_stall.h | 5 +++ 4 files changed, 94 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index b22947324423..3c9feca1eab1 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2176,6 +2176,7 @@ rcu_torture_cleanup(void) return; } + show_rcu_gp_kthreads(); rcu_torture_barrier_cleanup(); torture_stop_kthread(rcu_torture_fwd_prog, fwd_prog_task); torture_stop_kthread(rcu_torture_stall, stall_task); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index e4df86db8137..c612f306fe89 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -212,7 +212,11 @@ struct rcu_data { /* The following fields are used by GP kthread, hence own cacheline. */ raw_spinlock_t nocb_gp_lock ____cacheline_internodealigned_in_smp; struct timer_list nocb_bypass_timer; /* Force nocb_bypass flush. */ - bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ + u8 nocb_gp_sleep; /* Is the nocb GP thread asleep? */ + u8 nocb_gp_bypass; /* Found a bypass on last scan? */ + u8 nocb_gp_gp; /* GP to wait for on last scan? */ + unsigned long nocb_gp_seq; /* If so, ->gp_seq to wait for. */ + unsigned long nocb_gp_loops; /* # passes through wait code. */ struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ bool nocb_cb_sleep; /* Is the nocb CB thread asleep? */ struct task_struct *nocb_cb_kthread; @@ -438,6 +442,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp); static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp); static void rcu_spawn_cpu_nocb_kthread(int cpu); static void __init rcu_spawn_nocb_kthreads(void); +static void show_rcu_nocb_state(struct rcu_data *rdp); static void rcu_nocb_lock(struct rcu_data *rdp); static void rcu_nocb_unlock(struct rcu_data *rdp); static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 97c730753a6d..25a53742ca68 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2021,6 +2021,9 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) rcu_gp_kthread_wake(); } + my_rdp->nocb_gp_bypass = bypass; + my_rdp->nocb_gp_gp = needwait_gp; + my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0; if (bypass && !rcu_nocb_poll) { // At least one child with non-empty ->nocb_bypass, so set // timer in order to avoid stranding its callbacks. @@ -2055,6 +2058,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) WRITE_ONCE(my_rdp->nocb_gp_sleep, true); raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags); } + my_rdp->nocb_gp_seq = -1; WARN_ON(signal_pending(current)); } @@ -2071,6 +2075,7 @@ static int rcu_nocb_gp_kthread(void *arg) struct rcu_data *rdp = arg; for (;;) { + WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1); nocb_gp_wait(rdp); cond_resched_tasks_rcu_qs(); } @@ -2362,6 +2367,79 @@ void rcu_bind_current_to_nocb(void) } EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb); +/* + * Dump out nocb grace-period kthread state for the specified rcu_data + * structure. 
+ */ +static void show_rcu_nocb_gp_state(struct rcu_data *rdp) +{ + struct rcu_node *rnp = rdp->mynode; + + pr_info("nocb GP %d %c%c%c%c%c%c %c[%c%c] %c%c:%ld rnp %d:%d %lu\n", + rdp->cpu, + "kK"[!!rdp->nocb_gp_kthread], + "lL"[raw_spin_is_locked(&rdp->nocb_gp_lock)], + "dD"[!!rdp->nocb_defer_wakeup], + "tT"[timer_pending(&rdp->nocb_timer)], + "bB"[timer_pending(&rdp->nocb_bypass_timer)], + "sS"[!!rdp->nocb_gp_sleep], + ".W"[swait_active(&rdp->nocb_gp_wq)], + ".W"[swait_active(&rnp->nocb_gp_wq[0])], + ".W"[swait_active(&rnp->nocb_gp_wq[1])], + ".B"[!!rdp->nocb_gp_bypass], + ".G"[!!rdp->nocb_gp_gp], + (long)rdp->nocb_gp_seq, + rnp->grplo, rnp->grphi, READ_ONCE(rdp->nocb_gp_loops)); +} + +/* Dump out nocb kthread state for the specified rcu_data structure. */ +static void show_rcu_nocb_state(struct rcu_data *rdp) +{ + struct rcu_segcblist *rsclp = &rdp->cblist; + bool waslocked; + bool wastimer; + bool wassleep; + + if (rdp->nocb_gp_rdp == rdp) + show_rcu_nocb_gp_state(rdp); + + pr_info(" CB %d->%d %c%c%c%c%c%c F%ld L%ld C%d %c%c%c%c%c q%ld\n", + rdp->cpu, rdp->nocb_gp_rdp->cpu, + "kK"[!!rdp->nocb_cb_kthread], + "bB"[raw_spin_is_locked(&rdp->nocb_bypass_lock)], + "cC"[!!atomic_read(&rdp->nocb_lock_contended)], + "lL"[raw_spin_is_locked(&rdp->nocb_lock)], + "sS"[!!rdp->nocb_cb_sleep], + ".W"[swait_active(&rdp->nocb_cb_wq)], + jiffies - rdp->nocb_bypass_first, + jiffies - rdp->nocb_nobypass_last, + rdp->nocb_nobypass_count, + ".D"[rcu_segcblist_ready_cbs(rsclp)], + ".W"[!rcu_segcblist_restempty(rsclp, RCU_DONE_TAIL)], + ".R"[!rcu_segcblist_restempty(rsclp, RCU_WAIT_TAIL)], + ".N"[!rcu_segcblist_restempty(rsclp, RCU_NEXT_READY_TAIL)], + ".B"[!!rcu_cblist_n_cbs(&rdp->nocb_bypass)], + rcu_segcblist_n_cbs(&rdp->cblist)); + + /* It is OK for GP kthreads to have GP state. */ + if (rdp->nocb_gp_rdp == rdp) + return; + + waslocked = raw_spin_is_locked(&rdp->nocb_gp_lock); + wastimer = timer_pending(&rdp->nocb_timer); + wassleep = swait_active(&rdp->nocb_gp_wq); + if (!rdp->nocb_defer_wakeup && !rdp->nocb_gp_sleep && + !waslocked && !wastimer && !wassleep) + return; /* Nothing untowards. */ + + pr_info(" !!! %c%c%c%c %c\n", + "lL"[waslocked], + "dD"[!!rdp->nocb_defer_wakeup], + "tT"[wastimer], + "sS"[!!rdp->nocb_gp_sleep], + ".W"[wassleep]); +} + #else /* #ifdef CONFIG_RCU_NOCB_CPU */ /* No ->nocb_lock to acquire. */ @@ -2439,6 +2517,10 @@ static void __init rcu_spawn_nocb_kthreads(void) { } +static void show_rcu_nocb_state(struct rcu_data *rdp) +{ +} + #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ /* diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 0627a66699a6..841ab43f3e60 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -589,6 +589,11 @@ void show_rcu_gp_kthreads(void) cpu, (long)rdp->gp_seq_needed); } } + for_each_possible_cpu(cpu) { + rdp = per_cpu_ptr(&rcu_data, cpu); + if (rcu_segcblist_is_offloaded(&rdp->cblist)) + show_rcu_nocb_state(rdp); + } /* sched_show_task(rcu_state.gp_kthread); */ } EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads); From 273f034065002bf9480601d66404c991b243b91e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 9 Jul 2019 06:54:42 -0700 Subject: [PATCH 79/86] rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake() When callbacks are in full flow, the common case is waiting for a grace period, and this grace period will normally take a few jiffies to complete. It therefore isn't all that helpful for __call_rcu_nocb_wake() to do a synchronous wakeup in this case. 
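Schematically, the overloaded path stops waking the grace-period kthread synchronously via wake_nocb_gp() and instead ends up doing only (simplified from the diff below):

	wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE,
			   TPS("WakeOvfIsDeferred"));
	rcu_nocb_unlock_irqrestore(rdp, flags);

Here wake_nocb_gp_defer() records the reason and relies on the rdp->nocb_timer deferral mechanism to do the actual wakeup shortly thereafter.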
This commit therefore turns this into a timer-based deferred wakeup of the no-CBs grace-period kthread. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 19 +++++-------------- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 25a53742ca68..4b59ef1cbc8b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1900,22 +1900,13 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ rdp->qlen_last_fqs_check = len; - if (!rdp->nocb_cb_sleep && - rcu_segcblist_ready_cbs(&rdp->cblist)) { - // Already going full tilt, so don't try to rewake. - rcu_nocb_unlock_irqrestore(rdp, flags); - } else { + if (rdp->nocb_cb_sleep || + !rcu_segcblist_ready_cbs(&rdp->cblist)) { rcu_advance_cbs_nowake(rdp->mynode, rdp); - if (!irqs_disabled_flags(flags)) { - wake_nocb_gp(rdp, false, flags); - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, - TPS("WakeOvf")); - } else { - wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, - TPS("WakeOvfIsDeferred")); - rcu_nocb_unlock_irqrestore(rdp, flags); - } + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, + TPS("WakeOvfIsDeferred")); } + rcu_nocb_unlock_irqrestore(rdp, flags); } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); From 23651d9b9616060cf86af5e3b15defcf3bcd2642 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 10 Jul 2019 12:54:56 -0700 Subject: [PATCH 80/86] rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks() The rcutree_migrate_callbacks() function invokes rcu_advance_cbs() on both the offlined CPU's ->cblist and that of the surviving CPU, then merges them. However, after the merge, any of the offlined CPU's callbacks that were not ready to be invoked will no longer be associated with a grace-period number. This commit therefore invokes rcu_advance_cbs() one more time on the merged ->cblist in order to assign a grace-period number to these callbacks. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 457623100d12..3e89b5b83ea0 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3205,6 +3205,7 @@ void rcutree_migrate_callbacks(int cpu) needwake = rcu_advance_cbs(my_rnp, rdp) || rcu_advance_cbs(my_rnp, my_rdp); rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist); + needwake = needwake || rcu_advance_cbs(my_rnp, my_rdp); rcu_segcblist_disable(&rdp->cblist); WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) != !rcu_segcblist_n_cbs(&my_rdp->cblist)); From 1d5a81c18dc68fc38a52e8dab1992a043a358927 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 15 Jul 2019 01:09:04 -0700 Subject: [PATCH 81/86] rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention Currently, nocb_cb_wait() advances callbacks on each pass through its loop, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed. Note that nocb_cb_wait() doesn't worry about callbacks that have not yet been assigned a grace period. The idea is that the only reason for nocb_cb_wait() to advance callbacks is to allow it to continue invoking callbacks.
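The contention-avoidance pattern here, namely doing a cheap lock-free check for completed work before even try-locking the contended lock, can be pictured with the following userspace C sketch (the names and the plain "<=" sequence comparison are simplified stand-ins for the kernel's rcu_segcblist_nextgp()/rcu_seq_done() checks, not their actual semantics):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch only: check for ready work before touching the lock. */
    static pthread_mutex_t leaf_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Is there a queued callback whose grace period has already completed? */
    static bool have_done_callbacks(unsigned long cb_gp_seq, unsigned long done_seq)
    {
        return cb_gp_seq != 0 && cb_gp_seq <= done_seq;
    }

    static void maybe_advance(unsigned long cb_gp_seq, unsigned long done_seq)
    {
        if (!have_done_callbacks(cb_gp_seq, done_seq))
            return;                              /* skip the lock entirely */
        if (pthread_mutex_trylock(&leaf_lock) == 0) {
            printf("advancing callbacks for GP %lu\n", cb_gp_seq);
            pthread_mutex_unlock(&leaf_lock);
        }
    }

    int main(void)
    {
        maybe_advance(4, 8);    /* GP 4 already done: worth taking the lock */
        maybe_advance(12, 8);   /* GP 12 still pending: no lock traffic at all */
        return 0;
    }

In nocb_cb_wait() itself (diff below), the cheap check compares rcu_segcblist_nextgp() on ->cblist against the leaf rcu_node structure's ->gp_seq via rcu_seq_done(), so raw_spin_trylock_rcu_node() is attempted only when there is actually something to advance.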
Time will tell whether this is the correct choice. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 4b59ef1cbc8b..f6f23a16bd64 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2079,6 +2079,7 @@ static int rcu_nocb_gp_kthread(void *arg) */ static void nocb_cb_wait(struct rcu_data *rdp) { + unsigned long cur_gp_seq; unsigned long flags; bool needwake_gp = false; struct rcu_node *rnp = rdp->mynode; @@ -2091,7 +2092,9 @@ static void nocb_cb_wait(struct rcu_data *rdp) local_bh_enable(); lockdep_assert_irqs_enabled(); rcu_nocb_lock_irqsave(rdp, flags); - if (raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */ + if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) && + rcu_seq_done(&rnp->gp_seq, cur_gp_seq) && + raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */ needwake_gp = rcu_advance_cbs(rdp->mynode, rdp); raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ } From 296181d78df9892e08e794f2a9a4d2c38f9acedb Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 15 Jul 2019 06:06:40 -0700 Subject: [PATCH 82/86] rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention Currently, __call_rcu_nocb_wake() advances callbacks each time that it detects excessive numbers of callbacks, though only if it succeeds in conditionally acquiring its leaf rcu_node structure's ->lock. Despite the conditional acquisition of ->lock, this does increase contention. This commit therefore avoids advancing callbacks unless there are callbacks in ->cblist whose grace period has completed and advancing has not yet been done during this jiffy. Note that this decision does not take the presence of new callbacks into account. That is because on this code path, there will always be at least one new callback, namely the one we just enqueued. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index f6f23a16bd64..f56fb4e97a8e 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1872,6 +1872,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, unsigned long flags) __releases(rdp->nocb_lock) { + unsigned long cur_gp_seq; + unsigned long j; long len; struct task_struct *t; @@ -1900,12 +1902,17 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, } else if (len > rdp->qlen_last_fqs_check + qhimark) { /* ... or if many callbacks queued. */ rdp->qlen_last_fqs_check = len; - if (rdp->nocb_cb_sleep || - !rcu_segcblist_ready_cbs(&rdp->cblist)) { + j = jiffies; + if (j != rdp->nocb_gp_adv_time && + rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) && + rcu_seq_done(&rdp->mynode->gp_seq, cur_gp_seq)) { rcu_advance_cbs_nowake(rdp->mynode, rdp); + rdp->nocb_gp_adv_time = j; + } + if (rdp->nocb_cb_sleep || + !rcu_segcblist_ready_cbs(&rdp->cblist)) wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); - } rcu_nocb_unlock_irqrestore(rdp, flags); } else { trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); From f48fe4c586604c3a09938c6a6e9fd3356dfe8f3c Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Tue, 16 Jul 2019 02:17:00 -0700 Subject: [PATCH 83/86] rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload When under overload conditions, __call_rcu_nocb_wake() will wake the no-CBs GP kthread any time the no-CBs CB kthread is asleep or there are no ready-to-invoke callbacks, but only after a timer delay. If the no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup from __call_rcu_nocb_wake() is redundant. This commit therefore makes __call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if ->nocb_bypass_timer is pending. This requires adding a bit of ordering of timer actions. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index f56fb4e97a8e..2defc7fe74c3 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1909,8 +1909,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, rcu_advance_cbs_nowake(rdp->mynode, rdp); rdp->nocb_gp_adv_time = j; } - if (rdp->nocb_cb_sleep || - !rcu_segcblist_ready_cbs(&rdp->cblist)) + smp_mb(); /* Enqueue before timer_pending(). */ + if ((rdp->nocb_cb_sleep || + !rcu_segcblist_ready_cbs(&rdp->cblist)) && + !timer_pending(&rdp->nocb_bypass_timer)) wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_FORCE, TPS("WakeOvfIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); @@ -1929,6 +1931,7 @@ static void do_nocb_bypass_wakeup_timer(struct timer_list *t) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer")); rcu_nocb_lock_irqsave(rdp, flags); + smp_mb__after_spinlock(); /* Timer expire before wakeup. */ __call_rcu_nocb_wake(rdp, true, flags); } From cfcdef5e30469f3f2d6786ad35fc3fdef2a3833f Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 24 Jul 2019 18:07:52 -0700 Subject: [PATCH 84/86] rcu: Allow rcu_do_batch() to dynamically adjust batch sizes Bimodal behavior of rcu_do_batch() is not really suited to Google applications like gfe servers. When a process with millions of sockets exits, closing all files queues two rcu callbacks per socket. This eventually reaches the point where RCU enters an emergency mode, where rcu_do_batch() do not return until whole queue is flushed. Each rcu callback lasts at least 70 nsec, so with millions of elements, we easily spend more than 100 msec without rescheduling. Goal of this patch is to avoid the infamous message like following "need_resched set for > 51999388 ns (52 ticks) without schedule" We dynamically adjust the number of elements we process, instead of 10 / INFINITE choices, we use a floor of ~1 % of current entries. If the number is above 1000, we switch to a time based limit of 3 msec per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns Signed-off-by: Eric Dumazet [ paulmck: Forward-port and remove debug statements. ] Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 3e89b5b83ea0..71395e91b876 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -56,6 +56,7 @@ #include #include #include +#include #include "../time/tick-internal.h" #include "tree.h" @@ -416,6 +417,12 @@ module_param(qlowmark, long, 0444); static ulong jiffies_till_first_fqs = ULONG_MAX; static ulong jiffies_till_next_fqs = ULONG_MAX; static bool rcu_kick_kthreads; +static int rcu_divisor = 7; +module_param(rcu_divisor, int, 0644); + +/* Force an exit from rcu_do_batch() after 3 milliseconds. */ +static long rcu_resched_ns = 3 * NSEC_PER_MSEC; +module_param(rcu_resched_ns, long, 0644); /* * How long the grace period must be before we start recruiting @@ -2109,6 +2116,7 @@ static void rcu_do_batch(struct rcu_data *rdp) struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count; + long pending, tlimit = 0; /* If no callbacks are ready, just return. */ if (!rcu_segcblist_ready_cbs(&rdp->cblist)) { @@ -2130,7 +2138,10 @@ static void rcu_do_batch(struct rcu_data *rdp) local_irq_save(flags); rcu_nocb_lock(rdp); WARN_ON_ONCE(cpu_is_offline(smp_processor_id())); - bl = rdp->blimit; + pending = rcu_segcblist_n_cbs(&rdp->cblist); + bl = max(rdp->blimit, pending >> rcu_divisor); + if (unlikely(bl > 100)) + tlimit = local_clock() + rcu_resched_ns; trace_rcu_batch_start(rcu_state.name, rcu_segcblist_n_lazy_cbs(&rdp->cblist), rcu_segcblist_n_cbs(&rdp->cblist), bl); @@ -2153,6 +2164,13 @@ static void rcu_do_batch(struct rcu_data *rdp) (need_resched() || (!is_idle_task(current) && !rcu_is_callbacks_kthread()))) break; + if (unlikely(tlimit)) { + /* only call local_clock() every 32 callbacks */ + if (likely((-rcl.len & 31) || local_clock() < tlimit)) + continue; + /* Exceeded the time limit, so leave. */ + break; + } if (offloaded) { WARN_ON_ONCE(in_serving_softirq()); local_bh_enable(); From 24691069a348f82a95e0fa9697bb5656c6d8c48c Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Thu, 22 Aug 2019 10:53:43 +0900 Subject: [PATCH 85/86] rcu: Don't include in rcutiny.h The kbuild reported a build failure due to a header loop when RCUTINY is enabled with my pending riscv-nommu port. Switch rcutiny.h to only include the minimal required header to get HZ instead. Signed-off-by: Christoph Hellwig Signed-off-by: Paul E. McKenney --- include/linux/rcutiny.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 8e727f57d814..9bf1dfe7781f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -12,7 +12,7 @@ #ifndef __LINUX_TINY_H #define __LINUX_TINY_H -#include +#include /* for HZ */ /* Never flag non-existent other CPUs! */ static inline bool rcu_eqs_special_set(int cpu) { return false; } From 049b405029c00f3fd9e4ffa269bdd29b429c4672 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 26 Aug 2019 16:02:56 -0700 Subject: [PATCH 86/86] MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org Note that the paulmck@linux.ibm.com address still works most of the time. Signed-off-by: Paul E. McKenney --- MAINTAINERS | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 527317026492..e200eb56362a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9334,7 +9334,7 @@ M: Nicholas Piggin M: David Howells M: Jade Alglave M: Luc Maranget -M: "Paul E.
McKenney" R: Akira Yokosawa R: Daniel Lustig L: linux-kernel@vger.kernel.org @@ -10363,7 +10363,7 @@ F: drivers/platform/x86/mlx-platform.c MEMBARRIER SUPPORT M: Mathieu Desnoyers -M: "Paul E. McKenney" +M: "Paul E. McKenney" L: linux-kernel@vger.kernel.org S: Supported F: kernel/sched/membarrier.c @@ -13476,7 +13476,7 @@ S: Orphan F: drivers/net/wireless/ray* RCUTORTURE TEST FRAMEWORK -M: "Paul E. McKenney" +M: "Paul E. McKenney" M: Josh Triplett R: Steven Rostedt R: Mathieu Desnoyers @@ -13523,7 +13523,7 @@ F: arch/x86/include/asm/resctrl_sched.h F: Documentation/x86/resctrl* READ-COPY UPDATE (RCU) -M: "Paul E. McKenney" +M: "Paul E. McKenney" M: Josh Triplett R: Steven Rostedt R: Mathieu Desnoyers @@ -13681,7 +13681,7 @@ F: include/linux/reset-controller.h RESTARTABLE SEQUENCES SUPPORT M: Mathieu Desnoyers M: Peter Zijlstra -M: "Paul E. McKenney" +M: "Paul E. McKenney" M: Boqun Feng L: linux-kernel@vger.kernel.org S: Supported @@ -14710,7 +14710,7 @@ F: mm/sl?b* SLEEPABLE READ-COPY UPDATE (SRCU) M: Lai Jiangshan -M: "Paul E. McKenney" +M: "Paul E. McKenney" M: Josh Triplett R: Steven Rostedt R: Mathieu Desnoyers @@ -16207,7 +16207,7 @@ F: drivers/platform/x86/topstar-laptop.c TORTURE-TEST MODULES M: Davidlohr Bueso -M: "Paul E. McKenney" +M: "Paul E. McKenney" M: Josh Triplett L: linux-kernel@vger.kernel.org S: Supported