Commit graph

24338 commits

Author SHA1 Message Date
Paul E. McKenney 1e9a038b7f srcu: Expedited grace periods with reduced memory contention
Commit f60d231a87 ("srcu: Crude control of expedited grace periods")
introduced a per-srcu_struct atomic counter to track outstanding
requests for grace periods.  This works, but represents a memory-contention
bottleneck.  This commit therefore uses the srcu_node combining tree
to remove this bottleneck.

This commit adds new ->srcu_gp_seq_needed_exp fields to the
srcu_data, srcu_node, and srcu_struct structures, which track the
farthest-in-the-future grace period that must be expedited, which in
turn requires that all nearer-term grace periods also be expedited.
Requests for expediting start with the srcu_data structure, run up
through the srcu_node tree, and end at the srcu_struct structure.
Note that it may be necessary to expedite a grace period that just
now started, and this is handled by a new srcu_funnel_exp_start()
function, which is invoked when the grace period itself is already
in its way, but when that grace period was not marked as expedited.

A new srcu_get_delay() function returns zero if there is at least one
expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
This function is used to calculate delays:  Normal grace periods
are allowed to extend in order to cover more requests with a given
grace-period computation, which decreases per-request overhead.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 16:32:16 -07:00
Paul E. McKenney 7f6733c3c6 srcu: Make rcutorture writer stalls print SRCU GP state
In the past, SRCU was simple enough that there was little point in
making the rcutorture writer stall messages print the SRCU grace-period
number state.  With the advent of Tree SRCU, this has changed.  This
commit therefore makes Classic, Tiny, and Tree SRCU report this state
to rcutorture as needed.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 11:23:28 -07:00
Paul E. McKenney c7e88067c1 srcu: Exact tracking of srcu_data structures containing callbacks
The current Tree SRCU implementation schedules a workqueue for every
srcu_data covered by a given leaf srcu_node structure having callbacks,
even if only one of those srcu_data structures actually contains
callbacks.  This is clearly inefficient for workloads that don't feature
callbacks everywhere all the time.  This commit therefore adds an array
of masks that are used by the leaf srcu_node structures to track exactly
which srcu_data structures contain callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Mike Galbraith <efault@gmx.de>
2017-04-26 11:23:12 -07:00
Ingo Molnar 58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Paul E. McKenney f2094107ac Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' into HEAD
doc.2017.04.12a: Documentation updates
fixes.2017.04.19a: Miscellaneous fixes
srcu.2017.04.21a: Parallelize SRCU callback handling
2017-04-21 06:00:13 -07:00
Paul E. McKenney bcbfdd01dc rcu: Make non-preemptive schedule be Tasks RCU quiescent state
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place.  However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.

To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-21 05:59:27 -07:00
Paul E. McKenney 0497b489b8 srcu: Expedite srcu_schedule_cbs_snp() callback invocation
Although Tree SRCU does reduce delays when there is at least one
synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
still waits for SRCU_INTERVAL before invoking callbacks.  Since
synchronize_srcu_expedited() now posts a callback and waits for
that callback to do a wakeup, this destroys the expedited nature of
synchronize_srcu_expedited().  This destruction became apparent to
Marc Zyngier in the guise of a guest-OS bootup slowdown from five
seconds to no fewer than forty seconds.

This commit therefore invokes callbacks immediately at the end of the
grace period when there is at least one synchronize_srcu_expedited()
invocation pending.  This brought Marc's guest-OS bootup times back
into the realm of reason.

Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
2017-04-21 05:59:27 -07:00
Paul E. McKenney da915ad5cf srcu: Parallelize callback handling
Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
however, there are workloads that could result in a high volume of
concurrent invocations of call_srcu(), which with current SRCU would
result in excessive lock contention on the srcu_struct structure's
->queue_lock, which protects SRCU's callback lists.  This commit therefore
moves SRCU to per-CPU callback lists, thus greatly reducing contention.

Because a given SRCU instance no longer has a single centralized callback
list, starting grace periods and invoking callbacks are both more complex
than in the single-list Classic SRCU implementation.  Starting grace
periods and handling callbacks are now handled using an srcu_node tree
that is in some ways similar to the rcu_node trees used by RCU-bh,
RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
controlled by exactly the same Kconfig options and boot parameters that
control the shape of the rcu_node tree).

In addition, the old per-CPU srcu_array structure is now named srcu_data
and contains an rcu_segcblist structure named ->srcu_cblist for its
callbacks (and a spinlock to protect this).  The srcu_struct gets
an srcu_gp_seq that is used to associate callback segments with the
corresponding completion-time grace-period number.  These completion-time
grace-period numbers are propagated up the srcu_node tree so that the
grace-period workqueue handler can determine whether additional grace
periods are needed on the one hand and where to look for callbacks that
are ready to be invoked.

The srcu_barrier() function must now wait on all instances of the per-CPU
->srcu_cblist.  Because each ->srcu_cblist is protected by ->lock,
srcu_barrier() can remotely add the needed callbacks.  In theory,
it could also remotely start grace periods, but in practice doing so
is complex and racy.  And interestingly enough, it is never necessary
for srcu_barrier() to start a grace period because srcu_barrier() only
enqueues a callback when a callback is already present--and it turns out
that a grace period has to have already been started for this pre-existing
callback.  Furthermore, it is only the callback that srcu_barrier()
needs to wait on, not any particular grace period.  Therefore, a new
rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
callback into the same segment occupied by the last pre-existing callback
in the list.  The special case where all the pre-existing callbacks are
on a different list (because they are in the process of being invoked)
is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
segment, relying on the done-callbacks check that takes place after all
callbacks are inovked.

Note that the readers use the same algorithm as before.  Note that there
is a separate srcu_idx that tells the readers what counter to increment.
This unfortunately cannot be combined with srcu_gp_seq because they
need to be incremented at different times.

This commit introduces some ugly #ifdefs in rcutorture.  These will go
away when I feel good enough about Tree SRCU to ditch Classic SRCU.

Some crude performance comparisons, courtesy of a quickly hacked rcuperf
asynchronous-grace-period capability:

			Callback Queuing Overhead
			-------------------------
	# CPUS		Classic SRCU	Tree SRCU
	------          ------------    ---------
	     2              0.349 us     0.342 us
	    16             31.66  us     0.4   us
	    41             ---------     0.417 us

The times are the 90th percentiles, a statistic that was chosen to reject
the overheads of the occasional srcu_barrier() call needed to avoid OOMing
the test machine.  The rcuperf test hangs when running Classic SRCU at 41
CPUs, hence the line of dashes.  Despite the hacks to both the rcuperf code
and that statistics, this is a convincing demonstration of Tree SRCU's
performance and scalability advantages.

[1] https://lwn.net/Articles/309030/
[2] https://patchwork.kernel.org/patch/5108281/

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
2017-04-21 05:59:26 -07:00
Linus Torvalds 160062e190 While continuing my development, I uncovered two more small bugs.
One is a race condition when enabling the snapshot function probe
 trigger. It enables the probe before allocating the snapshot, and
 if the probe triggers first, it stops tracing with a warning that
 the snapshot buffer was not allocated.
 
 The seconds is that the snapshot file should show how to use it when
 it is empty. But a bug fix from long ago broke the "is empty" test
 and the snapshot file no longer displays the help message.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY+L3dFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 DyQH/j4ZoRhc+XziMw7iJxNvDfptT9XFawqTKDdYJ3nMsFp+40bzlfYah94b1YYQ
 YTLnvlxtiYUo1rifOnsdY913IKLc1wtO/a/S8/qqUJ1+7ik46zgaPYqNQlvM6clV
 xoJQ6+c631SbJ3KuhadvXTABvzF4Qc1w0/f81lzGgYE8IB2VxiWeYZDMVxe5r2oM
 A0seve9C5Wps39m/kcFHSVMZwpk6s7gZL7ERcME4dOewJpQ7b0ufWXMsBssD0bMx
 G0ihBdfeM6TzXSTtrnLzU9eZaUtfh37olpvjpJzdIUUqwVpSrxOKmLcsYCIeNs3f
 YuS54g7kEsDqLxGJvkC0UBou2rU=
 =DQC3
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull two more ftrace fixes from Steven Rostedt:
 "While continuing my development, I uncovered two more small bugs.

  One is a race condition when enabling the snapshot function probe
  trigger. It enables the probe before allocating the snapshot, and if
  the probe triggers first, it stops tracing with a warning that the
  snapshot buffer was not allocated.

  The seconds is that the snapshot file should show how to use it when
  it is empty. But a bug fix from long ago broke the "is empty" test and
  the snapshot file no longer displays the help message"

* tag 'trace-v4.11-rc5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Have ring_buffer_iter_empty() return true when empty
  tracing: Allocate the snapshot buffer before enabling probe
2017-04-20 12:30:10 -07:00
Steven Rostedt (VMware) 78f7a45dac ring-buffer: Have ring_buffer_iter_empty() return true when empty
I noticed that reading the snapshot file when it is empty no longer gives a
status. It suppose to show the status of the snapshot buffer as well as how
to allocate and use it. For example:

 ># cat snapshot
 # tracer: nop
 #
 #
 # * Snapshot is allocated *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

But instead it just showed an empty buffer:

 ># cat snapshot
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 0/0   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |

What happened was that it was using the ring_buffer_iter_empty() function to
see if it was empty, and if it was, it showed the status. But that function
was returning false when it was empty. The reason was that the iter header
page was on the reader page, and the reader page was empty, but so was the
buffer itself. The check only tested to see if the iter was on the commit
page, but the commit page was no longer pointing to the reader page, but as
all pages were empty, the buffer is also.

Cc: stable@vger.kernel.org
Fixes: 651e22f270 ("ring-buffer: Always reset iterator to reader page")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 21:23:47 -04:00
Steven Rostedt (VMware) df62db5be2 tracing: Allocate the snapshot buffer before enabling probe
Currently the snapshot trigger enables the probe and then allocates the
snapshot. If the probe triggers before the allocation, it could cause the
snapshot to fail and turn tracing off. It's best to allocate the snapshot
buffer first, and then enable the trigger. If something goes wrong in the
enabling of the trigger, the snapshot buffer is still allocated, but it can
also be freed by the user by writting zero into the snapshot buffer file.

Also add a check of the return status of alloc_snapshot().

Cc: stable@vger.kernel.org
Fixes: 77fd5c15e3 ("tracing: Add snapshot trigger to function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19 14:19:08 -04:00
Paul E. McKenney bfd090be14 rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
This commit just changes a "the the" to "the" to reduce repetition.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:20 -07:00
Nicholas Mc Guire 5455a7f6a8 rcu: Use true/false in assignment to bool
This commit makes the parse_rcu_nocb_poll() function assign true
(rather than the constant 1) to the bool variable rcu_nocb_poll.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:20 -07:00
Nicholas Mc Guire 50dc7def4a rcu: Use bool value directly
The beenonline variable is declared bool so there is no need for an
explicit comparison, especially not against the constant zero.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:19 -07:00
Paul E. McKenney deb34f3643 rcu: Improve comments for hotplug/suspend/hibernate functions
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:18 -07:00
Paul E. McKenney d1e4f01d09 rcu: Remove obsolete comment from rcu_future_gp_cleanup() header
The rcu_nocb_gp_cleanup() function is now invoked elsewhere, so this
commit drags this comment into the year 2017.

Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-19 09:29:17 -07:00
Paul E. McKenney a5dd63efda lockdep: Use "WARNING" tag on lockdep splats
This commit changes lockdep splats to begin lines with "WARNING" and
to use pr_warn() instead of printk().  This change eases scripted
analysis of kernel console output.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2017-04-19 09:27:29 -07:00
Linus Torvalds 005882e53d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "Two Sparc bug fixes from Daniel Jordan and Nitin Gupta"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix hugepage page table free
  sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
2017-04-18 13:56:51 -07:00
Linus Torvalds 40d9018eb7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF tail call handling bug fixes from Daniel Borkmann.

 2) Fix allowance of too many rx queues in sfc driver, from Bert
    Kenward.

 3) Non-loopback ipv6 packets claiming src of ::1 should be dropped,
    from Florian Westphal.

 4) Statistics requests on KSZ9031 can crash, fix from Grygorii
    Strashko.

 5) TX ring handling fixes in mediatek driver, from Sean Wang.

 6) ip_ra_control can deadlock, fix lock acquisition ordering to fix,
    from Cong WANG.

 7) Fix use after free in ip_recv_error(), from Willem de Buijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  bpf: fix checking xdp_adjust_head on tail calls
  bpf: fix cb access in socket filter programs on tail calls
  ipv6: drop non loopback packets claiming to originate from ::1
  net: ethernet: mediatek: fix inconsistency of port number carried in TXD
  net: ethernet: mediatek: fix inconsistency between TXD and the used buffer
  net: phy: micrel: fix crash when statistic requested for KSZ9031 phy
  net: vrf: Fix setting NLM_F_EXCL flag when adding l3mdev rule
  net: thunderx: Fix set_max_bgx_per_node for 81xx rgx
  net-timestamp: avoid use-after-free in ip_recv_error
  ipv4: fix a deadlock in ip_ra_control
  sfc: limit the number of receive queues
2017-04-18 13:24:42 -07:00
Daniel Jordan 395102db44 sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the required 32MB limit, but this
option is not set for every config that enables lockdep.

A 4.10 kernel fails to boot with the console output

    Kernel: Using 8 locked TLB entries for main kernel image.
    hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
    Program terminated

with these config options

    CONFIG_LOCKDEP=y
    CONFIG_LOCK_STAT=y
    CONFIG_PROVE_LOCKING=n

To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.

Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.

Fixes: e6b5f1be7a ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:11:07 -07:00
Paul E. McKenney 5f0d5a3ae7 mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section.  Of course, that is not the
case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety".  This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
  Dumazet, in order to help people familiar with the old name find
  the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
2017-04-18 11:42:36 -07:00
Paul E. McKenney dad81a2026 srcu: Introduce CLASSIC_SRCU Kconfig option
The TREE_SRCU rewrite is large and a bit on the non-simple side, so
this commit helps reduce risk by allowing the old v4.11 SRCU algorithm
to be selected using a new CLASSIC_SRCU Kconfig option that depends
on RCU_EXPERT.  The default is to use the new TREE_SRCU and TINY_SRCU
algorithms, in order to help get these the testing that they need.
However, if your users do not require the update-side scalability that
is to be provided by TREE_SRCU, select RCU_EXPERT and then CLASSIC_SRCU
to revert back to the old classic SRCU algorithm.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:23 -07:00
Paul E. McKenney 32071141b2 srcutorture: Print Tiny SRCU reader statistics
The srcu_torture_stats() function is adapted to the specific srcu_struct
layout traditionally used by SRCU.  This commit therefore adds support
for Tiny SRCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:22 -07:00
Paul E. McKenney d8be81735a srcu: Create a tiny SRCU
In response to automated complaints about modifications to SRCU
increasing its size, this commit creates a tiny SRCU that is
used in SMP=n && PREEMPT=n builds.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:22 -07:00
Paul E. McKenney f60d231a87 srcu: Crude control of expedited grace periods
SRCU's implementation of expedited grace periods has always assumed
that the SRCU instance is idle when the expedited request arrives.
This commit improves this a bit by maintaining a count of the number
of outstanding expedited requests, thus allowing prior non-expedited
grace periods accommodate these requests by shifting to expedited mode.
However, any non-expedited wait already in progress will still wait for
the full duration.

Improved control of expedited grace periods is planned, but one step
at a time.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:22 -07:00
Paul E. McKenney 80a7956fe3 srcu: Merge ->srcu_state into ->srcu_gp_seq
Updating ->srcu_state and ->srcu_gp_seq will lead to extremely complex
race conditions given multiple callback queues, so this commit takes
advantage of the two-bit state now available in rcu_seq counters to
store the state in the bottom two bits of ->srcu_gp_seq.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:22 -07:00
Paul E. McKenney f1ec57a462 srcu: Allow a second bit in rcu_seq for SRCU state
This commit increases the number of reserved bits at the bottom of an
rcu_seq grace-period counter from one to two, as will be needed to
accommodate SRCU's three-state grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney 031aeee000 srcu: Improve rcu_seq grace-period-counter abstraction
The expedited grace-period code contains several open-coded shifts
know the format of an rcu_seq grace-period counter, which is not
particularly good style.  This commit therefore creates a new
rcu_seq_ctr() function that extracts the counter portion of the
counter, and an rcu_seq_state() function that extracts the low-order
state bit.  This commit prepares for SRCU callback parallelization,
which will require two state bits.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney 91e27c356c srcu: Fix bogus try_check_zero() comment
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney e95d68d212 srcu: Make num_rcu_lvl[] array be external
This commit makes the num_rcu_lvl[] array external so that SRCU can
make use of it for initializing its upcoming srcu_node tree.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney efbe451d46 srcu: Move rcu_node traversal macros to rcu.h
This commit moves rcu_for_each_node_breadth_first(),
rcu_for_each_nonleaf_node_breadth_first(), and
rcu_for_each_leaf_node() from kernel/rcu/tree.h to
kernel/rcu/rcu.h so that SRCU can access them.
This commit is code-movement only.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney 41f5c63178 rcu: Remove redundant levelcnt[] array from rcu_init_one()
The levelcnt[] array is identical to num_rcu_lvl[], so this commit
removes levelcnt[].

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:21 -07:00
Paul E. McKenney 2b34c43cc1 srcu: Move rcu_init_levelspread() to rcu_tree_node.h
This commit moves the rcu_init_levelspread() function from
kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it.  This is
another step towards enabling SRCU to create its own combining tree.
This commit is code-movement only, give or take knock-on adjustments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:20 -07:00
Paul E. McKenney f2425b4efb srcu: Move combining-tree definitions for SRCU's benefit
This commit moves the C preprocessor code that defines the default shape
of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
file as a first step towards enabling SRCU to create its own combining
tree, which in turn enables SRCU to implement per-CPU callback handling,
thus avoiding contention on the lock currently guarding the single list
of callbacks.  Note that users of SRCU still need to know the size of
the srcu_struct structure, hence include/linux rather than kernel/rcu.

This commit is code-movement only.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:20 -07:00
Paul E. McKenney 8660b7d8a5 srcu: Use rcu_segcblist to track SRCU callbacks
This commit switches SRCU from custom-built callback queues to the new
rcu_segcblist structure.  This change associates grace-period sequence
numbers with groups of callbacks, which will be needed for efficient
processing of per-CPU callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:20 -07:00
Paul E. McKenney ac367c1c62 srcu: Add grace-period sequence numbers
This commit adds grace-period sequence numbers, which will be used to
handle mid-boot grace periods and per-CPU callback lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:20 -07:00
Paul E. McKenney c2a8ec0778 srcu: Move to state-based grace-period sequencing
The current SRCU grace-period processing might never reach the last
portion of srcu_advance_batches().  This is OK given the current
implementation, as the first portion, up to the try_check_zero()
following the srcu_flip() is sufficient to drive grace periods forward.
However, it has the unfortunate side-effect of making it impossible to
determine when a given grace period has ended, and it will be necessary
to efficiently trace ends of grace periods in order to efficiently handle
per-CPU SRCU callback lists.

This commit therefore adds states to the SRCU grace-period processing,
so that the end of a given SRCU grace period is marked by the transition
to the SRCU_STATE_DONE state.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:20 -07:00
Paul E. McKenney c6e56f593a srcu: Push srcu_advance_batches() fastpath into common case
This commit simplifies the SRCU state machine by pushing the
srcu_advance_batches() idle-SRCU fastpath into the common case.  This is
done by giving srcu_reschedule() a delay parameter, which is zero in
the call from srcu_advance_batches().

This commit is a step towards numbering callbacks in order to
efficiently handle per-CPU callback lists.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:19 -07:00
Dmitry Vyukov f010ed82c7 rcu: Fix warning in rcu_seq_end()
The rcu_seq_end() function increments seq signifying completion
of a grace period, after that checks that the seq is even and wakes
_synchronize_rcu_expedited().  The _synchronize_rcu_expedited() function
uses wait_event() to wait for even seq.  The problem is that wait_event()
can return as soon as seq becomes even without waiting for the wakeup.
In such case the warning in rcu_seq_end() can falsely fire if the next
expedited grace period starts before the check.

Check that seq has good value before incrementing it.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: syzkaller@googlegroups.com
Cc: linux-kernel@vger.kernel.org
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: josh@joshtriplett.org
Cc: jiangshanlai@gmail.com
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

---

syzkaller-triggered warning:

WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: events wait_rcu_exp_gp
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 panic+0x1fb/0x412 kernel/panic.c:179
 __warn+0x1c4/0x1e0 kernel/panic.c:540
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
 rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
 rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
 rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
 wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
 process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
 worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
 kthread+0x326/0x3f0 kernel/kthread.c:227
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
---
2017-04-18 11:38:19 -07:00
Paul E. McKenney 3c345825c8 rcu: Expedited wakeups need to be fully ordered
Expedited grace periods use workqueue handlers that wake up the requesters,
but there is no lock mediating this wakeup.  Therefore, memory barriers
are required to ensure that the handler's memory references are seen by
all to occur before synchronize_*_expedited() returns to its caller.
Possibly detected by syzkaller.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:19 -07:00
Paul E. McKenney 2e8c28c2dd srcu: Move rcu_seq_start() and friends to rcu.h
This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(),
and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h.
This will allow SRCU to use these functions, which in turn will
allow SRCU to move from a single global callback queue to a
per-CPU callback queue.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:19 -07:00
Paul E. McKenney bdcabf4c7d rcu: Add single-element dequeue functions to rcu_segcblist
This commit adds single-element dequeue functions to rcu_segcblist.
These are less efficient than using the extract and insert functions,
but allow more precise debugging code.  These functions are thus
expected to be used only in debug builds, for example, CONFIG_PROVE_RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:19 -07:00
Paul E. McKenney b5eaeaa509 srcu: Allow early boot use of synchronize_srcu()
This commit checks for pre-scheduler state, and if that early in the
boot process, synchronize_srcu() and friends are no-ops.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
Paul E. McKenney 900b1028ec srcu: Allow SRCU to access rcu_scheduler_active
This is primarily a code-movement commit in preparation for allowing
SRCU to handle early-boot SRCU grace periods.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
Paul E. McKenney 15fecf89e4 srcu: Abstract multi-tail callback list handling
RCU has only one multi-tail callback list, which is implemented via
the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
rcu_data structure, and whose operations are open-code throughout the
Tree RCU implementation.  This has been more or less OK in the past,
but upcoming callback-list optimizations in SRCU could really use
a multi-tail callback list there as well.

This commit therefore abstracts the multi-tail callback list handling
into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
The simple head-and-tail pointer callback list is also abstracted and
applied everywhere except for the NOCB callback-offload lists.  (Yes,
the plan is to apply them there as well, but this commit is already
bigger than would be good.)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
Paul E. McKenney b8c78d3afc rcu: Default RCU_FANOUT_LEAF to 16 unless explicitly changed
If the RCU_EXPERT Kconfig option is not set (the default), then the
RCU_FANOUT_LEAF Kconfig option will not be defined, which will cause
the leaf-level rcu_node tree fanout to default to 32 on 32-bit systems
and 64 on 64-bit systems.  This can result in excessive lock contention.
This commit therefore changes the computation of the leaf-level rcu_node
tree fanout so that the result will be 16 unless an explicit Kconfig or
kernel-boot setting says otherwise.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
Paul E. McKenney 9226b10d78 rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions
The rcu_all_qs() and rcu_note_context_switch() do a series of checks,
taking various actions to supply RCU with quiescent states, depending
on the outcomes of the various checks.  This is a bit much for scheduling
fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
the rcu_dynticks structure that acts as a global guard for these checks.
Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
check the ->rcu_urgent_qs field, find it false, and simply return.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-04-18 11:38:18 -07:00
Paul E. McKenney 0f9be8cabb rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle()
The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
that one of them still needs a quiescent state before doing an expensive
atomic operation on the ->dynticks counter.  However, this check reduces
overhead only after a rare race condition, and increases complexity.  This
commit therefore removes the scan and the mechanism enabling the scan.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:18 -07:00
Paul E. McKenney 9577df9a31 rcu: Pull rcu_qs_ctr into rcu_dynticks structure
The rcu_qs_ctr variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:17 -07:00
Paul E. McKenney abb06b9948 rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure
The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
so this commit pulls it into the pre-existing rcu_dynticks per-CPU
structure.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-18 11:38:17 -07:00