tools/memory-model/Documentation: Add plain accesses and data races to explanation.txt

commit c58a801701
parent ddc82999f0

This patch updates the Linux Kernel Memory Model's explanation.txt file
by adding a section devoted to the model's handling of plain accesses
and data-race detection.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
@@ -29,7 +29,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
   21. THE PROPAGATES-BEFORE RELATION: pb
   22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
   23. LOCKING
-  24. ODDS AND ENDS
+  24. PLAIN ACCESSES AND DATA RACES
+  25. ODDS AND ENDS
@@ -42,8 +43,7 @@ linux-kernel.bell and linux-kernel.cat files that make up the formal
 version of the model; they are extremely terse and their meanings are
 far from clear.

-This document describes the ideas underlying the LKMM, but excluding
-the modeling of bare C (or plain) shared memory accesses. It is meant
+This document describes the ideas underlying the LKMM. It is meant
 for people who want to understand how the model was designed. It does
 not go into the details of the code in the .bell and .cat files;
 rather, it explains in English what the code expresses symbolically.
@@ -857,7 +857,7 @@ outlined above. These restrictions involve the necessity of
 maintaining cache coherence and the fact that a CPU can't operate on a
 value before it knows what that value is, among other things.

-The formal version of the LKMM is defined by five requirements, or
+The formal version of the LKMM is defined by six requirements, or
 axioms:

 	Sequential consistency per variable: This requires that the
@@ -877,10 +877,14 @@ axioms:
 	grace periods obey the rules of RCU, in particular, the
 	Grace-Period Guarantee.

+	Plain-coherence: This requires that plain memory accesses
+	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
+	the operational model's rules regarding cache coherence.
+
 The first and second are quite common; they can be found in many
 memory models (such as those for C11/C++11). The "happens-before" and
 "propagation" axioms have analogs in other memory models as well. The
-"rcu" axiom is specific to the LKMM.
+"rcu" and "plain-coherence" axioms are specific to the LKMM.

 Each of these axioms is discussed below.
@@ -1915,6 +1919,521 @@ architectures supported by the Linux kernel, albeit for various
differing reasons.


PLAIN ACCESSES AND DATA RACES
-----------------------------

In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
smp_load_acquire(&z), and so on are collectively referred to as
"marked" accesses, because they are all annotated with special
operations of one kind or another. Ordinary C-language memory
accesses such as x or y = 0 are simply called "plain" accesses.
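For instance, side by side (a minimal illustrative fragment; the
variable declarations and destination registers are assumed):

        int x, y;
        atomic_t a;
        int r1, r2;

        WRITE_ONCE(x, 1);       /* marked store */
        r1 = READ_ONCE(x);      /* marked load */
        atomic_inc(&a);         /* marked read-modify-write */

        y = 0;                  /* plain store */
        r2 = y;                 /* plain load */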
Early versions of the LKMM had nothing to say about plain accesses.
The C standard allows compilers to assume that the variables affected
by plain accesses are not concurrently read or written by any other
threads or CPUs. This leaves compilers free to implement all manner
of transformations or optimizations of code containing plain accesses,
making such code very difficult for a memory model to handle.

Here is just one example of a possible pitfall:

        int a = 6;
        int *x = &a;

        P0()
        {
                int *r1;
                int r2 = 0;

                r1 = x;
                if (r1 != NULL)
                        r2 = READ_ONCE(*r1);
        }

        P1()
        {
                WRITE_ONCE(x, NULL);
        }

On the face of it, one would expect that when this code runs, the only
possible final values for r2 are 6 and 0, depending on whether or not
P1's store to x propagates to P0 before P0's load from x executes.
But since P0's load from x is a plain access, the compiler may decide
to carry out the load twice (for the comparison against NULL, then again
for the READ_ONCE()) and eliminate the temporary variable r1. The
object code generated for P0 could therefore end up looking rather
like this:

        P0()
        {
                int r2 = 0;

                if (x != NULL)
                        r2 = READ_ONCE(*x);
        }

And now it is obvious that this code runs the risk of dereferencing a
NULL pointer, because P1's store to x might propagate to P0 after the
test against NULL has been made but before the READ_ONCE() executes.
If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
the compiler would not have performed this optimization and there
would be no possibility of a NULL-pointer dereference.
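For completeness, here is the repaired version of P0() that the
preceding paragraph describes, with the pointer load marked:

        P0()
        {
                int *r1;
                int r2 = 0;

                r1 = READ_ONCE(x);      /* loaded exactly once */
                if (r1 != NULL)
                        r2 = READ_ONCE(*r1);
        }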
Given the possibility of transformations like this one, the LKMM
doesn't try to predict all possible outcomes of code containing plain
accesses. It is instead content to determine whether the code
violates the compiler's assumptions, which would render the ultimate
outcome undefined.

In technical terms, the compiler is allowed to assume that when the
program executes, there will not be any data races. A "data race"
occurs when two conflicting memory accesses execute concurrently;
two memory accesses "conflict" if:

        they access the same location,

        they occur on different CPUs (or in different threads on the
        same CPU),

        at least one of them is a plain access,

        and at least one of them is a store.
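For example, the following sketch (a hypothetical fragment) satisfies
all four conditions, so it contains a data race:

        int x;

        P0()
        {
                x = 1;          /* plain store */
        }

        P1()
        {
                int r1;

                r1 = x;         /* load of the same location on a
                                   different CPU, potentially
                                   concurrent with P0's plain store */
        }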
The LKMM tries to determine whether a program contains two conflicting
accesses which may execute concurrently; if it does then the LKMM says
there is a potential data race and makes no predictions about the
program's outcome.

Determining whether two accesses conflict is easy; you can see that
all the concepts involved in the definition above are already part of
the memory model. The hard part is telling whether they may execute
concurrently. The LKMM takes a conservative attitude, assuming that
accesses may be concurrent unless it can prove they cannot.

If two memory accesses aren't concurrent then one must execute before
the other. Therefore the LKMM decides two accesses aren't concurrent
if they can be connected by a sequence of hb, pb, and rb links
(together referred to as xb, for "executes before"). However, there
are two complicating factors.

If X is a load and X executes before a store Y, then indeed there is
no danger of X and Y being concurrent. After all, Y can't have any
effect on the value obtained by X until the memory subsystem has
propagated Y from its own CPU to X's CPU, which won't happen until
some time after Y executes and thus after X executes. But if X is a
store, then even if X executes before Y it is still possible that X
will propagate to Y's CPU just as Y is executing. In such a case X
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.

Therefore when X is a store, for X and Y to be non-concurrent the LKMM
requires not only that X must execute before Y but also that X must
propagate to Y's CPU before Y executes. (Or vice versa, of course, if
Y executes before X -- then Y must propagate to X's CPU before X
executes if Y is a store.) This is expressed by the visibility
relation (vis), where X ->vis Y is defined to hold if there is an
intermediate event Z such that:

        X is connected to Z by a possibly empty sequence of
        cumul-fence links followed by an optional rfe link (if none of
        these links are present, X and Z are the same event),

and either:

        Z is connected to Y by a strong-fence link followed by a
        possibly empty sequence of xb links,

or:

        Z is on the same CPU as Y and is connected to Y by a possibly
        empty sequence of xb links (again, if the sequence is empty it
        means Z and Y are the same event).

The motivations behind this definition are straightforward:

        cumul-fence memory barriers force stores that are po-before
        the barrier to propagate to other CPUs before stores that are
        po-after the barrier.

        An rfe link from an event W to an event R says that R reads
        from W, which certainly means that W must have propagated to
        R's CPU before R executed.

        strong-fence memory barriers force stores that are po-before
        the barrier, or that propagate to the barrier's CPU before the
        barrier executes, to propagate to all CPUs before any events
        po-after the barrier can execute.

To see how this works out in practice, consider our old friend, the MP
pattern (with fences and statement labels, but without the conditional
test):

        int buf = 0, flag = 0;

        P0()
        {
                X: WRITE_ONCE(buf, 1);
                smp_wmb();
                W: WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                Z: r1 = READ_ONCE(flag);
                smp_rmb();
                Y: r2 = READ_ONCE(buf);
        }

The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
assuming r1 = 1 at the end, there is an rfe link from W to Z. This
means that the store to buf must propagate from P0 to P1 before Z
executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
provides an xb link from Z to Y (i.e., it forces Z to execute before
Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
executes.

The second complicating factor mentioned above arises from the fact
that when we are considering data races, some of the memory accesses
are plain. Now, although we have not said so explicitly, up to this
point most of the relations defined by the LKMM (ppo, hb, prop,
cumul-fence, pb, and so on -- including vis) apply only to marked
accesses.

There are good reasons for this restriction. The compiler is not
allowed to apply fancy transformations to marked accesses, and
consequently each such access in the source code corresponds more or
less directly to a single machine instruction in the object code. But
plain accesses are a different story; the compiler may combine them,
split them up, duplicate them, eliminate them, invent new ones, and
who knows what else. Seeing a plain access in the source code tells
you almost nothing about what machine instructions will end up in the
object code.

Fortunately, the compiler isn't completely free; it is subject to some
limitations. For one, it is not allowed to introduce a data race into
the object code if the source code does not already contain a data
race (if it could, memory models would be useless and no multithreaded
code would be safe!). For another, it cannot move a plain access past
a compiler barrier.

A compiler barrier is a kind of fence, but as the name implies, it
only affects the compiler; it does not necessarily have any effect on
how instructions are executed by the CPU. In Linux kernel source
code, the barrier() function is a compiler barrier. It doesn't give
rise directly to any machine instructions in the object code; rather,
it affects how the compiler generates the rest of the object code.
Given source code like this:

        ... some memory accesses ...
        barrier();
        ... some other memory accesses ...

the barrier() function ensures that the machine instructions
corresponding to the first group of accesses will all end po-before
any machine instructions corresponding to the second group of accesses
-- even if some of the accesses are plain. (Of course, the CPU may
then execute some of those accesses out of program order, but we
already know how to deal with such issues.) Without the barrier()
there would be no such guarantee; the two groups of accesses could be
intermingled or even reversed in the object code.
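For instance, in a hypothetical fragment like this one, the compiler
must keep the two plain stores in order in the object code:

        x = 1;          /* plain store */
        barrier();      /* emits no machine instructions */
        y = 2;          /* must remain po-after the store to x */

(The CPU may still execute the two stores out of program order;
barrier() constrains only the compiler.)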
The LKMM doesn't say much about the barrier() function, but it does
require that all fences are also compiler barriers. In addition, it
requires that the ordering properties of memory barriers such as
smp_rmb() or smp_store_release() apply to plain accesses as well as to
marked accesses.

This is the key to analyzing data races. Consider the MP pattern
again, now using plain accesses for buf:

        int buf = 0, flag = 0;

        P0()
        {
                U: buf = 1;
                smp_wmb();
                X: WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                Y: r1 = READ_ONCE(flag);
                if (r1) {
                        smp_rmb();
                        V: r2 = buf;
                }
        }

This program does not contain a data race. Although the U and V
accesses conflict, the LKMM can prove they are not concurrent as
follows:

        The smp_wmb() fence in P0 is both a compiler barrier and a
        cumul-fence. It guarantees that no matter what hash of
        machine instructions the compiler generates for the plain
        access U, all those instructions will be po-before the fence.
        Consequently U's store to buf, no matter how it is carried out
        at the machine level, must propagate to P1 before X's store to
        flag does.

        X and Y are both marked accesses. Hence an rfe link from X to
        Y is a valid indicator that X propagated to P1 before Y
        executed, i.e., X ->vis Y. (And if there is no rfe link then
        r1 will be 0, so V will not be executed and ipso facto won't
        race with U.)

        The smp_rmb() fence in P1 is a compiler barrier as well as a
        fence. It guarantees that all the machine-level instructions
        corresponding to the access V will be po-after the fence, and
        therefore any loads among those instructions will execute
        after the fence does and hence after Y does.

Thus U's store to buf is forced to propagate to P1 before V's load
executes (assuming V does execute), ruling out the possibility of a
data race between them.
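For contrast, consider a hypothetical variant of P1 with the
smp_rmb() removed:

        P1()
        {
                int r1;
                int r2 = 0;

                Y: r1 = READ_ONCE(flag);
                if (r1)
                        V: r2 = buf;    /* no longer pre-bounded */
        }

Here nothing orders the machine-level loads for V after Y (the control
dependency does not pre-bound a plain load), so the LKMM would report
a potential data race between U and V.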
This analysis illustrates how the LKMM deals with plain accesses in
general. Suppose R is a plain load and we want to show that R
executes before some marked access E. We can do this by finding a
marked access X such that R and X are ordered by a suitable fence and
X ->xb* E. If E were also a plain access, we would also look for a
marked access Y such that X ->xb* Y, and Y and E are ordered by a
fence. We describe this arrangement by saying that R is
"post-bounded" by X and E is "pre-bounded" by Y.

In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
This distinction is needed because some fences affect only loads
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same. And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
above were a marked load then X could simply be taken to be R itself).

The need to distinguish between r- and w-bounding raises yet another
issue. When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:

        x = 1;

the compiler is theoretically allowed to generate object code that
looks like:

        if (x != 1)
                x = 1;

thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious. We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this. Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed. This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location. Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load. Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do. Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.
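To make the forbidden transformation concrete: given the plain load

        r1 = x;

the compiler must not generate object code along the lines of

        r1 = x;
        x = r1;         /* invented store: would race with concurrent
                           loads of x on other CPUs */

whereas augmenting the plain store x = 1 with a load, as shown above,
is permitted.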
The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load. That
is, in the sequence:

        p = READ_ONCE(ptr);
        r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load. This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold. Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races. For example:

        int a = 1, b;
        int *ptr = &a;

        P0()
        {
                b = 2;
                rcu_assign_pointer(ptr, &b);
        }

        P1()
        {
                int *p;
                int r;

                rcu_read_lock();
                p = rcu_dereference(ptr);
                r = *p;
                rcu_read_unlock();
        }

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods. They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order. However, rcu_dereference()
is only equivalent to READ_ONCE(). While it is a marked access, it is
not a fence or compiler barrier. Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one. In particular, comparisons between the pointer and
other known addresses can cause trouble. If you have something like:

        p = rcu_dereference(ptr);
        if (p == &x)
                r = *p;

then the compiler just might generate object code resembling:

        p = rcu_dereference(ptr);
        if (p == &x)
                r = x;

or even:

        rtemp = x;
        p = rcu_dereference(ptr);
        if (p == &x)
                r = rtemp;

which would invalidate the memory model's assumption, since the CPU
could now perform the load of x before the load of ptr (there might be
a control dependency but no address dependency at the machine level).

Finally, it turns out there is a situation in which a plain write does
not need to be w-post-bounded: when it is separated from the
conflicting access by a fence. At first glance this may seem
impossible. After all, to be conflicting the second access has to be
on a different CPU from the first, and fences don't link events on
different CPUs. Well, normal fences don't -- but rcu-fence can!
Here's an example:

        int x, y;

        P0()
        {
                WRITE_ONCE(x, 1);
                synchronize_rcu();
                y = 3;
        }

        P1()
        {
                rcu_read_lock();
                if (READ_ONCE(x) == 0)
                        y = 2;
                rcu_read_unlock();
        }

Do the plain stores to y race? Clearly not if P1 reads a non-zero
value for x, so let's assume the READ_ONCE(x) does obtain 0. This
means that the read-side critical section in P1 must finish executing
before the grace period in P0 does, because RCU's Grace-Period
Guarantee says that otherwise P0's store to x would have propagated to
P1 before the critical section started and so would have been visible
to the READ_ONCE(). (Another way of putting it is that the fre link
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
between those two events.)

This means there is an rcu-fence link from P1's "y = 2" store to P0's
"y = 3" store, and consequently the first must propagate from P1 to P0
before the second can execute. Therefore the two stores cannot be
concurrent and there is no race, even though P1's plain store to y
isn't w-post-bounded by any marked accesses.

Putting all this material together yields the following picture. For
two conflicting stores W and W', where W ->co W', the LKMM says the
stores don't race if W can be linked to W' by a

        w-post-bounded ; vis ; w-pre-bounded

sequence. If W is plain then they also have to be linked by an

        r-post-bounded ; xb* ; w-pre-bounded

sequence, and if W' is plain then they also have to be linked by a

        w-post-bounded ; vis ; r-pre-bounded

sequence. For a conflicting load R and store W, the LKMM says the two
accesses don't race if R can be linked to W by an

        r-post-bounded ; xb* ; w-pre-bounded

sequence or if W can be linked to R by a

        w-post-bounded ; vis ; r-pre-bounded

sequence. For the cases involving a vis link, the LKMM also accepts
sequences in which W is linked to W' or R by a

        strong-fence ; xb* ; {w and/or r}-pre-bounded

sequence with no post-bounding, and in every case the LKMM also allows
the link simply to be a fence with no bounding at all. If no sequence
of the appropriate sort exists, the LKMM says that the accesses race.
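As a worked instance: in the plain-buf MP pattern above, U is
w-post-bounded by X (thanks to the smp_wmb()), X ->vis Y (the rfe link
from X to Y), and V is r-pre-bounded by Y (thanks to the smp_rmb()).
U and V are therefore linked by a w-post-bounded ; vis ; r-pre-bounded
sequence, which is exactly why that program is race-free.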
There is one more part of the LKMM related to plain accesses (although
not to data races) we should discuss. Recall that many relations such
as hb are limited to marked accesses only. As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) do not apply to plain
accesses. Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

        If R and W conflict and it is possible to link R to W by one
        of the xb* sequences listed above, then W ->rfe R is not
        allowed (i.e., a load cannot read from a store that it
        executes before, even if one or both is plain).

        If W and R conflict and it is possible to link W to R by one
        of the vis sequences listed above, then R ->fre W is not
        allowed (i.e., if a store is visible to a load then the load
        must read from that store or one coherence-after it).

        If W and W' conflict and it is possible to link W to W' by one
        of the vis sequences listed above, then W' ->co W is not
        allowed (i.e., if one store is visible to a second then the
        second must come after the first in the coherence order).
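For example, the second rule is what pins down the outcome of the
plain-buf MP pattern above: when r1 = 1, U is linked to V by a
w-post-bounded ; vis ; r-pre-bounded sequence, so V ->fre U is
forbidden. V must read from U or a store coherence-after it, and
hence r2 must be 1.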
This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS
-------------

@@ -1962,6 +2481,16 @@ treated as READ_ONCE() and rcu_assign_pointer() is treated as
smp_store_release() -- which is basically how the Linux kernel treats
them.

Although we said that plain accesses are not linked by the ppo
relation, they do contribute to it indirectly. Namely, when there is
an address dependency from a marked load R to a plain store W,
followed by smp_wmb() and then a marked store W', the LKMM creates a
ppo link from R to W'. The reasoning behind this is perhaps a little
shaky, but essentially it says there is no way to generate object code
for this source code in which W' could execute before R. Just as with
pre-bounding by address dependencies, it is possible for the compiler
to undermine this relation if sufficient care is not taken.
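As an illustrative sketch of this pattern (hypothetical variables):

        int *ptr;
        int y;

        P0()
        {
                int *p;

                p = READ_ONCE(ptr);     /* marked load R */
                *p = 1;                 /* plain store W, address-
                                           dependent on R */
                smp_wmb();
                WRITE_ONCE(y, 1);       /* marked store W': the LKMM
                                           links R ->ppo W' */
        }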
There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock(). The LKMM uses fence events with special