Commit graph

16778 commits

Author SHA1 Message Date
Mel Gorman 989348b5fc sched/numa: Use unsigned longs for numa group fault stats
As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
the updates.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-63-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:23 +02:00
Rik van Riel de1c9ce6f0 sched/numa: Skip some page migrations after a shared fault
Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.

This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.

This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks to near the memory.

Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.

Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:21 +02:00
Rik van Riel 1e3646ffc6 mm: numa: Revert temporarily disabling of NUMA migration
With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:20 +02:00
Mel Gorman 930aa174fc sched/numa: Remove the numa_balancing_scan_period_reset sysctl
With scan rate adaptions based on whether the workload has properly
converged or not there should be no need for the scan period reset
hammer. Get rid of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:18 +02:00
Rik van Riel 04bb2f9475 sched/numa: Adjust scan rate in task_numa_placement
Adjust numa_scan_period in task_numa_placement, depending on how much
useful work the numa code can do. The more local faults there are in a
given scan window the longer the period (and hence the slower the scan rate)
during the next window. If there are excessive shared faults then the scan
period will decrease with the amount of scaling depending on whether the
ratio of shared/private faults. If the preferred node changes then the
scan rate is reset to recheck if the task is properly placed.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:16 +02:00
Mel Gorman 3e6a9418cf sched/numa: Take false sharing into account when adapting scan rate
Scan rate is altered based on whether shared/private faults dominated.
task_numa_group() may detect false sharing but that information is not
taken into account when adapting the scan rate. Take it into account.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-58-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:14 +02:00
Rik van Riel dabe1d9924 sched/numa: Be more careful about joining numa groups
Due to the way the pid is truncated, and tasks are moved between
CPUs by the scheduler, it is possible for the current task_numa_fault
to group together tasks that do not actually share memory together.

This patch adds a few easy sanity checks to task_numa_fault, joining
tasks together if they share the same tsk->mm, or if the fault was on
a page with an elevated mapcount, in a shared VMA.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:12 +02:00
Peter Zijlstra 0ec8aa00f2 sched/numa: Avoid migrating tasks that are placed on their preferred node
This patch classifies scheduler domains and runqueues into types depending
the number of tasks that are about their NUMA placement and the number
that are currently running on their preferred node. The types are

regular: There are tasks running that do not care about their NUMA
	placement.

remote: There are tasks running that care about their placement but are
	currently running on a node remote to their ideal placement

all: No distinction

To implement this the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of tasks running
that care about their placement (nr_numa_running). The load balancer
uses this information to avoid migrating idea placed NUMA tasks as long
as better options for load balancing exists. For example, it will not
consider balancing between a group whose tasks are all perfectly placed
and a group with remote tasks.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:10 +02:00
Rik van Riel ca28aa53dd sched/numa: Fix task or group comparison
This patch separately considers task and group affinities when
searching for swap candidates during NUMA placement. If tasks
are part of the same group, or no group at all, the task weights
are considered.

Some hysteresis is added to prevent tasks within one group from
getting bounced between NUMA nodes due to tiny differences.

If tasks are part of different groups, the code compares group
weights, in order to favor grouping task groups together.

The patch also changes the group weight multiplier to be the
same as the task weight multiplier, since the two are no longer
added up like before.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-55-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:08 +02:00
Rik van Riel 887c290e82 sched/numa: Decide whether to favour task or group weights based on swap candidate relationships
This patch separately considers task and group affinities when searching
for swap candidates during task NUMA placement. If tasks are not part of
a group or the same group then the task weights are considered.
Otherwise the group weights are compared.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-54-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:06 +02:00
Ingo Molnar b32e86b430 sched/numa: Add debugging
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1381141781-10992-53-git-send-email-mgorman@suse.de
2013-10-09 14:48:04 +02:00
Mel Gorman 7dbd13ed06 sched/numa: Prevent parallel updates to group stats during placement
Having multiple tasks in a group go through task_numa_placement
simultaneously can lead to a task picking a wrong node to run on, because
the group stats may be in the middle of an update. This patch avoids
parallel updates by holding the numa_group lock during placement
decisions.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-52-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:02 +02:00
Rik van Riel 82727018b0 sched/numa: Call task_numa_free() from do_execve()
It is possible for a task in a numa group to call exec, and
have the new (unrelated) executable inherit the numa group
association from its former self.

This has the potential to break numa grouping, and is trivial
to fix.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-51-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:00 +02:00
Mel Gorman 83e1d2cd9e sched/numa: Use group fault statistics in numa placement
This patch uses the fraction of faults on a particular node for both task
and group, to figure out the best node to place a task.  If the task and
group statistics disagree on what the preferred node should be then a full
rescan will select the node with the best combined weight.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-50-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:58 +02:00
Rik van Riel 5e1576ed0e sched/numa: Stay on the same node if CLONE_VM
A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:57 +02:00
Peter Zijlstra 6688cc0547 mm: numa: Do not group on RO pages
And here's a little something to make sure not the whole world ends up
in a single group.

As while we don't migrate shared executable pages, we do scan/fault on
them. And since everybody links to libc, everybody ends up in the same
group.

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:53 +02:00
Mel Gorman e29cf08b05 sched/numa: Report a NUMA task group ID
It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:49 +02:00
Peter Zijlstra 8c8a743c50 sched/numa: Use {cpu, pid} to create task groups for shared faults
While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently tasks that partition their data can still "false-share"
pages presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:47 +02:00
Peter Zijlstra 90572890d2 mm: numa: Change page last {nid,pid} into {cpu,pid}
Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and lookup the alternate task more
easily. Note that even though it is the cpu that is store in the page
flags that the mpol_misplaced decision is still based on the node.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
[ Fixed build failure on 32-bit systems. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:45 +02:00
Rik van Riel e1dda8a797 sched/numa: Fix placement of workloads spread across multiple nodes
The load balancer will spread workloads across multiple NUMA nodes,
in order to balance the load on the system. This means that sometimes
a task's preferred node has available capacity, but moving the task
there will not succeed, because that would create too large an imbalance.

In that case, other NUMA nodes need to be considered.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-42-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:43 +02:00
Mel Gorman 2c8a50aa87 sched/numa: Favor placing a task on the preferred node
A tasks preferred node is selected based on the number of faults
recorded for a node but the actual task_numa_migate() conducts a global
search regardless of the preferred nid. This patch checks if the
preferred nid has capacity and if so, searches for a CPU within that
node. This avoids a global search when the preferred node is not
overloaded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-41-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:41 +02:00
Mel Gorman fb13c7ee0e sched/numa: Use a system-wide search to find swap/migration candidates
This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit, however it
doesn't properly consider total node balance.

In the old scheme a task selected a preferred node based on the highest
number of private faults recorded on the node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes then task_numa_migrate will search the whole system looking
for tasks to swap with that would improve both the overall compute
balance and minimise the expected number of remote NUMA hinting faults.

Not there is no guarantee that the node the source task is placed
on by task_numa_migrate() has any relationship to the newly selected
task->numa_preferred_nid due to compute overloading.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Do not swap with tasks that cannot run on source cpu]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Fixed compiler warning on UP. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:25 +02:00
Peter Zijlstra ac66f54772 sched/numa: Introduce migrate_swap()
Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:46 +02:00
Peter Zijlstra 1be0bd77c5 stop_machine: Introduce stop_two_cpus()
Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two cpus which we can do with on-stack structures and avoid
machine wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If unordered then
two cpus calling stop_two_cpus on each other simultaneously would attempt
to queue in the opposite order on each CPU causing an AB-BA style deadlock.
By always having the lowest number CPU doing the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Implemented deadlock avoidance. ]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-38-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:45 +02:00
Mel Gorman 4591ce4f2d sched/numa: Do not trap hinting faults for shared libraries
NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multipl
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information etc. If we are never going to migrate
the page, it is overhead for no gain and worse a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-36-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:42 +02:00
Rik van Riel 06ea5e035b sched/numa: Increment numa_migrate_seq when task runs in correct location
When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task is settled if migration is
temporarily disabled, and memory should migrate towards it.

Signed-off-by: Rik van Riel <riel@redhat.com>
[ Only increment migrate_seq if migration temporarily disabled. ]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-35-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:41 +02:00
Mel Gorman 6b9a7460b6 sched/numa: Retry migration of tasks to CPU on a preferred node
When a preferred node is selected for a tasks there is an attempt to migrate
the task to a CPU there. This may fail in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-34-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:40 +02:00
Mel Gorman 58d081b508 sched/numa: Avoid overloading CPUs on a preferred NUMA node
This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for using when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destimation nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-33-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:39 +02:00
Mel Gorman fc3147245d mm: numa: Limit NUMA scanning to migrate-on-fault VMAs
There is a 90% regression observed with a large Oracle performance test
on a 4 node system. Profiles indicated that the overhead was due to
contention on sp_lock when looking up shared memory policies. These
policies do not have the appropriate flags to allow them to be
automatically balanced so trapping faults on them is pointless. This
patch skips VMAs that do not have MPOL_F_MOF set.

[riel@redhat.com: Initial patch]

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-and-tested-by: Joe Mario <jmario@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:38 +02:00
Rik van Riel 6fe6b2d6da sched/numa: Do not migrate memory immediately after switching node
The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
tasks working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node the schduler may decide to
reset the preferred node and migrate the task back resulting in more
migrations.

The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint but this may result nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all the tasks working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:36 +02:00
Mel Gorman b795854b1f sched/numa: Set preferred NUMA node based on number of private faults
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that in general the private accesses
are more important for cpu->memory locality in the general case. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are a high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Fix complication error when !NUMA_BALANCING. ]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:35 +02:00
Mel Gorman 073b5beea7 sched/numa: Remove check that skips small VMAs
task_numa_work skips small VMAs. At the time the logic was to reduce the
scanning overhead which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-29-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:33 +02:00
Mel Gorman 9ff1d9ff3c sched/numa: Check current->mm before allocating NUMA faults
task_numa_placement checks current->mm but after buffers for faults
have already been uselessly allocated. Move the check earlier.

[peterz@infradead.org: Identified the problem]

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-27-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:31 +02:00
Mel Gorman ac8e895bd2 sched/numa: Add infrastructure for split shared/private accounting of NUMA hinting faults
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared.  This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:30 +02:00
Mel Gorman e6628d5b0a sched/numa: Reschedule task on preferred NUMA node once selected
A preferred node is selected based on the node the most NUMA hinting
faults was incurred on. There is no guarantee that the task is running
on that node at the time so this patch rescheules the task to run on
the most idle CPU of the selected node when selected. This avoids
waiting for the balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:28 +02:00
Mel Gorman 7a0f308337 sched/numa: Resist moving tasks towards nodes with fewer hinting faults
Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with lower faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:27 +02:00
Mel Gorman 3a7053b322 sched/numa: Favour moving tasks towards the preferred node
This patch favours moving tasks towards NUMA node that recorded a higher
number of NUMA faults during active load balancing.  Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the nodes CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a PTE scans which is controlled by the numa_balancing_settle_count
sysctl. Once the settle_count number of scans has complete the schedule
is free to place the task on an alternative node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:26 +02:00
Mel Gorman 745d61476d sched/numa: Update NUMA hinting faults once per scan
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artifically high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:25 +02:00
Mel Gorman 688b7585d1 sched/numa: Select a preferred node with the most numa hinting faults
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:23 +02:00
Mel Gorman f809ca9a55 sched/numa: Track NUMA hinting faults on per-node basis
This patch tracks what nodes numa hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:22 +02:00
Mel Gorman f307cd1a32 sched/numa: Slow scan rate if no NUMA hinting faults are being recorded
NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
was migrated. For long-lived but idle processes there may be no faults
but the scan rate will be high and just waste CPU. This patch will slow
the scan rate for processes that are not trapping faults.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:21 +02:00
Mel Gorman 598f0ec0bc sched/numa: Set the scan rate proportional to the memory usage of the task being scanned
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is is partially a function of the tasks memory
size. This patch alters the semantic of the min and max tunables to be
about tuning the length time it takes to complete a scan of a tasks occupied
virtual address space. Conceptually this is a lot easier to understand. There
is a "sanity" check to ensure the scan rate is never extremely fast based on
the amount of virtual memory that should be scanned in a second. The default
of 2.5G seems arbitrary but it is to have the maximum scan rate after the
patch roughly match the maximum scan rate before the patch was applied.

On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
to numa_scan_period means that the rate scanning slows depends on HZ which
is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:20 +02:00
Mel Gorman 7e8d16b6cb sched/numa: Initialise numa_next_scan properly
Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:19 +02:00
Mel Gorman b726b7dfb4 Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"
PTE scanning and NUMA hinting fault handling is expensive so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case that
this may never happen and no NUMA hinting fault information will be
captured. We are not ruling out the possibility that something better
can be done here but for now, this patch needs to be reverted and depend
entirely on the scan_delay to avoid punishing short-lived processes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:17 +02:00
Peter Zijlstra 9e645ab6d0 sched/numa: Continue PTE scanning even if migrate rate limited
Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate
limited sees like a bad idea. Even if this node can't migrate anymore other
nodes might and we want up-to-date information to do balance decisions.
We already rate limit the actual migrations, this should leave enough
bandwidth to allow the non-migrating scanning. I think its important we
keep up-to-date information if we're going to do placement based on it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:09 +02:00
Peter Zijlstra 19a78d110d sched/numa: Mitigate chance that same task always updates PTEs
With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that of a 4 thread process, its always the
same task winning the race and doing the protection change.

This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If its
always the same task the ->numa_faults[] get severely skewed.

Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.

Before:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
      thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
      thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
      thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
      thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
      thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
      thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
      thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
      thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
      thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
      thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
      thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
      thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
      thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
      thread 0/0-3232  [022] ....   214.209342: task_numa_work: working

After:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
      thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
      thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
      thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
      thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
      thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
      thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
      thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
      thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
      thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
      thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
      thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
      thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
      thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
      thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
      thread 0/3-3256  [024] ....   138.267207: task_numa_work: working

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:39:56 +02:00
Peter Zijlstra c69307d533 sched/numa: Fix comments
Fix a 80 column violation and a PTE vs PMD reference.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:39:30 +02:00
Ingo Molnar 37bf06375c Linux 3.12-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSUc9zAAoJEHm+PkMAQRiG9DMH/AtpuAF6LlMRPjrCeuJQ1pyh
 T0IUO+CsLKO6qtM5IyweP8V6zaasNjIuW1+B6IwVIl8aOrM+M7CwRiKvpey26ldM
 I8G2ron7hqSOSQqSQs20jN2yGAqQGpYIbTmpdGLAjQ350NNNvEKthbP5SZR5PAmE
 UuIx5OGEkaOyZXvCZJXU9AZkCxbihlMSt2zFVxybq2pwnGezRUYgCigE81aeyE0I
 QLwzzMVdkCxtZEpkdJMpLILAz22jN4RoVDbXRa2XC7dA9I2PEEXI9CcLzqCsx2Ii
 8eYS+no2K5N2rrpER7JFUB2B/2X8FaVDE+aJBCkfbtwaYTV9UYLq3a/sKVpo1Cs=
 =xSFJ
 -----END PGP SIGNATURE-----

Merge tag 'v3.12-rc4' into sched/core

Merge Linux v3.12-rc4 to fix a conflict and also to refresh the tree
before applying more scheduler patches.

Conflicts:
	arch/avr32/include/asm/Kbuild

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:36:13 +02:00
Linus Torvalds 0e7a3ed04f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixlets:

  On the kernel side:

   - fix a race
   - fix a bug in the handling of the perf ring-buffer data page

  On the tooling side:

   - fix the handling of certain corrupted perf.data files
   - fix a bug in 'perf probe'
   - fix a bug in 'perf record + perf sched'
   - fix a bug in 'make install'
   - fix a bug in libaudit feature-detection on certain distros"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf session: Fix infinite loop on invalid perf.data file
  perf tools: Fix installation of libexec components
  perf probe: Fix to find line information for probe list
  perf tools: Fix libaudit test
  perf stat: Set child_pid after perf_evlist__prepare_workload()
  perf tools: Add default handler for mmap2 events
  perf/x86: Clean up cap_user_time* setting
  perf: Fix perf_pmu_migrate_context
2013-10-08 09:23:12 -07:00
Shawn Bohrer 6bfa687c19 sched/rt: Remove redundant nr_cpus_allowed test
In 76854c7e8f ("sched: Use
rt.nr_cpus_allowed to recover select_task_rq() cycles") an
optimization was added to select_task_rq_rt() that immediately
returns when p->nr_cpus_allowed == 1 at the beginning of the
function.

This makes the latter p->nr_cpus_allowed > 1 check redundant,
which can now be removed.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: tomk@rgmadvisors.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-06 11:28:40 +02:00
Linus Torvalds 7dee8dff47 ACPI and power management fixes for 3.12-rc4
1) The resume part of user space driven hibernation (s2disk) is now
     broken after the change that moved the creation of memory bitmaps
     to after the freezing of tasks, because I forgot that the resume
     utility loaded the image before freezing tasks and needed the
     bitmaps for that.  The fix adds special handling for that case.
 
  2) One of recent commits changed the export of acpi_bus_get_device()
     to EXPORT_SYMBOL_GPL(), which was technically correct but broke
     existing binary modules using that function including one in
     particularly widespread use.  Change it back to EXPORT_SYMBOL().
 
  3) The intel_pstate driver sometimes fails to disable turbo if its
     no_turbo sysfs attribute is set.  Fix from Srinivas Pandruvada.
 
  4) One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0
     which still (incorrectly) treats non-NULL as non-error.  Fix from
     Philipp Zabel.
 
  5) The SPEAr cpufreq driver uses a wrong variable type in one place
     preventing it from catching errors returned by one of the functions
     called by it.  Fix from Sachin Kamat.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSTvXfAAoJEKhOf7ml8uNslkkP+QGoghnGR9hScYq/0Mcnzr4b
 kwkiRx54NggjzzN8Q+ejZmxNZ7UZt3q05PmHPtJk3A8gzqIMsb83jnXsZNiDiQs6
 m+KBYrV5dhPZkp08X2tHJp5ijZNRULpp9QA49ulnLfVT/A+rkr5xBCK0W3ln/zL3
 tJSlGJ3N7yYUXe3nMRCCNnnnAzWA+Tk8yRaMx5MnFqlQWWnyx1SGKjD/kVv0/3RA
 6rlDPQEIuoCTqLKotnGIqVN2hTFPFJKc9yTrRGZ15pMjdUGHMwnHy6KMAdXy4Rdh
 R1DOdf+bvPkkFiGE1D1vKOt7pdOG/cTtNkppvWZRuoGg2AMJGm5KWlrdLhlvunyt
 IQXmdt/eWecNr+WzN8FiDp4LEQcI6VjEDaJ3qbjXHLH/FOupBKXYoNWpejj4bGSE
 PtPmJYjNpD2vF3cdtt80ZAYSxhLutwPQksoAwyJ40++l53Ygi81BO31LWZQnDk/8
 HPWOXFThmWJtT03b0sG25GpboiCpYtHEmbwQe+y+pRx7L12HBfE4StT3hmv5Z9J4
 RXXB3yNq4ApXtFq1mitpiPmSVfYe+zu590m7ZUr457BpXi7MH17tzDn9nUJ2eTZl
 kXwUNWiRKGjPmKYxV/ml/apClozsGMFP+XoZkYotFd0W5+SVLuhdXdtClIt4NAbD
 dUkYVMm/BBBALpmH+yKw
 =P4mh
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:

 - The resume part of user space driven hibernation (s2disk) is now
   broken after the change that moved the creation of memory bitmaps to
   after the freezing of tasks, because I forgot that the resume utility
   loaded the image before freezing tasks and needed the bitmaps for
   that.  The fix adds special handling for that case.

 - One of recent commits changed the export of acpi_bus_get_device() to
   EXPORT_SYMBOL_GPL(), which was technically correct but broke existing
   binary modules using that function including one in particularly
   widespread use.  Change it back to EXPORT_SYMBOL().

 - The intel_pstate driver sometimes fails to disable turbo if its
   no_turbo sysfs attribute is set.  Fix from Srinivas Pandruvada.

 - One of recent cpufreq fixes forgot to update a check in cpufreq-cpu0
   which still (incorrectly) treats non-NULL as non-error.  Fix from
   Philipp Zabel.

 - The SPEAr cpufreq driver uses a wrong variable type in one place
   preventing it from catching errors returned by one of the functions
   called by it.  Fix from Sachin Kamat.

* tag 'pm+acpi-3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: Use EXPORT_SYMBOL() for acpi_bus_get_device()
  intel_pstate: fix no_turbo
  cpufreq: cpufreq-cpu0: NULL is a valid regulator, part 2
  cpufreq: SPEAr: Fix incorrect variable type
  PM / hibernate: Fix user space driven resume regression
2013-10-04 15:03:42 -07:00
Andi Kleen fdfbbd07e9 perf: Add generic transaction flags
Add a generic qualifier for transaction events, as a new sample
type that returns a flag word. This is particularly useful
for qualifying aborts: to distinguish aborts which happen
due to asynchronous events (like conflicts caused by another
CPU) versus instructions that lead to an abort.

The tuning strategies are very different for those cases,
so it's important to distinguish them easily and early.

Since it's inconvenient and inflexible to filter for this
in the kernel we report all the events out and allow
some post processing in user space.

The flags are based on the Intel TSX events, but should be fairly
generic and mostly applicable to other HTM architectures too. In addition
to various flag words there's also reserved space to report an
program supplied abort code. For TSX this is used to distinguish specific
classes of aborts, like a lock busy abort when doing lock elision.

Flags:

Elision and generic transactions 		   (ELISION vs TRANSACTION)
(HLE vs RTM on TSX; IBM etc.  would likely only use TRANSACTION)
Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
Retryable transaction				   (RETRY)
Conflicts with other threads			   (CONFLICT)
Transaction write capacity overflow		   (CAPACITY WRITE)
Transaction read capacity overflow		   (CAPACITY READ)

Transactions implicitely aborted can also return an abort code.
This can be used to signal specific events to the profiler. A common
case is abort on lock busy in a RTM eliding library (code 0xff)
To handle this case we include the TSX abort code

Common example aborts in TSX would be:

- Data conflict with another thread on memory read.
                                      Flags: TRANSACTION|ASYNC|CONFLICT
- executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
- HLE transaction in user space is too large
                                      Flags: ELISION|SYNC|CAPACITY-WRITE

The only flag that is somewhat TSX specific is ELISION.

This adds the perf core glue needed for reporting the new flag word out.

v2: Add MEM/MISC
v3: Move transaction to the end
v4: Separate capacity-read/write and remove misc
v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
    transaction to txn
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379688044-14173-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-04 10:06:08 +02:00
Knut Petersen 723478c8a4 perf: Enforce 1 as lower limit for perf_event_max_sample_rate
/proc/sys/kernel/perf_event_max_sample_rate will accept
negative values as well as 0.

Negative values are unreasonable, and 0 causes a
divide by zero exception in perf_proc_update_handler.

This patch enforces a lower limit of 1.

Signed-off-by: Knut Petersen <Knut_Petersen@t-online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-04 10:06:07 +02:00
Peter Zijlstra 9886167d20 perf: Fix perf_pmu_migrate_context
While auditing the list_entry usage due to a trinity bug I found that
perf_pmu_migrate_context violates the rules for
perf_event::event_entry.

The problem is that perf_event::event_entry is a RCU list element, and
hence we must wait for a full RCU grace period before re-using the
element after deletion.

Therefore the usage in perf_pmu_migrate_context() which re-uses the
entry immediately is broken. For now introduce another list_head into
perf_event for this specific usage.

This doesn't actually fix the trinity report because that never goes
through this code.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-mkj72lxagw1z8fvjm648iznw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-04 09:58:53 +02:00
Mike Travis 8daaa5f826 kdb: Add support for external NMI handler to call KGDB/KDB
This patch adds a kgdb_nmicallin() interface that can be used by
external NMI handlers to call the KGDB/KDB handler.  The primary
need for this is for those types of NMI interrupts where all the
CPUs have already received the NMI signal.  Therefore no
send_IPI(NMI) is required, and in fact it will cause a 2nd
unhandled NMI to occur. This generates the "Dazed and Confuzed"
messages.

Since all the CPUs are getting the NMI at roughly the same time,
it's not guaranteed that the first CPU that hits the NMI handler
will manage to enter KGDB and set the dbg_master_lock before the
slaves start entering. The new argument "send_ready" was added
for KGDB to signal the NMI handler to release the slave CPUs for
entry into KGDB.

Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Hedi Berriche <hedi@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/20131002151417.928886849@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-03 18:47:54 +02:00
Ingo Molnar 68e9074028 Merge branch 'clockevents/3.13' of git://git.linaro.org/people/dlezcano/linux into timers/core
Pull (mostly) ARM clocksource driver updates from Daniel Lezcano:

" - Soren Brinkmann added FEAT_PERCPU to a clock device when it is local
    per cpu. This feature prevents the clock framework to choose a per cpu
    timer as a broadcast timer. This problem arised when the ARM global
    timer is used when switching to the broadcast timer which is the case
    now on Xillinx with its cpuidle driver.

  - Stephen Boyd extended the generic sched_clock code to support 64bit
    counters and removes the setup_sched_clock deprecation, as that causes
    lots of warnings since there's still users in the arch/arm tree. He
    added also the CLOCK_SOURCE_SUSPEND_NONSTOP flag on the architected
    timer as they continue counting during suspend.

  - Uwe Kleine-König added some missing __init sections and consolidated the
    code by moving the of_node_put call from the drivers to the function
    clocksource_of_init. "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-03 07:57:02 +02:00
Ingo Molnar 19f29887a7 Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core
Merge updated full dynticks support from Frederic Weisbecker:

   - support 32-bit systems (full dynticks was 64-bit only before)
   - support ARM

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-03 07:53:25 +02:00
Ingo Molnar 6c09f6d830 Linux 3.12-rc3
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSSKOHAAoJEHm+PkMAQRiGeREH/3EqHmJPBzmVoJwR9/ykDoLg
 u+TJTkuxZG220WhgXS7W/0ECyBX0U7yA0bY9PZbqgcdiLjY0veR18/pOhEq5RzHq
 ub8Q+AJdiORF/sq268q7gnNmy3rSCgnrAyHA/bzBtkbisYODwZPYvWQVUjgNZ2dW
 qtW/TE9rjANcUrk8WdOu9oWcwsq4cyG3cscbfHE/JLFy/8tB5GoD158gxKLZsLXk
 uTCeUHMmvFRT56fZwfyvNstA8ozxXcHBmuu6+Ttceky2zeGzp6dOrd+d2SU1Ps3O
 P91x4e/Af4RFEwDczGP6TpSBEf/J/JaqrM1drjhnQHho0hrNRZVUXhADFVADCXY=
 =dOjB
 -----END PGP SIGNATURE-----

Merge tag 'v3.12-rc3' into timers/core

Merge Linux 3.12-rc3 - refresh the tree with the latest fixes before merging new bits.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-03 07:52:21 +02:00
Soren Brinkmann 245a349626 tick: broadcast: Deny per-cpu clockevents from being broadcast sources
On most ARM systems the per-cpu clockevents are truly per-cpu in
the sense that they can't be controlled on any other CPU besides
the CPU that they interrupt. If one of these clockevents were to
become a broadcast source we will run into a lot of trouble
because the broadcast source is enabled on the first CPU to go
into deep idle (if that CPU suffers from FEAT_C3_STOP) and that
could be a different CPU than what the clockevent is interrupting
(or even worse the CPU that the clockevent interrupts could be
offline).

Theoretically it's possible to support per-cpu clockevents as the
broadcast source but so far we haven't needed this and supporting
it is rather complicated. Let's just deny the possibility for now
until this becomes a reality (let's hope it never does!).

Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Michal Simek <michal.simek@xilinx.com>
2013-10-02 11:34:06 +02:00
Ingo Molnar 0d119fb576 Merge branch 'irq/urgent-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into irq/urgent
Pull a hardirq-nesting fix from Frederic Weisbecker.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-02 07:53:01 +02:00
Frederic Weisbecker cc1f027454 irq: Optimize softirq stack selection in irq exit
If irq_exit() is called on the arch's specified irq stack,
it should be safe to run softirqs inline under that same
irq stack as it is near empty by the time we call irq_exit().

For example if we use the same stack for both hard and soft irqs here,
the worst case scenario is:
hardirq -> softirq -> hardirq. But then the softirq supersedes the
first hardirq as the stack user since irq_exit() is called in
a mostly empty stack. So the stack merge in this case looks acceptable.

Stack overrun still have a chance to happen if hardirqs have more
opportunities to nest, but then it's another problem to solve.

So lets adapt the irq exit's softirq stack on top of a new Kconfig symbol
that can be defined when irq_exit() runs on the irq stack. That way
we can spare some stack switch on irq processing and all the cache
issues that come along.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:53:27 +02:00
Frederic Weisbecker 0bed698a33 irq: Justify the various softirq stack choices
For clarity, comment the various stack choices for softirqs
processing, whether we execute them from ksoftirqd or
local_irq_enable() calls.

Their use on irq_exit() is already commented.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:53:27 +02:00
Frederic Weisbecker 5d60d3e7c0 irq: Improve a bit softirq debugging
do_softirq() has a debug check that verifies that it is not nesting
on softirqs processing, nor miscounting the softirq part of the preempt
count.

But making sure that softirqs processing don't nest is actually a more
generic concern that applies to any caller of __do_softirq().

Do take it one step further and generalize that debug check to
any softirq processing.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:53:26 +02:00
Frederic Weisbecker be6e101644 irq: Optimize call to softirq on hardirq exit
Before processing softirqs on hardirq exit, we already
do the check for pending softirqs while hardirqs are
guaranteed to be disabled.

So we can take a shortcut and safely jump to the arch
specific implementation directly.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:53:25 +02:00
Frederic Weisbecker 7d65f4a655 irq: Consolidate do_softirq() arch overriden implementations
All arch overriden implementations of do_softirq() share the following
common code: disable irqs (to avoid races with the pending check),
check if there are softirqs pending, then execute __do_softirq() on
a specific stack.

Consolidate the common parts such that archs only worry about the
stack switch.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:53:25 +02:00
Frederic Weisbecker ded7975475 irq: Force hardirq exit's softirq processing on its own stack
The commit facd8b80c6
("irq: Sanitize invoke_softirq") converted irq exit
calls of do_softirq() to __do_softirq() on all architectures,
assuming it was only used there for its irq disablement
properties.

But as a side effect, the softirqs processed in the end
of the hardirq are always called on the inline current
stack that is used by irq_exit() instead of the softirq
stack provided by the archs that override do_softirq().

The result is mostly safe if the architecture runs irq_exit()
on a separate irq stack because then softirqs are processed
on that same stack that is near empty at this stage (assuming
hardirq aren't nesting).

Otherwise irq_exit() runs in the task stack and so does the softirq
too. The interrupted call stack can be randomly deep already and
the softirq can dig through it even further. To add insult to the
injury, this softirq can be interrupted by a new hardirq, maximizing
the chances for a stack overrun as reported in powerpc for example:

	do_IRQ: stack overflow: 1920
	CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
	Call Trace:
	[c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
	[c0000000050a8810] .dump_stack+0x28/0x3c
	[c0000000050a8880] .do_IRQ+0x2b8/0x2c0
	[c0000000050a8930] hardware_interrupt_common+0x154/0x180
	--- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
		LR = .cp_start_xmit+0x390/0x820 [8139cp]
	[c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
	[c0000000050a8e00] .sch_direct_xmit+0x110/0x260
	[c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
	[c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
	[c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
	[c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
	[c0000000050a9130] .dev_queue_xmit+0x428/0x630
	[c0000000050a91d0] .ip_finish_output+0x2a4/0x550
	[c0000000050a9290] .ip_local_out+0x50/0x70
	[c0000000050a9310] .ip_queue_xmit+0x148/0x420
	[c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
	[c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
	[c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
	[c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
	[c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
	[c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
	[c0000000050a9840] .ip_rcv_finish+0x148/0x400
	[c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
	[c0000000050a99d0] .netif_receive_skb+0x44/0x110
	[c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
	[c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
	[c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
	[c0000000050a9c70] .nf_iterate+0x114/0x130
	[c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
	[c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
	[c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
	[c0000000050a9fa0] .netif_receive_skb+0x44/0x110
	[c0000000050aa040] .napi_gro_receive+0xe8/0x120
	[c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
	[c0000000050aa1d0] .net_rx_action+0x1dc/0x310
	[c0000000050aa2b0] .__do_softirq+0x158/0x330
	[c0000000050aa3b0] .irq_exit+0xc8/0x110
	[c0000000050aa430] .do_IRQ+0xdc/0x2c0
	[c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
	 --- Exception: 501 at .bad_range+0x1c/0x110
		 LR = .get_page_from_freelist+0x908/0xbb0
	[c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
	[c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
	[c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
	[c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
	[c0000000050aac60] .handle_pte_fault+0x814/0xb70
	[c0000000050aad50] .__get_user_pages+0x1a4/0x640
	[c0000000050aae60] .get_user_pages_fast+0xec/0x160
	[c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
	[c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
	[c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
	[c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
	[c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
	[c0000000050ab320]  kvm_start_lightweight+0x1ec/0x1fc [kvm]
	[c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
	[c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
	[c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
	[c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
	[c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
	[c0000000050abd80] .SyS_ioctl+0xd4/0xf0
	[c0000000050abe30] syscall_exit+0x0/0x98

Since this is a regression, this patch proposes a minimalistic
and low-risk solution by blindly forcing the hardirq exit processing of
softirqs on the softirq stack. This way we should reduce significantly
the opportunities for task stack overflow dug by softirqs.

Longer term solutions may involve extending the hardirq stack coverage to
irq_exit(), etc...

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: #3.9.. <stable@vger.kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-10-01 12:39:08 +02:00
Borislav Petkov a17bce4d1d x86/boot: Further compress CPUs bootup message
Turn it into (for example):

[    0.073380] x86: Booting SMP configuration:
[    0.074005] .... node   #0, CPUs:          #1   #2   #3   #4   #5   #6   #7
[    0.603005] .... node   #1, CPUs:     #8   #9  #10  #11  #12  #13  #14  #15
[    1.200005] .... node   #2, CPUs:    #16  #17  #18  #19  #20  #21  #22  #23
[    1.796005] .... node   #3, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31
[    2.393005] .... node   #4, CPUs:    #32  #33  #34  #35  #36  #37  #38  #39
[    2.996005] .... node   #5, CPUs:    #40  #41  #42  #43  #44  #45  #46  #47
[    3.600005] .... node   #6, CPUs:    #48  #49  #50  #51  #52  #53  #54  #55
[    4.202005] .... node   #7, CPUs:    #56  #57  #58  #59  #60  #61  #62  #63
[    4.811005] .... node   #8, CPUs:    #64  #65  #66  #67  #68  #69  #70  #71
[    5.421006] .... node   #9, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79
[    6.032005] .... node  #10, CPUs:    #80  #81  #82  #83  #84  #85  #86  #87
[    6.648006] .... node  #11, CPUs:    #88  #89  #90  #91  #92  #93  #94  #95
[    7.262005] .... node  #12, CPUs:    #96  #97  #98  #99 #100 #101 #102 #103
[    7.865005] .... node  #13, CPUs:   #104 #105 #106 #107 #108 #109 #110 #111
[    8.466005] .... node  #14, CPUs:   #112 #113 #114 #115 #116 #117 #118 #119
[    9.073006] .... node  #15, CPUs:   #120 #121 #122 #123 #124 #125 #126 #127
[    9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version which he went at great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: huawei.libin@huawei.com
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130930095624.GB16383@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-01 10:52:30 +02:00
Oleg Nesterov 314a8ad0f1 pidns: fix free_pid() to handle the first fork failure
"case 0" in free_pid() assumes that disable_pid_allocation() should
clear PIDNS_HASH_ADDING before the last pid goes away.

However this doesn't happen if the first fork() fails to create the
child reaper which should call disable_pid_allocation().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30 14:31:03 -07:00
Tetsuo Handa 4c1c7be95c kernel/kmod.c: check for NULL in call_usermodehelper_exec()
If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
dereference happens upon core dump because argv_split("") returns
argv[0] == NULL.

This bug was once fixed by commit 264b83c07a ("usermodehelper: check
subprocess_info->path != NULL") but was by error reintroduced by commit
7f57cfa4e2 ("usermodehelper: kill the sub_info->path[0] check").

This bug seems to exist since 2.6.19 (the version which core dump to
pipe was added).  Depending on kernel version and config, some side
effect might happen immediately after this oops (e.g.  kernel panic with
2.6.32-358.18.1.el6).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30 14:31:02 -07:00
Rafael J. Wysocki aab1728915 PM / hibernate: Fix user space driven resume regression
Recent commit 8fd37a4 (PM / hibernate: Create memory bitmaps after
freezing user space) broke the resume part of the user space driven
hibernation (s2disk), because I forgot that the resume utility
loaded the image into memory without freezing user space (it still
freezes tasks after loading the image).  This means that during user
space driven resume we need to create the memory bitmaps at the
"device open" time rather than at the "freeze tasks" time, so make
that happen (that's a special case anyway, so it needs to be treated
in a special way).

Reported-and-tested-by: Ronald <ronald645@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-09-30 19:40:56 +02:00
Kevin Hilman ff3fb25412 nohz: Drop generic vtime obsolete dependency on CONFIG_64BIT
The CONFIG_64BIT requirement on vtime can finally be removed
since we now depend on HAVE_VIRT_CPU_ACCOUNTING_GEN which
already takes care of the arch ability to handle nsecs based
cputime_t safely.

Signed-off-by: Kevin Hilman <khilman@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Arm Linux <linux-arm-kernel@lists.infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-09-30 15:37:01 +02:00
Kevin Hilman 554b0004d0 vtime: Add HAVE_VIRT_CPU_ACCOUNTING_GEN Kconfig
With VIRT_CPU_ACCOUNTING_GEN, cputime_t becomes 64-bit. In order
to use that feature, arch code should be audited to ensure there are no
races in concurrent read/write of cputime_t. For example,
reading/writing 64-bit cputime_t on some 32-bit arches may require
multiple accesses for low and high value parts, so proper locking
is needed to protect against concurrent accesses.

Therefore, add CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN which arches can
enable after they've been audited for potential races.

This option is automatically enabled on 64-bit platforms.

Feature requested by Frederic Weisbecker.

Signed-off-by: Kevin Hilman <khilman@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Arm Linux <linux-arm-kernel@lists.infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-09-30 15:35:53 +02:00
Greg Kroah-Hartman 88502b9c0a Merge 3.12-rc3 into driver-core-next
We want the driver core and sysfs fixes in here to make merges and
development easier.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-29 18:29:23 -07:00
Linus Torvalds 669fc2f0c7 Merge branches 'sched-urgent-for-linus', 'timers-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler, timer and x86 fixes from Ingo Molnar:
 - A context tracking ARM build and functional fix
 - A handful of ARM clocksource/clockevent driver fixes
 - An AMD microcode patch level sysfs reporting fixlet

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arm: Fix build error with context tracking calls

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
  clocksource: of: Respect device tree node status
  clocksource: exynos_mct: Set IRQ affinity when the CPU goes online
  arm: clocksource: mvebu: Use the main timer as clock source from DT

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Fix patch level reporting for family 15h
2013-09-28 14:22:17 -07:00
Jean Delvare 3a126f85e0 kernel/params: fix handling of signed integer types
Commit 6072ddc852 ("kernel: replace strict_strto*() with kstrto*()")
broke the handling of signed integer types, fix it.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Cc: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-28 12:35:52 -07:00
Ingo Molnar 62d08aec6a Merge branch 'context_tracking/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/urgent
Pull context tracking ARM fix from Frederic Weisbecker.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-28 08:50:09 +02:00
Frederic Weisbecker 0c06a5d4b1 arm: Fix build error with context tracking calls
ad65782fba (context_tracking: Optimize main APIs off case
with static key) converted context tracking main APIs to inline
function and left ARM asm callers behind.

This can be easily fixed by making ARM calling the post static
keys context tracking function. We just need to replicate the
static key checks there. We'll remove these later when ARM will
support the context tracking static keys.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Anil Kumar <anilk4.v@gmail.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Benoit Cousson <b-cousson@ti.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Kevin Hilman <khilman@linaro.org>
2013-09-27 17:59:47 +02:00
Greg Kroah-Hartman 90826ca740 pmu_bus: convert bus code to use dev_groups
The dev_attrs field of struct bus_type is going away soon, dev_groups
should be used instead.  This converts the pmu bus code to use
the correct field.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 15:49:43 -07:00
Linus Torvalds 82dfaa58a7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three small fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/balancing: Fix cfs_rq->task_h_load calculation
  sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
  sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
2013-09-25 13:28:45 -07:00
Linus Torvalds bdc5663fa1 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Assorted standalone fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Add model number for Avoton Silvermont
  perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
  perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()
  perf: Update ABI comment
  tools lib lk: Uninclude linux/magic.h in debugfs.c
  perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms()
  perf probe: Fix finder to find lines of given function
  perf session: Check for SIGINT in more loops
  perf tools: Fix compile with libelf without get_phdrnum
  perf tools: Fix buildid cache handling of kallsyms with kcore
  perf annotate: Fix objdump line parsing offset validation
  perf tools: Fill in new definitions for madvise()/mmap() flags
  perf tools: Sharpen the libaudit dependencies test
2013-09-25 13:28:08 -07:00
Paul E. McKenney 5c173eb8bc rcu: Consistent rcu_is_watching() naming
The old rcu_is_cpu_idle() function is just __rcu_is_watching() with
preemption disabled.  This commit therefore renames rcu_is_cpu_idle()
to rcu_is_watching.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:45:06 -07:00
Paul E. McKenney f9ffc31ebd rcu: Change EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()
Commit e6b80a3b (rcu: Detect illegal rcu dereference in extended
quiescent state) exported the pre-existing rcu_is_cpu_idle() function
using EXPORT_SYMBOL().  However, this is inconsistent with the remaining
exports from RCU, which are all EXPORT_SYMBOL_GPL().  The current state
of affairs means that a non-GPL module could use rcu_is_cpu_idle(),
but in a CONFIG_TREE_PREEMPT_RCU=y kernel would be unable to invoke
rcu_read_lock() and rcu_read_unlock().

This commit therefore makes rcu_is_cpu_idle()'s export be consistent
with the rest of RCU, namely EXPORT_SYMBOL_GPL().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:44:56 -07:00
Paul E. McKenney cc6783f788 rcu: Is it safe to enter an RCU read-side critical section?
There is currently no way for kernel code to determine whether it
is safe to enter an RCU read-side critical section, in other words,
whether or not RCU is paying attention to the currently running CPU.
Given the large and increasing quantity of code shared by the idle loop
and non-idle code, the this shortcoming is becoming increasingly painful.

This commit therefore adds __rcu_is_watching(), which returns true if
it is safe to enter an RCU read-side critical section on the currently
running CPU.  This function is quite fast, using only a __this_cpu_read().
However, the caller must disable preemption.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:44:41 -07:00
Paul E. McKenney c337f8f58e rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacks
If a non-lazy callback arrives on a CPU that has previously gone idle
with no non-lazy callbacks, invoke_rcu_core() forces the RCU core to
run.  However, it does not update the conditions, which could result
in several closely spaced invocations of the RCU core, which in turn
could result in an excessively high context-switch rate and resulting
high overhead.

This commit therefore updates the ->all_lazy and ->nonlazy_posted_snap
fields to prevent closely spaced invocations.

Reported-by: Tibor Billes <tbilles@gmx.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tibor Billes <tbilles@gmx.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:44:33 -07:00
Paul E. McKenney c229828ca6 rcu: Throttle rcu_try_advance_all_cbs() execution
The rcu_try_advance_all_cbs() function is invoked on each attempted
entry to and every exit from idle.  If this function determines that
there are callbacks ready to invoke, the caller will invoke the RCU
core, which in turn will result in a pair of context switches.  If a
CPU enters and exits idle extremely frequently, this can result in
an excessive number of context switches and high CPU overhead.

This commit therefore causes rcu_try_advance_all_cbs() to throttle
itself, refusing to do work more than once per jiffy.

Reported-by: Tibor Billes <tbilles@gmx.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tibor Billes <tbilles@gmx.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:44:25 -07:00
Paul E. McKenney 7a497c963e rcu: Remove redundant code from rcu_cleanup_after_idle()
The rcu_try_advance_all_cbs() function returns a bool saying whether or
not there are callbacks ready to invoke, but rcu_cleanup_after_idle()
rechecks this regardless.  This commit therefore uses the value returned
by rcu_try_advance_all_cbs() instead of making rcu_cleanup_after_idle()
do this recheck.

Reported-by: Tibor Billes <tbilles@gmx.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Tibor Billes <tbilles@gmx.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-09-25 06:44:03 -07:00
Peter Zijlstra a233f1120c sched: Prepare for per-cpu preempt_count
When using per-cpu preempt_count variables we need to save/restore the
preempt_count on context switch (into per task storage; for instance
the old thread_info::preempt_count variable) because of
PREEMPT_ACTIVE.

However, this means that on fork() the preempt_count value of the last
context switch gets copied and if we had a PREEMPT_ACTIVE switch right
before cloning a child task the child task will now too have
PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
count.

Therefore we need to make init_task_preempt_count() unconditional;
this resets whatever preempt_count we inherited from our parent
process.

Doing so for !per-cpu implementations is harmless.

For !PREEMPT_COUNT kernels we need to be careful not to start life
with an increased preempt_count.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:55 +02:00
Peter Zijlstra bdb4380658 sched: Extract the basic add/sub preempt_count modifiers
Rewrite the preempt_count macros in order to extract the 3 basic
preempt_count value modifiers:

  __preempt_count_add()
  __preempt_count_sub()

and the new:

  __preempt_count_dec_and_test()

And since we're at it anyway, replace the unconventional
$op_preempt_count names with the more conventional preempt_count_$op.

Since these basic operators are equivalent to the previous _notrace()
variants, do away with the _notrace() versions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:54 +02:00
Peter Zijlstra 0102874755 sched: Create more preempt_count accessors
We need a few special preempt_count accessors:
 - task_preempt_count() for when we're interested in the preemption
   count of another (non-running) task.
 - init_task_preempt_count() for properly initializing the preemption
   count.
 - init_idle_preempt_count() a special case of the above for the idle
   threads.

With these no generic code ever touches thread_info::preempt_count
anymore and architectures could choose to remove it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-jf5swrio8l78j37d06fzmo4r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:52 +02:00
Peter Zijlstra f27dde8dee sched: Add NEED_RESCHED to the preempt_count
In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.

Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation, however we very much want to avoid making
preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:

 - resched_task() when setting TIF_NEED_RESCHED on the current task
 - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
                   remote task it follows it up with a reschedule IPI
                   and we can modify the cpu local preempt_count from
                   there.
 - cpu_idle_loop() for when resched_task() found tsk_is_polling().

We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.

Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previuos decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:49 +02:00
Peter Zijlstra 4a2b4b2227 sched: Introduce preempt_count accessor functions
Replace the single preempt_count() 'function' that's an lvalue with
two proper functions:

 preempt_count() - returns the preempt_count value as rvalue
 preempt_count_set() - Allows setting the preempt-count value

Also provide preempt_count_ptr() as a convenience wrapper to implement
all modifying operations.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
[ Fixed build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:32 +02:00
Peter Zijlstra ea81174789 sched, idle: Fix the idle polling state logic
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.

The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).

Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).

Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: lenb@kernel.org
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 13:53:10 +02:00
Peter Zijlstra b021fe3e25 sched, rcu: Make RCU use resched_cpu()
We're going to deprecate and remove set_need_resched() for it will do
the wrong thing. Make an exception for RCU and allow it to use
resched_cpu() which will do the right thing.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-2eywnacjl1nllctl1nszqa5w@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 13:53:08 +02:00
Michael S. Tsirkin 4314895165 sched: Micro-optimize by dropping unnecessary task_rq() calls
We always know the rq used, let's just pass it around.
This seems to cut the size of scheduler core down a tiny bit:

Before:

  [linux]$ size kernel/sched/core.o.orig
     text    data     bss     dec     hex filename
    62760   16130    3876   82766   1434e kernel/sched/core.o.orig

After:

  [linux]$ size kernel/sched/core.o.patched
     text    data     bss     dec     hex filename
    62566   16130    3876   82572   1428c kernel/sched/core.o.patched

Probably speeds it up as well.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130922142054.GA11499@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 13:51:06 +02:00
Chuansheng Liu e2f0b88e84 kernel/reboot.c: re-enable the function of variable reboot_default
Commit 1b3a5d02ee ("reboot: move arch/x86 reboot= handling to generic
kernel") did some cleanup for reboot= command line, but it made the
reboot_default inoperative.

The default value of variable reboot_default should be 1, and if command
line reboot= is not set, system will use the default reboot mode.

[akpm@linux-foundation.org: fix comment layout]
Signed-off-by: Li Fei <fei.li@intel.com>
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Robin Holt <robinmholt@linux.com>
Cc: <stable@vger.kernel.org>	[3.11.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Konstantin Khlebnikov 8ac1c8d5de audit: fix endless wait in audit_log_start()
After commit 829199197a ("kernel/audit.c: avoid negative sleep
durations") audit emitters will block forever if userspace daemon cannot
handle backlog.

After the timeout the waiting loop turns into busy loop and runs until
daemon dies or returns back to work.  This is a minimal patch for that
bug.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Chuck Anderson <chuck.anderson@oracle.com>
Cc: Dan Duval <dan.duval@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:26 -07:00
Michal Hocko 9809b18fcf watchdog: update watchdog_thresh properly
watchdog_tresh controls how often nmi perf event counter checks per-cpu
hrtimer_interrupts counter and blows up if the counter hasn't changed
since the last check.  The counter is updated by per-cpu
watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh
period which guarantees that hrtimer is scheduled 2 times per the main
period.  Both hrtimer and perf event are started together when the
watchdog is enabled.

So far so good.  But...

But what happens when watchdog_thresh is updated from sysctl handler?

proc_dowatchdog will set a new sampling period and hrtimer callback
(watchdog_timer_fn) will use the new value in the next round.  The
problem, however, is that nobody tells the perf event that the sampling
period has changed so it is ticking with the period configured when it
has been set up.

This might result in an ear ripping dissonance between perf and hrtimer
parts if the watchdog_thresh is increased.  And even worse it might lead
to KABOOM if the watchdog is configured to panic on such a spurious
lockup.

This patch fixes the issue by updating both nmi perf even counter and
hrtimers if the threshold value has changed.

The nmi one is disabled and then reinitialized from scratch.  This has
an unpleasant side effect that the allocation of the new event might
fail theoretically so the hard lockup detector would be disabled for
such cpus.  On the other hand such a memory allocation failure is very
unlikely because the original event is deallocated right before.

It would be much nicer if we just changed perf event period but there
doesn't seem to be any API to do that right now.  It is also unfortunate
that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
cannot use on_each_cpu() and do the same thing from the per-cpu context.
The update from the current CPU should be safe because
perf_event_disable removes the event atomically before it clears the
per-cpu watchdog_ev so it cannot change anything under running handler
feet.

The hrtimer is simply restarted (thanks to Don Zickus who has pointed
this out) if it is queued because we cannot rely it will fire&adopt to
the new sampling period before a new nmi event triggers (when the
treshold is decreased).

[akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:25 -07:00
Michal Hocko 359e6fab66 watchdog: update watchdog attributes atomically
proc_dowatchdog doesn't synchronize multiple callers which might lead to
confusion when two parallel callers might confuse watchdog_enable_all_cpus
resp watchdog_disable_all_cpus (eg watchdog gets enabled even if
watchdog_thresh was set to 0 already).

This patch adds a local mutex which synchronizes callers to the sysctl
handler.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24 17:00:25 -07:00
Kirill Tkhai 5d5a08003d rcu: Fix CONFIG_RCU_NOCB_CPU_ALL panic on machines with sparse CPU mask
Some architectures have sparse cpu mask. UltraSparc's cpuinfo for example:

CPU0: online
CPU2: online

So, set only possible CPUs when CONFIG_RCU_NOCB_CPU_ALL is enabled.

Also, check that user passes right 'rcu_nocbs=' option.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
CC: Dipankar Sarma <dipankar@in.ibm.com>
[ paulmck: Fix pr_info() issue noted by scripts/checkpatch.pl. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 14:11:11 -07:00
Paul E. McKenney 15f5191b6a rcu: Avoid sparse warnings in rcu_nocb_wake trace event
The event-tracing macros do not like bool tracing arguments, so this
commit makes them be of type char.  This change has the knock-on effect
of making it illegal to pass a pointer into one of these arguments, so
also change rcutiny's first call to trace_rcu_batch_end() to convert
from pointer to boolean, prefixing with "!!".

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:18:17 -07:00
Paul E. McKenney 69a79bb12a rcu: Track rcu_nocb_kthread()'s sleeping and awakening
This commit adds event traces to track all of rcu_nocb_kthread()'s
blocking and awakening.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:18:16 -07:00
Paul E. McKenney 756cbf6bef rcu: Distinguish between NOCB and non-NOCB rcu_callback trace events
One way to distinguish between NOCB and non-NOCB rcu_callback trace
events is that the former always print zero for the lazy and non-lazy
queue lengths.  Unfortunately, this also means that we cannot see the NOCB
queue lengths.  This commit therefore accesses the NOCB queue lengths,
but negates them.  NOCB rcu_callback trace events should therefore have
negative queue lengths.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Match operand size per kbuild test robot's advice. ]
2013-09-23 09:18:14 -07:00
Paul E. McKenney 9261dd0da6 rcu: Add tracing for rcuo no-CBs CPU wakeup handshake
Lost wakeups from call_rcu() to the rcuo kthreads can result in hangs
that are difficult to diagnose.  This commit therefore adds tracing to
help pin down the cause of these hangs.

Reported-by: Clark Williams <williams@redhat.com>
Reported-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add const per kbuild test robot's advice. ]
2013-09-23 09:18:13 -07:00
Paul E. McKenney bb311eccbd rcu: Add tracing of normal (non-NOCB) grace-period requests
This commit adds tracing to the normal grace-period request points.
These are rcu_gp_cleanup(), which checks for the need for another
grace period at the end of the previous grace period, and
rcu_start_gp_advanced(), which restarts RCU's state machine after
an idle period.  These trace events are intended to help track down
bugs where RCU remains idle despite there being work for it to do.

Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:18:08 -07:00
Paul E. McKenney 63c4db78e8 rcu: Add tracing to rcu_gp_kthread()
This commit adds tracing to the rcu_gp_kthread() function in order to
help trace down hangs potentially involving this kthread.

Reported-by: Clark Williams <williams@redhat.com>
Reported-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:16:14 -07:00
Paul E. McKenney 591c6d1710 rcu: Flag lockless access to ->gp_flags with ACCESS_ONCE()
This commit applies ACCESS_ONCE() to an outside-of-lock access to
->gp_flags.  Although it is hard to imagine any sane compiler messing
this particular case up, the documentation benefits are substantial.
Plus the definition of "sane compiler" grows ever looser.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:16:13 -07:00
Paul E. McKenney 88d6df612c rcu: Prevent spurious-wakeup DoS attack on rcu_gp_kthread()
Spurious wakeups in the force-quiescent-state loop in rcu_gp_kthread()
cause the timeout to be recalculated, which would prevent rcu_gp_fqs()
from ever being called.  This would in turn would prevent the grace period
from ever ending for as long as there was at least one CPU in an extended
quiescent state that had not yet passed through a quiescent state.

This commit therefore avoids recalculating the timeout unless the
previous pass's call to wait_event_interruptible_timeout() actually
did time out, thus preventing the above scenario.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:16:11 -07:00
Paul E. McKenney f7be820939 rcu: Improve grace-period start logic
This commit improves grace-period start logic by checking ->gp_flags
under the lock and by issuing a warning if a grace period is already
in progress.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:16:10 -07:00
Paul E. McKenney 0d75292467 rcu: Have rcutiny tracepoints use tracepoint_string()
This commit extends the work done in f7f7bac9 (rcu: Have the RCU
tracepoints use the tracepoint_string infrastructure) to cover rcutiny.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2013-09-23 09:15:31 -07:00
Paul E. McKenney 26cdfedf6a rcu: Reject memory-order-induced stall-warning false positives
If a system is idle from an RCU perspective for longer than specified
by CONFIG_RCU_CPU_STALL_TIMEOUT, and if one CPU starts a grace period
just as a second checks for CPU stalls, and if this second CPU happens
to see the old value of rsp->jiffies_stall, it will incorrectly report a
CPU stall.  This is quite rare, but apparently occurs deterministically
on systems with about 6TB of memory.

This commit therefore orders accesses to the data used to determine
whether or not a CPU stall is in progress.  Grace-period initialization
and cleanup first increments rsp->completed to mark the end of the
previous grace period, then records the current jiffies in rsp->gp_start,
then records the jiffies at which a stall can be expected to occur in
rsp->jiffies_stall, and finally increments rsp->gpnum to mark the start
of the new grace period.  Now, this ordering by itself does not prevent
false positives.  For example, if grace-period initialization was delayed
between recording rsp->gp_start and rsp->jiffies_stall, the CPU stall
warning code might still see an old value of rsp->jiffies_stall.

Therefore, this commit also orders the CPU stall warning accesses as
well, loading rsp->gpnum and jiffies, then rsp->jiffies_stall, then
rsp->gp_start, and finally rsp->completed.  This ordering means that
the false-positive scenario in the previous paragraph would result
in rsp->completed being greater than or equal to rsp->gpnum, which is
never valid for a CPU stall, allowing the false positive to be rejected.
Furthermore, any fetch that gets an old value of rsp->jiffies_stall
must also get an old value of rsp->gpnum, which will again be rejected
by the comparison of rsp->gpnum and rsp->completed.  Situations where
rsp->gp_start is later than rsp->jiffies_stall are also rejected, as
are situations where jiffies is less than rsp->jiffies_stall.

Although use of unsynchronized accesses means that there are likely
still some false-positive scenarios (synchronization has proven to be
a very bad idea on large systems), this should get rid of a large class
of these scenarios.

Reported-by: Fabian Herschel <fabian.herschel@suse.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Jochen Striepe <jochen@tolot.escape.de>
2013-09-23 09:15:30 -07:00
Paul E. McKenney 69c8d28c96 rcu: Micro-optimize rcu_cpu_has_callbacks()
The for_each_rcu_flavor() loop unconditionally scans all flavors, even
when the first flavor might have some non-lazy callbacks.  Once the
loop has seen a non-lazy callback, further passes through the loop
cannot change the state.  This is not a huge problem, given that there
can be at most three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched),
but this code is on the path to idle, so speeding it up even a small
amount would have some benefit.

This commit therefore does two things:

1.	Rearranges the order of the list of RCU flavors in order to
	place the most active flavor first in the list.  The most active
	RCU flavor is RCU-preempt, or, if there is no RCU-preempt,
	RCU-sched.

2.	Reworks the for_each_rcu_flavor() to exit early when the first
	non-lazy callback is seen, or, in the case where the caller
	does not care about non-lazy callbacks (RCU_FAST_NO_HZ=n),
	when the first callback is seen.

Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:28 -07:00
Paul E. McKenney 289828e62d rcu: Silence unused-variable warnings
The "idle" variable in both rcu_eqs_enter_common() and
rcu_eqs_exit_common() is only used in a WARN_ON_ONCE().  If the kernel
is built disabling WARN_ON_ONCE(), the compiler will complain (rightly)
that "idle" is unused.  This commit therefore adds a __maybe_unused to
the declaration of both variables.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:27 -07:00
Christoph Lameter c9d4b0af9e rcu: Replace __get_cpu_var() uses
__get_cpu_var() is used for multiple purposes in the kernel source. One
of them is address calculation via the form &__get_cpu_var(x). This
calculates the address for the instance of the percpu variable of the
current processor based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However,
store and retrieve operations could use a segment prefix (or global
register on other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into
a percpu area and use optimized assembly code to read and write per
cpu variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations
that use the offset. Thereby address calcualtions are avoided and less
registers are used when code is generated.

At the end of the patchset all uses of __get_cpu_var have been removed
so the macro is removed too.

The patchset includes passes over all arches as well. Once these
operations are used throughout then specialized macros can be defined in
non -x86 arches as well in order to optimize per cpu access by f.e. using
a global register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
   variable.

	DEFINE_PER_CPU(int, u);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(this_cpu_ptr(&x), y, sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	this_cpu_inc(y)

Signed-off-by: Christoph Lameter <cl@linux.com>
[ paulmck: Address conflicts. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:15:22 -07:00
Paul E. McKenney 829511d8aa rcu: Fix dubious "if" condition in __call_rcu_nocb_enqueue()
This commit replaces an incorrect (but fortunately functional)
bitwise OR ("|") operator with the correct logical OR ("||").

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:13:44 -07:00
Paul E. McKenney 01896f7e0a rcu: Convert local functions to static
The rcu_cpu_stall_timeout kernel parameter, the rcu_dynticks per-CPU
variable, and the rcu_gp_fqs() function are used only locally.  This
commit therefore marks them as static.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-23 09:12:31 -07:00
Paul E. McKenney b3f2d02598 rcu: Use proper cpp macro for ->gp_flags
One of the ->gp_flags assignments used a raw number rather than the
cpp macro that was intended for this purpose, which this commit fixes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-09-20 09:43:06 -07:00
Jason Low f48627e686 sched/balancing: Periodically decay max cost of idle balance
This patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain by approximately 1% per second. Also
decay the rq's max_idle_balance_cost value.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:03:46 +02:00
Jason Low 9bd721c55c sched/balancing: Consider max cost of idle balance per sched domain
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less then the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.

By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:03:44 +02:00
Jason Low abfafa54db sched: Reduce overestimating rq->avg_idle
When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.

This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:03:41 +02:00
Vladimir Davydov 7aff2e3a56 sched/balancing: Prevent the reselection of a previous env.dst_cpu if some tasks are pinned
Currently new_dst_cpu is prevented from being reselected actually, not
dst_cpu. This can result in attempting to pull tasks to this_cpu twice.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/281f59b6e596c718dd565ad267fc38f5b8e5c995.1379265590.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:02:20 +02:00
Ingo Molnar 40a0c68ca9 Merge branch 'sched/urgent' into sched/core
Merge in the latest fixes before applying a dependent patch.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:01:01 +02:00
Vladimir Davydov 7e3115ef51 sched/balancing: Fix cfs_rq->task_h_load calculation
Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 11:59:39 +02:00
Vladimir Davydov 3029ede393 sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
In busiest->group_imb case we can come to fix_small_imbalance() with
local->avg_load > busiest->avg_load. This can result in wrong imbalance
fix-up, because there is the following check there where all the
members are unsigned:

if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
    (scaled_busy_load_per_task * imbn)) {
	env->imbalance = busiest->load_per_task;
	return;
}

As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.

Fix it by substituting the subtraction with an equivalent addition in
the check.

[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
  belonging to different cores on an HT-enabled machine with N logical
  cpus: just look at se.nr_migrations growth. ]

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 11:59:38 +02:00
Vladimir Davydov b18855500f sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows

env->imbalance = min(
	max_pull * busiest->group_power,
	(sds->avg_load - local->avg_load) * local->group_power) / SCHED_POWER_SCALE;

As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.

Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.

[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
  belonging to different cores on an HT-enabled machine with N logical
  cpus: just look at se.nr_migrations growth. ]

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 11:59:36 +02:00
Peter Zijlstra fa73158710 perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
Solve the problems around the broken definition of perf_event_mmap_page::
cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
fixed by:

  860f085b74 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

The problem with the fix (merged in v3.12-rc1 and not yet released
officially), noticed by Vince Weaver is that the new behavior is
not detectable by new user-space, and that due to the reuse of the
field names it's easy to mis-compile a binary if old headers are used
on a new kernel or new headers are used on an old kernel.

To solve all that make this change explicit, detectable and self-contained,
by iterating the ABI the following way:

 - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
   confuse old user-space binaries. RDPMC will be marked as unavailable
   to old binaries but that's within the ABI, this is a capability bit.

 - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
   libraries can reliably detect that bit 0 is deprecated and perma-zero
   without having to check the kernel version.

 - Use bits 2, 3, 4 for the newly defined, correct functionality:

	cap_user_rdpmc		: 1, /* The RDPMC instruction can be used to read counts */
	cap_user_time		: 1, /* The time_* fields are used */
	cap_user_time_zero	: 1, /* The time_zero field is used */

 - Rename all the bitfield names in perf_event.h to be different from the
   old names, to make sure it's not possible to mis-compile it
   accidentally with old assumptions.

The 'size' field can then be used in the future to add new fields and it
will act as a natural ABI version indicator as well.

Also adjust tools/perf/ userspace for the new definitions, noticed by
Adrian Hunter.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Also-Fixed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 09:45:11 +02:00
Linus Torvalds 9d2cd7048b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "An NTP related lockup fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix HRTICK related deadlock from ntp lock changes
2013-09-18 11:24:49 -05:00
Linus Torvalds 7e28b2712e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix comment for sched_info_depart
  sched/Documentation: Update sched-design-CFS.txt documentation
  sched/debug: Take PID namespace into account
  sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
2013-09-18 11:23:32 -05:00
Elad Wexler 233bcb411c clocksource: Fix 'ret' data type of sysfs_override_clocksource() and sysfs_unbind_clocksource()
sysfs_override_clocksource(): The expression 'if (ret >= 0)' is always true.
This will cause clocksource_select() to always run.
Thus modified ret to be of type ssize_t.

sysfs_unbind_clocksource(): The expression 'if (ret < 0)' is always false.
So in case sysfs_get_uname() failed, the expression won't take an effect.
Thus modified ret to be of type ssize_t.

Signed-off-by: Elad Wexler <elad.wexler@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-09-17 11:19:27 -07:00
John Stultz 389e067032 Merge branch 'fortglx/3.12/time' into fortglx/3.13/time
Merge in the timekeeping changes that missed 3.12

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-09-16 18:54:07 -07:00
John Stultz 19c3205cea Merge branch 'fortglx/3.12/sched-clock64-base' into fortglx/3.13/time
Merge in 64bit sched_clock support that missed 3.12.

Conflicts:
	kernel/time/sched_clock.c

Signed-off-by: John.Stultz <john.stultz@linaro.org>
2013-09-16 18:52:52 -07:00
Michael S. Tsirkin 13b62e46d5 sched: Fix comment for sched_info_depart
sched_info_depart seems to be only called from
sched_info_switch(), so only on involuntary task switch.

Fix the comment to match.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-16 11:18:34 +02:00
Linus Torvalds 9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Martin Schwidefsky 0244ad004a Remove GENERIC_HARDIRQ config option
After the last architecture switched to generic hard irqs the config
options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
for !CONFIG_GENERIC_HARDIRQS can be removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-13 15:09:52 +02:00
Linus Torvalds ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Jingoo Han 6072ddc852 kernel: replace strict_strto*() with kstrto*()
The usage of strict_strto*() is not preferred, because strict_strto*() is
obsolete.  Thus, kstrto*() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Sha Zhengju 1a36e59d48 memcg: reduce function dereference
This function dereferences res far too often, so optimize it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju 3af3351676 memcg: avoid overflow caused by PAGE_ALIGN
Since PAGE_ALIGN is aligning up(the next page boundary), so after
PAGE_ALIGN, the value might be overflow, such as write the MAX value to
*.limit_in_bytes.

  $ cat /cgroup/memory/memory.limit_in_bytes
  18446744073709551615

  # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
  bash: echo: write error: Invalid argument

Some user programs might depend on such behaviours(like libcg, we read
the value in snapshot, then use the value to reset cgroup later), and
that will cause confusion.  So we need to fix it.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Sha Zhengju 6de5a8bfca memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Linus Torvalds 26935fb06e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 4 from Al Viro:
 "list_lru pile, mostly"

This came out of Andrew's pile, Al ended up doing the merge work so that
Andrew didn't have to.

Additionally, a few fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
  super: fix for destroy lrus
  list_lru: dynamically adjust node arrays
  shrinker: Kill old ->shrink API.
  shrinker: convert remaining shrinkers to count/scan API
  staging/lustre/libcfs: cleanup linux-mem.h
  staging/lustre/ptlrpc: convert to new shrinker API
  staging/lustre/obdclass: convert lu_object shrinker to count/scan API
  staging/lustre/ldlm: convert to shrinkers to count/scan API
  hugepage: convert huge zero page shrinker to new shrinker API
  i915: bail out earlier when shrinker cannot acquire mutex
  drivers: convert shrinkers to new count/scan API
  fs: convert fs shrinkers to new scan/count API
  xfs: fix dquot isolation hang
  xfs-convert-dquot-cache-lru-to-list_lru-fix
  xfs: convert dquot cache lru to list_lru
  xfs: rework buffer dispose list tracking
  xfs-convert-buftarg-lru-to-generic-code-fix
  xfs: convert buftarg LRU to generic code
  fs: convert inode and dentry shrinking to be node aware
  vmscan: per-node deferred work
  ...
2013-09-12 15:01:38 -07:00
Linus Torvalds 02b9735c12 ACPI and power management fixes for 3.12-rc1
1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events
 
   After the recent ACPIPHP changes we've seen some interesting breakage
   on a system that triggers device check notifications during boot for
   non-existing devices.  Although those notifications are really
   spurious, we should be able to deal with them nevertheless and that
   shouldn't introduce too much overhead.  Four commits to make that
   work properly.
 
  2) Memory hotplug and hibernation mutual exclusion rework
 
   This was maent to be a cleanup, but it happens to fix a classical
   ABBA deadlock between system suspend/hibernation and ACPI memory
   hotplug which is possible if they are started roughly at the same
   time.  Three commits rework memory hotplug so that it doesn't
   acquire pm_mutex and make hibernation use device_hotplug_lock
   which prevents it from racing with memory hotplug.
 
  3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix
 
   The ACPI LPSS driver crashes during boot on Apple Macbook Air with
   Haswell that has slightly unusual BIOS configuration in which one
   of the LPSS device's _CRS method doesn't return all of the information
   expected by the driver.  Fix from Mika Westerberg, for stable.
 
  4) ACPICA fix related to Store->ArgX operation
 
   AML interpreter fix for obscure breakage that causes AML to be
   executed incorrectly on some machines (observed in practice).  From
   Bob Moore.
 
  5) ACPI core fix for PCI ACPI device objects lookup
 
   There still are cases in which there is more than one ACPI device
   object matching a given PCI device and we don't choose the one that
   the BIOS expects us to choose, so this makes the lookup take more
   criteria into account in those cases.
 
  6) Fix to prevent cpuidle from crashing in some rare cases
 
   If the result of cpuidle_get_driver() is NULL, which can happen on
   some systems, cpuidle_driver_ref() will crash trying to use that
   pointer and the Daniel Fu's fix prevents that from happening.
 
  7) cpufreq fixes related to CPU hotplug
 
   Stephen Boyd reported a number of concurrency problems with cpufreq
   related to CPU hotplug which are addressed by a series of fixes
   from Srivatsa S Bhat and Viresh Kumar.
 
  8) cpufreq fix for time conversion in time_in_state attribute
 
   Time conversion carried out by cpufreq when user space attempts to
   read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state won't
   work correcty if cputime_t doesn't map directly to jiffies.  Fix
   from Andreas Schwab.
 
  9) Revert of a troublesome cpufreq commit
 
   Commit 7c30ed5 (cpufreq: make sure frequency transitions are
   serialized) was intended to address some known concurrency problems
   in cpufreq related to the ordering of transitions, but unfortunately
   it introduced several problems of its own, so I decided to revert it
   now and address the original problems later in a more robust way.
 
 10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.
 
 11) cpufreq fixes related to system suspend/resume
 
   The recent cpufreq changes that made it preserve CPU sysfs attributes
   over suspend/resume cycles introduced a possible NULL pointer
   dereference that caused it to crash during the second attempt to
   suspend.  Three commits from Srivatsa S Bhat fix that problem and a
   couple of related issues.
 
 12) cpufreq locking fix
 
   cpufreq_policy_restore() should acquire the lock for reading, but
   it acquires it for writing.  Fix from Lan Tianyu.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSMbdRAAoJEKhOf7ml8uNsiFkQAKSh1iBXuiUCxBApEGZgoQio
 8lmnuyWdhNQWdjZTnh7ptjpDxdrWhxcoxvoaGABU++reDObjef1QnyrQtdO3r8dl
 oy0C/YGh5kq5SIffIDEwPIb/ipDe/47cgRMW8iBlnViDa1MJBqICuLyefcTRIrKp
 QGvv0owUM2o7TXpA10+qm8zXjv6m5mu1DTtxYI+2Eodhwi54neAqb+aKMspa2thy
 V9KFcVv3Td4rJrNvw6BhXNM81QbaYpRxaK3DRr1T6SM++EKvbqYFA1jgW24YvqTL
 nrCZlDMb6KRww5DCxA/ns9Kx5H+ZyicoRwdtAM3PBYA6MGqsLqPozC/8VKV1fSvZ
 sgUdbUSuLqKRAkOqM1bjKAhi9PdCGBvkQAg2AqbRK6IBl4HJC8xhdb5E6eZ/J42G
 GyNBpKef7wVJwYKXE2hSChZ5dYjqMizNHWxFHf8Xy1dveExbQ2nmSJmaWMy2A3kx
 YOXFkcTV5F6GOIZB8WCRruzUalff9xal4G+iVhGF+AZIOCm7bC+FDXfwIS82uVor
 ej2l+uQLLZCB499IRmM6942ZIAXshmtN7eRfGtKBc6jsbSCEdQDqf1Z7oRwqAD6h
 WkD/k/zz30CyM8y4snOkAXkZgqAQsZodtqfowE3e9OHd51tfcNiqdht+obwCx+eD
 MWXc2xATMAX6NcZTXSZS
 =U/Jw
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes from Rafael Wysocki:
 "All of these commits are fixes that have emerged recently and some of
  them fix bugs introduced during this merge window.

  Specifics:

   1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

      After the recent ACPIPHP changes we've seen some interesting
      breakage on a system that triggers device check notifications
      during boot for non-existing devices.  Although those
      notifications are really spurious, we should be able to deal with
      them nevertheless and that shouldn't introduce too much overhead.
      Four commits to make that work properly.

   2) Memory hotplug and hibernation mutual exclusion rework

      This was maent to be a cleanup, but it happens to fix a classical
      ABBA deadlock between system suspend/hibernation and ACPI memory
      hotplug which is possible if they are started roughly at the same
      time.  Three commits rework memory hotplug so that it doesn't
      acquire pm_mutex and make hibernation use device_hotplug_lock
      which prevents it from racing with memory hotplug.

   3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

      The ACPI LPSS driver crashes during boot on Apple Macbook Air with
      Haswell that has slightly unusual BIOS configuration in which one
      of the LPSS device's _CRS method doesn't return all of the
      information expected by the driver.  Fix from Mika Westerberg, for
      stable.

   4) ACPICA fix related to Store->ArgX operation

      AML interpreter fix for obscure breakage that causes AML to be
      executed incorrectly on some machines (observed in practice).
      From Bob Moore.

   5) ACPI core fix for PCI ACPI device objects lookup

      There still are cases in which there is more than one ACPI device
      object matching a given PCI device and we don't choose the one
      that the BIOS expects us to choose, so this makes the lookup take
      more criteria into account in those cases.

   6) Fix to prevent cpuidle from crashing in some rare cases

      If the result of cpuidle_get_driver() is NULL, which can happen on
      some systems, cpuidle_driver_ref() will crash trying to use that
      pointer and the Daniel Fu's fix prevents that from happening.

   7) cpufreq fixes related to CPU hotplug

      Stephen Boyd reported a number of concurrency problems with
      cpufreq related to CPU hotplug which are addressed by a series of
      fixes from Srivatsa S Bhat and Viresh Kumar.

   8) cpufreq fix for time conversion in time_in_state attribute

      Time conversion carried out by cpufreq when user space attempts to
      read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
      won't work correcty if cputime_t doesn't map directly to jiffies.
      Fix from Andreas Schwab.

   9) Revert of a troublesome cpufreq commit

      Commit 7c30ed5 (cpufreq: make sure frequency transitions are
      serialized) was intended to address some known concurrency
      problems in cpufreq related to the ordering of transitions, but
      unfortunately it introduced several problems of its own, so I
      decided to revert it now and address the original problems later
      in a more robust way.

  10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

  11) cpufreq fixes related to system suspend/resume

      The recent cpufreq changes that made it preserve CPU sysfs
      attributes over suspend/resume cycles introduced a possible NULL
      pointer dereference that caused it to crash during the second
      attempt to suspend.  Three commits from Srivatsa S Bhat fix that
      problem and a couple of related issues.

  12) cpufreq locking fix

      cpufreq_policy_restore() should acquire the lock for reading, but
      it acquires it for writing.  Fix from Lan Tianyu"

* tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
  cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
  cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
  cpufreq: Restructure if/else block to avoid unintended behavior
  cpufreq: Fix crash in cpufreq-stats during suspend/resume
  intel_pstate: Add Haswell CPU models
  Revert "cpufreq: make sure frequency transitions are serialized"
  cpufreq: Use signed type for 'ret' variable, to store negative error values
  cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
  cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
  cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
  cpufreq: Split __cpufreq_remove_dev() into two parts
  cpufreq: Fix wrong time unit conversion
  cpufreq: serialize calls to __cpufreq_governor()
  cpufreq: don't allow governor limits to be changed when it is disabled
  ACPI / bind: Prefer device objects with _STA to those without it
  ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
  ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
  ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
  ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
  ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
  ...
2013-09-12 11:22:45 -07:00
Linus Torvalds 75acebf242 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes.

  The -g perf report lockup you reported is only partially addressed,
  patches that fix the excessive runtime are still being worked on"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix uncore PCI fixed counter handling
  uprobes: Fix utask->depth accounting in handle_trampoline()
  perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
  perf: Fix up MMAP2 buffer space reservation
  perf tools: Add attr->mmap2 support
  perf kvm: Fix sample_type manipulation
  perf evlist: Fix id pos in perf_evlist__open()
  perf trace: Handle perf.data files with no tracepoints
  perf session: Separate progress bar update when processing events
  perf trace: Check if MAP_32BIT is defined
  perf hists: Fix formatting of long symbol names
  perf evlist: Fix parsing with no sample_id_all bit set
  perf tools: Add test for parsing with no sample_id_all bit
  perf trace: Check control+C more often
2013-09-12 10:44:54 -07:00
Linus Torvalds b55ee2816e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Performance regression fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix load balancing performance regression in should_we_balance()
2013-09-12 10:44:13 -07:00
Peter Zijlstra c61037e905 sched/fair: Fix the group_capacity computation
Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting
the capacity to the actual number of cores.

The assumption of 1 < smt_power < 2 is an actual requirement because
of what SMT is so this should work regardless of the SMT
implementation.

It can still be defeated by creative use of cpu hotplug, but if you're
one of those freaks, you get to live with it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guitto@linaro.org>
Link: http://lkml.kernel.org/n/tip-dczmbi8tfgixacg1ji2av1un@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:45 +02:00
Peter Zijlstra b37d931685 sched/fair: Rework and comment the group_capacity code
Pull out the group_capacity computation so that we can more clearly
comment its issues.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-az1hl1ya55k361nkeh9bj0yw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:44 +02:00
Peter Zijlstra 863bffc808 sched/fair: Fix group power_orig computation
When looking at the code I noticed we don't actually compute
sgp->power_orig correctly for groups, fix that.

Currently the only consumer of that value is fix_small_capacity()
which is only used on POWER7+ and that code excludes this case by
being limited to SD_SHARE_CPUPOWER which is only ever set on the SMT
domain which must be the lowest domain and this has singleton groups.

So nothing should be affected by this change.

Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-db2pe0vxwunv37plc7onnugj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:43 +02:00
Peter Zijlstra b72ff13ce6 sched/fair: Reduce local_group logic
Try and reduce the local_group logic by pulling most of it into
update_sd_lb_stats.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-mgezl354xgyhiyrte78fdkpd@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:43 +02:00
Peter Zijlstra 6263322c5e sched/fair: Rewrite group_imb trigger
Change the group_imb detection from the old 'load-spike' detector to
an actual imbalance detector. We set it from the lower domain balance
pass when it fails to create a balance in the presence of task
affinities.

The advantage is that this should no longer generate the false
positive group_imb conditions generated by transient load spikes from
the normal balancing/bulk-wakeup etc. behaviour.

While I haven't actually observed those they could happen.

I'm not entirely happy with this patch; it somehow feels a little
fragile.

Nor does it solve the biggest issue I have with the group_imb code; it
it still a fragile construct in that once we 'fixed' the imbalance
we'll not detect the group_imb again and could end up re-creating it.

That said, this patch does seem to preserve behaviour for the
described degenerate case. In particular on my 2*6*2 wsm-ep:

  taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'

ends up with 9 spinners, each on their own CPU; whereas if you disable
the group_imb code that typically doesn't happen (you'll get one pair
sharing a CPU most of the time).

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:42 +02:00
Peter Zijlstra fc840914e9 sched/debug: Take PID namespace into account
Emmanuel reported that /proc/sched_debug didn't report the right PIDs
when using namespaces, cure this.

Reported-by: Emmanuel Deloget <emmanuel.deloget@efixo.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:16 +02:00
Daisuke Nishimura 6c9a27f5da sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.

        parent doing fork()      | someone moving the parent to another cgroup
  -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's new cgroup's refcount that is incremented at (*1),
so the old cgroup(and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
     [<ffffffff81051f25>] place_entity+0x75/0xa0
     [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
     [<ffffffff81063c0b>] sched_fork+0x6b/0x140
     [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
     [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
     [<ffffffff8106d2f4>] do_fork+0x94/0x460
     [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
     [<ffffffff81009598>] sys_clone+0x28/0x30
     [<ffffffff8100b393>] stub_clone+0x13/0x20
     [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 19:14:14 +02:00
Oleg Nesterov 878b5a6efd uprobes: Fix utask->depth accounting in handle_trampoline()
Currently utask->depth is simply the number of allocated/pending
return_instance's in uprobe_task->return_instances list.

handle_trampoline() should decrement this counter every time we
handle/free an instance, but due to typo it does this only if
->chained == T. This means that in the likely case this counter
is never decremented and the probed task can't report more than
MAX_URETPROBE_DEPTH events.

Reported-by: Mikhail Kulemin <Mikhail.Kulemin@ru.ibm.com>
Reported-by: Hemant Kumar Shaw <hkshaw@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Cc: masami.hiramatsu.pt@hitachi.com
Cc: srikar@linux.vnet.ibm.com
Cc: systemtap@sourceware.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12 08:00:55 +02:00