Commit Graph

211 Commits (34845c939082093354a6ffbb1ebff599e30b9b22)

Author SHA1 Message Date
Michael Ellerman 8404410b29 Merge branch 'topic/livepatch' into next
Merge the support for live patching on ppc64le using mprofile-kernel.
This branch has also been merged into the livepatching tree for v4.7.
2016-04-18 20:45:32 +10:00
Michael Ellerman 85baa09549 powerpc/livepatch: Add live patching support on ppc64le
Add the kconfig logic & assembly support for handling live patched
functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn
depends on the new -mprofile-kernel ftrace ABI, which is only supported
currently on ppc64le.

Live patching is handled by a special ftrace handler. This means it runs
from ftrace_caller(). The live patch handler modifies the NIP so as to
redirect the return from ftrace_caller() to the new patched function.

However there is one particularly tricky case we need to handle.

If a function A calls another function B, and it is known at link time
that they share the same TOC, then A will not save or restore its TOC,
and will call the local entry point of B.

When we live patch B, we replace it with a new function C, which may
not have the same TOC as A. At live patch time it's too late to modify A
to do the TOC save/restore, so the live patching code must interpose
itself between A and C, and do the TOC save/restore that A omitted.

An additional complication is that the livepatch code cannot create a
stack frame in order to save the TOC. That is because if C takes > 8
arguments, or is varargs, A will have written the arguments for C in
A's stack frame.

To solve this, we introduce a "livepatch stack" which grows upward from
the base of the regular stack, and is used to store the TOC & LR when
calling a live patched function.

When the patched function returns, we retrieve the real LR & TOC from
the livepatch stack, restore them, and pop the livepatch "stack frame".
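
In C terms the livepatch stack behaves roughly like this minimal sketch
(assumed names; the real implementation is a handful of assembly
instructions in the ftrace/livepatch entry code):

struct lp_frame {
        unsigned long toc;      /* caller's r2 */
        unsigned long lr;       /* real return address into the caller */
};

/* The livepatch stack grows upward, away from the normal stack, so the
 * caller's stack frame (which may hold the callee's arguments) is never
 * disturbed. */
static void lp_push(struct lp_frame **sp, unsigned long toc, unsigned long lr)
{
        (*sp)->toc = toc;
        (*sp)->lr = lr;
        (*sp)++;
}

static void lp_pop(struct lp_frame **sp, unsigned long *toc, unsigned long *lr)
{
        (*sp)--;
        *toc = (*sp)->toc;
        *lr = (*sp)->lr;
}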

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
2016-04-14 15:48:06 +10:00
Cyril Bur 6e669f085d powerpc: Fix unrecoverable SLB miss during restore_math()
Commit 70fe3d9 "powerpc: Restore FPU/VEC/VSX if previously used" introduces a
call to restore_math() late in the syscall return path, after MSR_RI has been
cleared. The MSR_RI flag is used to indicate whether the kernel can take
another exception or not. A cleared MSR_RI flag indicates that the kernel
cannot.

Unfortunately when a machine is under SLB pressure an SLB miss can occur
in restore_math() which (with MSR_RI cleared) leads to an unrecoverable
exception.

  Unrecoverable exception 4100 at c0000000000088d8
  cpu 0x0: Vector: 4100  at [c0000003fa473b20]
      pc: c0000000000088d8: .load_vr_state+0x70/0x110
      lr: c00000000000f710: .restore_math+0x130/0x188
      sp: c0000003fa473da0
     msr: 9000000002003030
    current = 0xc0000007f876f180
    paca    = 0xc00000000fff0000	 softe: 0	 irq_happened: 0x01
      pid   = 1944, comm = K08umountfs
  [link register   ] c00000000000f710 .restore_math+0x130/0x188
  [c0000003fa473da0] c0000003fa473e30 (unreliable)
  [c0000003fa473e30] c000000000007b6c system_call+0x84/0xfc

The clearing of MSR_RI is actually an optimisation to avoid multiple MSR
writes; what must actually be disabled are interrupts. See the comment in
entry_64.S:

  /*
   * For performance reasons we clear RI the same time that we
   * clear EE. We only need to clear RI just before we restore r13
   * below, but batching it with EE saves us one expensive mtmsrd call.
   * We have to be careful to restore RI if we branch anywhere from
   * here (eg syscall_exit_work).
   */

At the point of calling restore_math(), r13 has not been restored. As such,
the quick fix of turning MSR_RI back on for the call to restore_math()
eliminates the occurrence of an unrecoverable exception.

We'd like to do a better fix in future.
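
Conceptually the quick fix amounts to this sketch (set_msr_ri() is a
hypothetical helper; the real change is a couple of mtmsrd instructions
in entry_64.S):

struct pt_regs;
void set_msr_ri(int on);                 /* hypothetical helper */
void restore_math(struct pt_regs *regs);

static void restore_math_recoverable(struct pt_regs *regs)
{
        set_msr_ri(1);          /* MSR_RI on: an SLB miss is survivable */
        restore_math(regs);
        set_msr_ri(0);          /* back to the optimised RI-clear state */
}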

Fixes: 70fe3d980f ("powerpc: Restore FPU/VEC/VSX if previously used")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-16 15:23:02 +11:00
Michael Ellerman d8c0282f4d Merge branch 'topic/mprofile-kernel' into next
Merge the ftrace changes to support -mprofile-kernel on ppc64le. This is
a prerequisite for live patching, the support for which will be merged
via the livepatch tree based on this topic branch.
2016-03-11 11:20:15 +11:00
Torsten Duwe 153086644f powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI
The gcc switch -mprofile-kernel defines a new ABI for calling _mcount()
very early in the function with minimal overhead.

Although mprofile-kernel has been available since GCC 3.4, there were
bugs which were only fixed recently. Currently it is known to work in
GCC 4.9, 5 and 6.

Additionally there are two possible code sequences generated by the
flag, the first uses mflr/std/bl and the second is optimised to omit the
std. Currently only gcc 6 has the optimised sequence. This patch
supports both sequences.
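
As a hedged sketch, the patching code must accept either prologue, along
these lines (simplified; the encodings are mflr r0 = 0x7c0802a6 and
std r0,16(r1) = 0xf8010010):

/* Given the two instructions preceding the "bl _mcount" site, accept
 * either -mprofile-kernel prologue form. */
static int is_mprofile_prologue(unsigned int prev2, unsigned int prev1)
{
        /* gcc 6 optimised form:  mflr r0; bl _mcount */
        if (prev1 == 0x7c0802a6)
                return 1;
        /* older form:  mflr r0; std r0,16(r1); bl _mcount */
        return prev2 == 0x7c0802a6 && prev1 == 0xf8010010;
}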

Initial work started by Vojtech Pavlik, used with permission.

Key changes:
 - rework _mcount() to work for both the old and new ABIs.
 - implement new versions of ftrace_caller() and ftrace_graph_caller()
   which deal with the new ABI.
 - updates to __ftrace_make_nop() to recognise the new mcount calling
   sequence.
 - updates to __ftrace_make_call() to recognise the nop'ed sequence.
 - implement ftrace_modify_call().
 - updates to the module loader to suppress the TOC save in the module
   stub when calling mcount with the new ABI.

Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Torsten Duwe <duwe@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-07 14:53:55 +11:00
Cyril Bur 70fe3d980f powerpc: Restore FPU/VEC/VSX if previously used
Currently the FPU, VEC and VSX facilities are lazily loaded. This is not
a problem unless a process is using these facilities.

Modern versions of GCC are very good at automatically vectorising code,
new and modernised workloads make use of floating point and vector
facilities, even the kernel makes use of vectorised memcpy.

All this combined greatly increases the cost of a syscall, since the
kernel sometimes uses the facilities even in the syscall fast path, making
it increasingly common for a thread to take an *_unavailable exception soon
after a syscall, not to mention potentially taking all three.

The obvious overcompensation to this problem is to simply always load
all the facilities on every exit to userspace. Loading up all FPU, VEC
and VSX registers every time can be expensive and if a workload does
avoid using them, it should not be forced to incur this penalty.

An 8-bit counter is used to detect if the registers have been used in the
past, and the registers are always loaded until the value wraps back to
zero.
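
A hedged sketch of the idea (simplified, with assumed names):

struct thread_fp {
        unsigned char load_fp;  /* 8-bit: 255 + 1 wraps to 0 */
};

/* Returns 1 while the FP state should be eagerly reloaded on exit to
 * userspace; once the counter wraps to zero we fall back to lazy
 * loading via the fp_unavailable exception. */
static int should_restore_fp(struct thread_fp *t)
{
        if (!t->load_fp)
                return 0;
        t->load_fp++;
        return 1;
}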

Several versions of the assembly in entry_64.S were tested:

  1. Always calling C.
  2. Performing a common case check and then calling C.
  3. A complex check in asm.

After some benchmarking it was determined that avoiding C in the common
case is a performance benefit (option 2). The full check in asm (option
3) greatly complicated that codepath for a negligible performance gain
and the trade-off was deemed not worth it.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Move load_vec in the struct to fill an existing hole, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-03-02 23:34:48 +11:00
Michael Ellerman d8725ce86c powerpc/kernel: Open code SET_DEFAULT_THREAD_PPR
This is only used in one location, open code it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-17 22:40:57 +11:00
Michael Ellerman d030a4b5eb powerpc/kernel: Open code HMT_MEDIUM_LOW_HAS_PPR
HMT_MEDIUM_LOW_HAS_PPR is only used in one place, open code it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-17 22:40:57 +11:00
Anton Blanchard 68bfa962bf powerpc: Remove redundant mflr in _switch
No need to execute mflr twice.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-01 13:52:24 +11:00
Anton Blanchard 152d523e63 powerpc: Create context switch helpers save_sprs() and restore_sprs()
Move all our context switch SPR save and restore code into two
helpers. We do a few optimisations:

- Group all mfsprs and all mtsprs. In many cases an mtspr sets a
scoreboarding bit that an mfspr waits on, so the current practice of
mfspr A; mtspr A; mfspr B; mtspr B is the worst scheduling we can
do.

- SPR writes are slow, so check that the value is changing before
writing it.
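
For example, a hedged sketch of the second optimisation (kernel context
assumed; the fields shown are illustrative, SPRN_TAR is the real SPR
constant):

static void restore_sprs(struct thread_struct *old, struct thread_struct *new)
{
        /* mtspr is slow, so skip it when the value is unchanged */
        if (old->tar != new->tar)
                mtspr(SPRN_TAR, new->tar);
}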

A context switch microbenchmark using yield():

http://ozlabs.org/~anton/junkcode/context_switch2.c

./context_switch2 --test=yield 0 0

shows an improvement of almost 10% on POWER8.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-01 13:52:24 +11:00
Anton Blanchard 07e45c120c powerpc: Don't disable kernel FP/VMX/VSX MSR bits on context switch
Writing the MSR is slow, so we want to avoid it whenever possible.

A subsequent patch will add a debug option that strictly manages the
FP/VMX/VSX unavailable bits. For now just remove it, matching what
we do in other areas of the kernel (eg enable_kernel_altivec()).

A context switch microbenchmark using yield():

http://ozlabs.org/~anton/junkcode/context_switch2.c

./context_switch2 --test=yield --fp 0 0

shows an improvement of almost 3% on POWER8.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-12-01 13:52:24 +11:00
Michael Ellerman d38374142b powerpc/kernel: Change the do_syscall_trace_enter() API
The API for calling do_syscall_trace_enter() is currently sensible
enough: it just returns the (modified) syscall number.

However once we enable seccomp filter it will get more complicated. When
seccomp filter runs, the seccomp kernel code (via SECCOMP_RET_ERRNO), or
a ptracer (via SECCOMP_RET_TRACE), may reject the syscall and *may* or may
*not* set a return value in r3.

That means the assembler that calls do_syscall_trace_enter() cannot
blindly return ENOSYS; it needs to return ENOSYS only if a return value
has not already been set.

There is no way to implement that logic with the current API. So change
the do_syscall_trace_enter() API to make it deal with the return code
juggling, and the assembler can then just return whatever return code it
is given.
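
A hedged sketch of the new contract (simplified; syscall_rejected() is a
hypothetical stand-in for the seccomp/ptrace checks):

long do_syscall_trace_enter(struct pt_regs *regs)
{
        if (syscall_rejected(regs))
                return -1;      /* skip the syscall: r3 already holds the
                                 * return value chosen by seccomp/tracer */

        return regs->gpr[0];    /* possibly-modified syscall number */
}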

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
2015-07-29 11:56:11 +10:00
Michael Ellerman c3525940cc powerpc/kernel: Switch to using MAX_ERRNO
Currently on powerpc we have our own #define for the highest (negative)
errno value, called _LAST_ERRNO. This is defined to be 516, for reasons
which are not clear.

The generic code, and x86, use MAX_ERRNO, which is defined to be 4095.

In particular seccomp uses MAX_ERRNO to restrict the value that a
seccomp filter can return.

Currently with the mismatch between _LAST_ERRNO and MAX_ERRNO, a seccomp
tracer wanting to return 600, expecting it to be seen as an error, would
instead find on powerpc that userspace sees a successful syscall with a
return value of 600.

To avoid this inconsistency, switch powerpc to use MAX_ERRNO.
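
The error-detection rule this relies on, sketched in C:

#define MAX_ERRNO       4095

/* A raw syscall return value is an error iff it falls in the top
 * MAX_ERRNO values of the unsigned range, i.e. -4095 .. -1. */
static inline int is_error_value(unsigned long v)
{
        return v >= (unsigned long)-MAX_ERRNO;
}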

We are somewhat confident that generic syscalls that can return a
non-error value above negative MAX_ERRNO have already been updated to
use force_successful_syscall_return().

I have also checked all the powerpc specific syscalls, and believe that
none of them expect to return a non-error value between -MAX_ERRNO and
-516. So this change should be safe ...

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
2015-07-29 11:56:11 +10:00
Sam Bobroff b4b56f9eca powerpc/tm: Abort syscalls in active transactions
This patch changes the syscall handler to doom (tabort) active
transactions when a syscall is made and return very early without
performing the syscall and keeping side effects to a minimum (no CPU
accounting or system call tracing is performed). Also included is a
new HWCAP2 bit, PPC_FEATURE2_HTM_NOSC, to indicate this
behaviour to userspace.
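
Userspace can test for the new behaviour roughly like this (a hedged
sketch; the fallback value for PPC_FEATURE2_HTM_NOSC is my assumption,
check the uapi headers):

#include <stdio.h>
#include <sys/auxv.h>

#ifndef PPC_FEATURE2_HTM_NOSC
#define PPC_FEATURE2_HTM_NOSC   0x01000000      /* assumed value */
#endif

int main(void)
{
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        printf("syscalls abort active transactions: %s\n",
               (hwcap2 & PPC_FEATURE2_HTM_NOSC) ? "yes" : "no");
        return 0;
}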

Currently, the system call instruction automatically suspends an
active transaction which causes side effects to persist when an active
transaction fails.

This does change the kernel's behaviour, but in a way that was
documented as unsupported.  It doesn't reduce functionality as
syscalls will still be performed after tsuspend; it just requires that
the transaction be explicitly suspended.  It also provides a
consistent interface and makes the behaviour of user code
substantially the same across powerpc and platforms that do not
support suspended transactions (e.g. x86 and s390).

Performance measurements using
http://ozlabs.org/~anton/junkcode/null_syscall.c indicate the cost of
a normal (non-aborted) system call increases by about 0.25%.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-19 17:10:28 +10:00
Anshuman Khandual 1db365258a powerpc/kernel: Rename PACA_DSCR to PACA_DSCR_DEFAULT
The PACA_DSCR offset macro tracks the dscr_default element in the paca
structure. Change the name of this macro to match that of the data
element it tracks; it makes the code more readable.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-07 19:29:00 +10:00
Michael Ellerman 68fc378ce3 Revert "powerpc/tm: Abort syscalls in active transactions"
This reverts commit feba40362b.

Although the principle of this change is good, the implementation has a
few issues.

Firstly we can sometimes fail to abort a syscall because r12 may have
been clobbered by C code if we went down the virtual CPU accounting
path, or if syscall tracing was enabled.

Secondly we have decided that it is safer to abort the syscall even
earlier in the syscall entry path, so that we avoid the syscall tracing
path when we are transactional.

So that we have time to thoroughly test those changes we have decided to
revert this for this merge window and will merge the fixed version in
the next window.

NB. Rather than reverting the selftest we just drop tm-syscall from
TEST_PROGS so that it's not run by default.

Fixes: feba40362b ("powerpc/tm: Abort syscalls in active transactions")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-30 15:24:58 +10:00
Sam Bobroff feba40362b powerpc/tm: Abort syscalls in active transactions
This patch changes the syscall handler to doom (tabort) active
transactions when a syscall is made and return immediately without
performing the syscall.

Currently, the system call instruction automatically suspends an
active transaction which causes side effects to persist when an active
transaction fails.

This does change the kernel's behaviour, but in a way that was
documented as unsupported. It doesn't reduce functionality because
syscalls will still be performed after tsuspend. It also provides a
consistent interface and makes the behaviour of user code
substantially the same across powerpc and platforms that do not
support suspended transactions (e.g. x86 and s390).

Performance measurements using
http://ozlabs.org/~anton/junkcode/null_syscall.c
indicate the cost of a system call increases by about 0.5%.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-04-11 20:49:19 +10:00
Michael Ellerman 529d235a0e powerpc: Add a proper syscall for switching endianness
We currently have a "special" syscall for switching endianness. This is
syscall number 0x1ebe, which is handled explicitly in the 64-bit syscall
exception entry.

That has a few problems, firstly the syscall number is outside of the
usual range, which confuses various tools. For example strace doesn't
recognise the syscall at all.

Secondly it's handled explicitly as a special case in the syscall
exception entry, which is complicated enough without it.

As a first step toward removing the special syscall, we need to add a
regular syscall that implements the same functionality.

The logic is simple: it just toggles the MSR_LE bit in the userspace
MSR. This is the same as the special syscall, with the caveat that the
special syscall clobbers fewer registers.

This version clobbers r9-r12, XER, CTR, and CR0-1,5-7.
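
In C terms the syscall body amounts to this conceptual sketch (the
actual implementation is a few instructions of assembly):

long sys_switch_endian(struct pt_regs *regs)
{
        regs->msr ^= MSR_LE;    /* takes effect on return to userspace */
        return 0;
}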

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-28 22:03:40 +11:00
Michael Ellerman a4bcbe6a41 powerpc: Remove old compile time disabled syscall tracing code
We have code to do syscall tracing which is disabled at compile time by
default. It's not been touched since the dawn of time (ie. v2.6.12).

There are now better ways to do syscall tracing, ie. using the
raw_syscall, or syscall tracepoints.

For the specific case of tracing syscalls at boot on a system that
doesn't get to userspace, you can boot with:

  trace_event=syscalls tp_printk=on

Which will trace syscalls from boot, and echo all output to the console.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02 14:51:32 +11:00
Michael Ellerman 4c3b216861 powerpc/kernel: Make syscall_exit a local label
Currently when we back trace something that is in a syscall we see
something like this:

[c000000000000000] [c000000000000000] SyS_read+0x6c/0x110
[c000000000000000] [c000000000000000] syscall_exit+0x0/0x98

Although it's entirely correct, seeing syscall_exit at the bottom can be
confusing - we were exiting from a syscall and then called SyS_read()?

If we instead change syscall_exit to be a local label we get something
more intuitive:

[c0000001fa46fde0] [c00000000026719c] SyS_read+0x6c/0x110
[c0000001fa46fe30] [c000000000009264] system_call+0x38/0xd0

ie. we were handling a system call, and it was SyS_read().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02 14:51:31 +11:00
Michael Ellerman 10ea834364 powerpc: Rename _TIF_SYSCALL_T_OR_A to _TIF_SYSCALL_DOTRACE
Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A
meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or
SECCOMP or TRACEPOINT or NOHZ.

All of those are implemented via syscall_dotrace() so rename the flag to
that to try and clarify things.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-01-23 14:02:51 +11:00
Anton Blanchard b3c18725a0 powerpc/ftrace: simplify prepare_ftrace_return
Instead of passing in the stack address of the link register
to be modified, just pass in the old value and return the
new value and rely on ftrace_graph_caller to do the
modification.

This removes the exception handling around the stack update -
it isn't needed and we weren't consistent about it. Later on
we would do an unprotected modification:

       if (!ftrace_graph_entry(&trace)) {
               *parent = old;
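
A hedged sketch of the simplified interface (want_graph_trace() is a
hypothetical stand-in for the real bookkeeping):

extern void return_to_handler(void);

unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip)
{
        if (want_graph_trace(ip, parent))
                return (unsigned long)&return_to_handler;

        return parent;  /* ftrace_graph_caller stores whatever we
                         * return back into the LR save slot */
}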

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-10 09:59:28 +11:00
Anton Blanchard 7d56c65a6f powerpc/ftrace: Remove mod_return_to_handler
mod_return_to_handler is the same as return_to_handler, except
it handles the change of the TOC (r2). Add this into
return_to_handler and remove mod_return_to_handler.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-10 09:59:27 +11:00
Anton Blanchard 808be31426 powerpc: do_notify_resume can be called with bad thread_info flags argument
Back in 7230c56441 ("powerpc: Rework lazy-interrupt handling") we
added a call out to restore_interrupts() (written in C) before calling
do_notify_resume:

        bl      restore_interrupts
        addi    r3,r1,STACK_FRAME_OVERHEAD
        bl      do_notify_resume

Unfortunately do_notify_resume takes two arguments, the second one
being the thread_info flags:

void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)

We do populate r4 (the second argument) earlier, but
restore_interrupts() is free to muck it up all it wants. My guess is
the gcc compiler gods shone down on us and its register allocator
never used r4. Sometimes, rarely, luck is on our side.

LLVM on the other hand did trample r4.
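
Rendered illustratively in C (hypothetical helpers; a C compiler
re-materialises values across calls automatically, which is exactly what
the hand-written asm forgot to do):

static void notify_resume_path(struct pt_regs *regs)
{
        unsigned long ti_flags = read_ti_flags();       /* kept in r4 */

        restore_interrupts();           /* may clobber r4: caller-saved */
        ti_flags = read_ti_flags();     /* the fix: refetch after the call */
        do_notify_resume(regs, ti_flags);
}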

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-31 16:52:46 +11:00
Mahesh Salgaonkar 0869b6fd20 powerpc/book3s: Add basic infrastructure to handle HMI in Linux.
Handle Hypervisor Maintenance Interrupt (HMI) in Linux. This patch implements
basic infrastructure to handle HMI in the Linux host. The design is to invoke
the OPAL HMI handler in real mode for recovery and set irq_pending when we hit
an HMI. During check_irq_replay, pull the OPAL HMI event and print the HMI
info on the console.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-05 16:33:48 +10:00
Michael Ellerman 13b3d13b81 powerpc: Remove MMU_FTR_SLB
We now only support cpus that use an SLB, so we don't need an MMU
feature to indicate that.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28 14:10:23 +10:00
Sam Bobroff 96d0161086 powerpc: Correct DSCR during TM context switch
Correct the DSCR SPR becoming temporarily corrupted if a task is
context switched during a transaction.

The problem occurs while suspending the task and is caused by saving
the DSCR to thread.dscr after it has already been set to the CPU's
default value:

__switch_to() calls __switch_to_tm()
	which calls tm_reclaim_task()
	which calls tm_reclaim_thread()
	which calls tm_reclaim()
		where the DSCR is set to the CPU's default
__switch_to() calls _switch()
		where thread.dscr is set to the DSCR

When the task is resumed, its transaction will be doomed (as usual)
and the DSCR SPR will be corrupted, although the checkpointed value
will be correct. Therefore the DSCR will be immediately corrected by
the transaction aborting, unless it has been suspended. In that case
the incorrect value can be seen by the task until it resumes the
transaction.

The fix is to treat the DSCR similarly to the TAR and save it early
in __switch_to().

A program exposing the problem is added to the kernel self tests as:
tools/testing/selftests/powerpc/tm/tm-resched-dscr.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
CC: <stable@vger.kernel.org> [v3.10+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-06-11 17:02:56 +10:00
Sam Bobroff 1739ea9e13 powerpc: Fix regression of per-CPU DSCR setting
Since commit "efcac65 powerpc: Per process DSCR + some fixes (try#4)"
it is no longer possible to set the DSCR on a per-CPU basis.

The old behaviour was to manipulate the DSCR SPR directly but this is no
longer sufficient: the value is quickly overwritten by context switching.

This patch stores the per-CPU DSCR value in a kernel variable rather than
directly in the SPR and it is used whenever a process has not set the DSCR
itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.

Writes to the old global default (/sys/devices/system/cpu/dscr_default)
now set all of the per-CPU values and reads return the last written value.

The new per-CPU default is added to the paca_struct and is used everywhere
outside of sysfs.c instead of the old global default.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-28 13:35:40 +10:00
Anton Blanchard 5e66684fe4 powerpc: ftrace_caller, _mcount is exported to modules so needs _GLOBAL_TOC()
When testing the ftrace function tracer, I realised that ftrace_caller
and mcount are called from modules and they both call into C, therefore
they need the ABIv2 global entry point to establish r2.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:33 +10:00
Anton Blanchard 7cedd6014b powerpc: Fix kernel thread creation on ABIv2
Change how we set up registers for ret_from_kernel_thread. In
ABIv1, instead of passing a function descriptor in, dereference
it and pass the target in directly.

Use ppc_global_function_entry to get it right on both ABIv1 and ABIv2.
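
For reference, the shape of an ABIv1 (ELFv1) function descriptor; on
ABIv2 a function pointer is simply the entry address:

struct func_desc {
        unsigned long entry;    /* text address of the function */
        unsigned long toc;      /* TOC (r2) value for the function */
        unsigned long env;      /* environment pointer, unused by C */
};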

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:23 +10:00
Anton Blanchard cc7efbf919 powerpc: ABIv2 function calls must place target address in r12
To establish addressability quickly, ABIv2 requires the target
address of the function being called to be in r12. Fix a number of
places in assembly code where we do indirect function calls.

We need to avoid function descriptors on ABIv2 too.
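
A hedged illustration of the rule (not kernel code; the clobber list is
trimmed for brevity):

static void call_via_r12(void (*fn)(void))
{
        /* Bind the target to r12 so the callee's global entry point
         * can derive its TOC pointer from it. */
        register void (*r12)(void) asm("r12") = fn;

        asm volatile("mtctr %0\n\tbctrl"
                     : : "r" (r12)
                     : "lr", "ctr", "memory");
}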

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:20 +10:00
Anton Blanchard c857c43b34 powerpc: Don't use a function descriptor for system call table
There is no need to create a function descriptor for the system call
table. By using one we force the system call table into the text
section, when it really belongs in the rodata section.

This also removes another use of dot symbols.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:17 +10:00
Anton Blanchard ad0289e4ac powerpc: Remove superfluous function descriptors in assembly only code
We have a number of places where we load the text address of a local
function and indirectly branch to it in assembly. Since it is an
indirect branch, binutils will not know to use the function's text
address, so that trick won't work.

There is no need for these functions to have a function descriptor
so we can replace it with a label and remove the dot symbol.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:17 +10:00
Anton Blanchard b1576fec7f powerpc: No need to use dot symbols when branching to a function
binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the function's text address.

Alan tells me that binutils has been doing this for 9 years.

Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23 10:05:16 +10:00
Paul Mackerras d31626f70b powerpc: Don't corrupt transactional state when using FP/VMX in kernel
Currently, when we have a process using the transactional memory
facilities on POWER8 (that is, the processor is in transactional
or suspended state), and the process enters the kernel and the
kernel then uses the floating-point or vector (VMX/Altivec) facility,
we end up corrupting the user-visible FP/VMX/VSX state.  This
happens, for example, if a page fault causes a copy-on-write
operation, because the copy_page function will use VMX to do the
copy on POWER8.  The test program below demonstrates the bug.

The bug happens because when FP/VMX state for a transactional process
is stored in the thread_struct, we store the checkpointed state in
.fp_state/.vr_state and the transactional (current) state in
.transact_fp/.transact_vr.  However, when the kernel wants to use
FP/VMX, it calls enable_kernel_fp() or enable_kernel_altivec(),
which saves the current state in .fp_state/.vr_state.  Furthermore,
when we return to the user process we return with FP/VMX/VSX
disabled.  The next time the process uses FP/VMX/VSX, we don't know
which set of state (the current register values, .fp_state/.vr_state,
or .transact_fp/.transact_vr) we should be using, since we have no
way to tell if we are still in the same transaction, and if not,
whether the previous transaction succeeded or failed.

Thus it is necessary to strictly adhere to the rule that if FP has
been enabled at any point in a transaction, we must keep FP enabled
for the user process with the current transactional state in the
FP registers, until we detect that it is no longer in a transaction.
Similarly for VMX; once enabled it must stay enabled until the
process is no longer transactional.

In order to keep this rule, we add a new thread_info flag which we
test when returning from the kernel to userspace, called TIF_RESTORE_TM.
This flag indicates that there is FP/VMX/VSX state to be restored
before entering userspace, and when it is set the .tm_orig_msr field
in the thread_struct indicates what state needs to be restored.
The restoration is done by restore_tm_state().  The TIF_RESTORE_TM
bit is set by new giveup_fpu/altivec_maybe_transactional helpers,
which are called from enable_kernel_fp/altivec, giveup_vsx, and
flush_fp/altivec_to_thread instead of giveup_fpu/altivec.

The other thing to be done is to get the transactional FP/VMX/VSX
state from .fp_state/.vr_state when doing reclaim, if that state
has been saved there by giveup_fpu/altivec_maybe_transactional.
Having done this, we set the FP/VMX bit in the thread's MSR after
reclaim to indicate that that part of the state is now valid
(having been reclaimed from the processor's checkpointed state).

Finally, in the signal handling code, we move the clearing of the
transactional state bits in the thread's MSR a bit earlier, before
calling flush_fp_to_thread(), so that we don't unnecessarily set
the TIF_RESTORE_TM bit.

This is the test program:

/* Michael Neuling 4/12/2013
 *
 * See if the altivec state is leaked out of an aborted transaction due to
 * kernel vmx copy loops.
 *
 *   gcc -m64 htm_vmxcopy.c -o htm_vmxcopy
 *
 */

/* We don't use all of these, but for reference: */

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* TM instructions spelled out as .long so that older assemblers
 * cope. Encodings: tbegin., tsr. 0, tsr. 1, tabort., tend. */
#define TBEGIN          ".long 0x7C00051D ;"
#define TSUSPEND        ".long 0x7C0005DD ;"
#define TRESUME         ".long 0x7C2005DD ;"
#define TABORT          ".long 0x7C00071D ;"
#define TEND            ".long 0x7C00055D ;"

int main(int argc, char *argv[])
{
	long double vecin = 1.3;
	long double vecout;
	unsigned long pgsize = getpagesize();
	int i;
	int fd;
	int size = pgsize*16;
	char tmpfile[] = "/tmp/page_faultXXXXXX";
	char buf[pgsize];
	char *a;
	uint64_t aborted = 0;

	fd = mkstemp(tmpfile);
	assert(fd >= 0);

	memset(buf, 0, pgsize);
	for (i = 0; i < size; i += pgsize)
		assert(write(fd, buf, pgsize) == pgsize);

	unlink(tmpfile);

	a = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
	assert(a != MAP_FAILED);

	asm __volatile__(
		"lxvd2x 40,0,%[vecinptr] ; " // set 40 to initial value
		TBEGIN
		"beq	3f ;"
		TSUSPEND
		"xxlxor 40,40,40 ; " // set 40 to 0
		"std	5, 0(%[map]) ;" // cause kernel vmx copy page
		TABORT
		TRESUME
		TEND
		"li	%[res], 0 ;"
		"b	5f ;"
		"3: ;" // Abort handler
		"li	%[res], 1 ;"
		"5: ;"
		"stxvd2x 40,0,%[vecoutptr] ; "
		: [res]"=r"(aborted)
		: [vecinptr]"r"(&vecin),
		  [vecoutptr]"r"(&vecout),
		  [map]"r"(a)
		: "memory", "r0", "r3", "r4", "r5", "r6", "r7");

	if (aborted && (vecin != vecout)){
		printf("FAILED: vector state leaked on abort %f != %f\n",
		       (double)vecin, (double)vecout);
		exit(1);
	}

	munmap(a, size);

	close(fd);

	printf("PASSED!\n");
	return 0;
}

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-15 13:59:11 +11:00
Mahesh Salgaonkar 30c826358d Move processing of MCE queued event out from syscall exit path.
Hugh Dickins reported an issue that b5ff4211a8
"powerpc/book3s: Queue up and process delayed MCE events" breaks the
PowerMac G5 boot. This patch fixes it by moving the MCE event processing
away from syscall exit, which was the wrong place for it in the first
place, and using the irq_work framework to delay processing of the MCE
event.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-15 13:58:59 +11:00
Mahesh Salgaonkar b5ff4211a8 powerpc/book3s: Queue up and process delayed MCE events.
When the machine check real mode handler cannot continue into the host
kernel in virtual mode, it returns from the interrupt and we lose the MCE
event, which never gets logged. In such a situation, queue up the MCE event
so that we can log it later when we get back into the host kernel with r1
pointing to the kernel stack, e.g. during syscall exit.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-05 16:05:21 +11:00
Benjamin Herrenschmidt 0c4888ef1d powerpc: Fix fatal SLB miss when restoring PPR
When restoring the PPR value, we incorrectly access the thread structure
at a time when MSR:RI is clear, which means we cannot recover from nested
faults. However the thread structure isn't covered by the "bolted" SLB
entries and thus accessing it can fault.

This fixes it by splitting the code so that the PPR value is loaded into
a GPR before MSR:RI is cleared.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-11-06 14:13:53 +11:00
Kevin Hao 0edfdd10f5 powerpc/ppc64: Remove the unneeded load of ti_flags in resume_kernel
We already have the values of current_thread_info and ti_flags stored
in r9 and r4 respectively before jumping to resume_kernel, so there is
no reason to reload them again.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 16:49:25 +11:00
Benjamin Herrenschmidt 5c0484e25e powerpc: Endian safe trampoline
Create a trampoline that works in either endian and flips to
the expected endian. Use it for primary and secondary thread
entry as well as RTAS and OF call return.

Credit for finding the magic instruction goes to Paul Mackerras
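
A hedged reconstruction of the trick: 0x08000048 decodes as
tdi 0,0,0x48 (a never-trapping no-op) in the expected endian, while the
same bytes read in the opposite endian are 0x48000008, i.e. b .+8, so a
wrong-endian CPU branches into the byte-flipping sequence while a
right-endian one falls straight through:

#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint32_t magic = 0x08000048;    /* tdi 0,0,0x48: never traps */

        /* byte-reversed, the same word is a branch over two instructions */
        assert(__builtin_bswap32(magic) == 0x48000008); /* b .+8 */
        return 0;
}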

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 16:48:34 +11:00
Michael Neuling bc683a7e51 powerpc: Cleanup handling of the DSCR bit in the FSCR register
As suggested by paulus we can simplify the Data Stream Control Register
(DSCR) Facility Status and Control Register (FSCR) handling.

Firstly, we simplify the asm by using a rldimi.

Secondly, we now use the FSCR only to control the DSCR facility, rather
than both the FSCR and HFSCR.  Users will see no functional change from
this but will get a minor speedup as they will trap into the kernel only
once (rather than twice) when they first touch the DSCR.  Also, this
change removes a bunch of ugly FTR_SECTION code.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-27 15:05:22 +10:00
Benjamin Herrenschmidt 3f1f431188 Merge branch 'merge' into next
Merge stuff that already went into Linus via "merge" which
are pre-reqs for subsequent patches
2013-08-27 15:03:30 +10:00
Anton Blanchard 7ffcf8ec26 powerpc: Fix little endian lppaca, slb_shadow and dtl_entry
The lppaca, slb_shadow and dtl_entry hypervisor structures are
big endian, so we have to byte swap them in little endian builds.

LE KVM hosts will also need to be fixed but for now add an #error
to remind us.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 15:33:35 +10:00
Tiejun Chen de021bb79c powerpc/ppc64: Rename SOFT_DISABLE_INTS with RECONCILE_IRQ_STATE
The SOFT_DISABLE_INTS seems an odd name for something that updates the
software state to be consistent with interrupts being hard disabled, so
rename SOFT_DISABLE_INTS with RECONCILE_IRQ_STATE to avoid this confusion.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 14:57:47 +10:00
Michael Neuling c2d52644e2 powerpc: Save the TAR register earlier
This moves the saving of the Target Address Register (TAR) earlier in
__switch_to().  It introduces a new function save_tar() to do this.

We need to save the TAR earlier as we will overwrite it in the transactional
memory reclaim/recheckpoint path.  We are going to do this in a subsequent
patch which will fix saving the TAR register when it's modified inside a
transaction.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-09 18:07:08 +10:00
Michael Neuling 2517617e0d powerpc: Fix context switch DSCR on POWER8
POWER8 allows the DSCR to be accessed directly from userspace via a new SPR
number, 0x3 (rather than 0x11; DSCR SPR number 0x11 is still used on POWER8
but, like POWER7, it is only accessible in HV and OS modes).  Currently, we
allow this by setting the H/FSCR DSCR bit on boot.

Unfortunately this doesn't work, as the kernel needs to see the DSCR change so
that it knows to no longer restore the system wide version of DSCR on context
switch (ie. to set thread.dscr_inherit).

This clears the H/FSCR DSCR bit initially.  If a process then accesses the DSCR
(via SPR 0x3), it'll trap into the kernel where we set thread.dscr_inherit in
facility_unavailable_exception().

We also change _switch() so that we set or clear the H/FSCR DSCR bit based on
the thread.dscr_inherit.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-09 18:07:05 +10:00
Bharat Bhushan 13d543cd79 powerpc: Restore dbcr0 on user space exit
On BookE, (Branch Taken + Single Step) is the same as Branch Taken
on BookS, and in Linux we simulate the BookS behavior for BookE as well.
When doing so, in Branch Taken handling we want to set DBCR0_IC, but
we update current->thread.dbcr0 and not the DBCR0 register.

Now, on 64-bit, current->thread.dbcr0 (and the other debug registers)
is synchronized ONLY in the context switch flow. So after handling
Branch Taken in the debug exception, if we return to user space
without a context switch, the single stepping change (DBCR0_ICMP)
does not get written to the hardware DBCR0 and the Instruction
Complete exception does not happen.

This fixes using ptrace reliably on BookE PowerPC.

lmbench latency test (lat_syscall) results (they vary a little
on each run):

1) ./lat_syscall <action> /dev/shm/uImage

action:	Open	read	write	stat	fstat	null
Before:	3.8618	0.2017	0.2851	1.6789	0.2256	0.0856
After:	3.8580	0.2017	0.2851	1.6955	0.2255	0.0856

2) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
action:	Open	read	write	stat	fstat	null
Before:	4.1388	0.2238	0.3066	1.7106	0.2256	0.0856
After:	4.1413	0.2236	0.3062	1.7107	0.2256	0.0856

[ Slightly modified to avoid extra branch in the fast path
  on Book3S and fix build on all non-BookE 64-bit -- BenH
]

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 17:04:19 +10:00
Michael Ellerman b11ae95100 powerpc: Partial revert of "Context switch more PMU related SPRs"
In commit 59affcd I added context switching of more PMU SPRs, because
they are potentially exposed to userspace on Power8. However despite me
being a smart arse in the commit message it's actually not correct. In
particular it interacts badly with a global perf record.

We will have to do something more complicated, but that will have to
wait for 3.11.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-10 08:36:35 +10:00
Michael Neuling a515348fc6 powerpc/pseries: Kill all prefetch streams on context switch
On context switch, we should have no prefetch streams leak from one
userspace process to another.  This frees up prefetch resources for the
next process.

Based on patch from Milton Miller.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01 08:29:25 +10:00
Michael Ellerman 59affcd3e4 powerpc: Context switch more PMU related SPRs
In commit 9353374 "Context switch the new EBB SPRs" we added support for
context switching some new EBB SPRs. However despite four of us signing
off on that patch we missed some. To be fair these are not actually new
SPRs, but they are now potentially user accessible so need to be context
switched.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-24 18:13:45 +10:00
Linus Torvalds a2c7a54fcc Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes from Benjamin Herrenschmidt:
 "This is mostly bug fixes (some of them regressions, some of them I
  deemed worth merging now) along with some patches from Li Zhong
  hooking up the new context tracking stuff (for the new full NO_HZ)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
  powerpc: Set show_unhandled_signals to 1 by default
  powerpc/perf: Fix setting of "to" addresses for BHRB
  powerpc/pmu: Fix order of interpreting BHRB target entries
  powerpc/perf: Move BHRB code into CONFIG_PPC64 region
  powerpc: select HAVE_CONTEXT_TRACKING for pSeries
  powerpc: Use the new schedule_user API on userspace preemption
  powerpc: Exit user context on notify resume
  powerpc: Exception hooks for context tracking subsystem
  powerpc: Syscall hooks for context tracking subsystem
  powerpc/booke64: Fix kernel hangs at kernel_dbg_exc
  powerpc: Fix irq_set_affinity() return values
  powerpc: Provide __bswapdi2
  powerpc/powernv: Fix starting of secondary CPUs on OPALv2 and v3
  powerpc/powernv: Detect OPAL v3 API version
  powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again
  powerpc: Make CONFIG_RTAS_PROC depend on CONFIG_PROC_FS
  powerpc: Bring all threads online prior to migration/hibernation
  powerpc/rtas_flash: Fix validate_flash buffer overflow issue
  powerpc/kexec: Fix kexec when using VMX optimised memcpy
  powerpc: Fix build errors STRICT_MM_TYPECHECKS
  ...
2013-05-14 07:43:11 -07:00
Li Zhong 5d1c574511 powerpc: Use the new schedule_user API on userspace preemption
This patch corresponds to
[PATCH] x86: Use the new schedule_user API on userspace preemption
  commit 0430499ce9

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14 16:00:20 +10:00
Li Zhong af945cf4bf powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning again
Saw this warning again, and this time from the ret_from_fork path.

It seems we could clear the back chain earlier in copy_thread(), which
would cover both paths, and also fix potential lockdep usage in
schedule_tail(), or an exception occurring before we clear the back chain.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14 14:36:35 +10:00
Linus Torvalds c4cc75c332 Merge git://git.infradead.org/users/eparis/audit
Pull audit changes from Eric Paris:
 "Al used to send pull requests every couple of years but he told me to
  just start pushing them to you directly.

  Our touching outside of core audit code is pretty straight forward.  A
  couple of interface changes which hit net/.  A simple argument bug
  calling audit functions in namei.c and the removal of some assembly
  branch prediction code on ppc"

* git://git.infradead.org/users/eparis/audit: (31 commits)
  audit: fix message spacing printing auid
  Revert "audit: move kaudit thread start from auditd registration to kaudit init"
  audit: vfs: fix audit_inode call in O_CREAT case of do_last
  audit: Make testing for a valid loginuid explicit.
  audit: fix event coverage of AUDIT_ANOM_LINK
  audit: use spin_lock in audit_receive_msg to process tty logging
  audit: do not needlessly take a lock in tty_audit_exit
  audit: do not needlessly take a spinlock in copy_signal
  audit: add an option to control logging of passwords with pam_tty_audit
  audit: use spin_lock_irqsave/restore in audit tty code
  helper for some session id stuff
  audit: use a consistent audit helper to log lsm information
  audit: push loginuid and sessionid processing down
  audit: stop pushing loginid, uid, sessionid as arguments
  audit: remove the old depricated kernel interface
  audit: make validity checking generic
  audit: allow checking the type of audit message in the user filter
  audit: fix build break when AUDIT_DEBUG == 2
  audit: remove duplicate export of audit_enabled
  Audit: do not print error when LSMs disabled
  ...
2013-05-11 14:29:11 -07:00
Michael Ellerman 9353374b8e powerpc: Context switch the new EBB SPRs
This context switches the new Event Based Branching (EBB) SPRs.  The three new
SPRs are:
  - Event Based Branch Handler Register (EBBHR)
  - Event Based Branch Return Register (EBBRR)
  - Branch Event Status and Control Register (BESCR)

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02 10:37:36 +10:00
Michael Ellerman 1de2bd4e0c powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
We are getting low on cpu feature bits. So rather than add a separate bit for
every new Power8 feature, add a bit for the arch 2.07 server category and
use that instead.

Hijack the value we had for BCTAR, but swap the value with CFAR so that all the
ARCH defines are together.

Note we don't touch CPU_FTR_TM, because it is conditionally enabled if
the kernel is built with TM support.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-02 10:35:16 +10:00
Kevin Hao d8b9229240 powerpc: add a missing label in resume_kernel
A label 0 was missed in the patch a9c4e541 (powerpc/kprobe: Complete
kprobe and migrate exception frame). This will cause the kernel to
branch to an undetermined address if there really is a conflict when
updating the thread flags.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Acked-By: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
2013-04-15 17:29:48 +10:00
Alistair Popple 05e38e5d5d powerpc: Fix audit crash due to save/restore PPR changes
The current mainline crashes when hitting userspace with the following:

kernel BUG at kernel/auditsc.c:1769!
cpu 0x1: Vector: 700 (Program Check) at [c000000023883a60]
    pc: c0000000001047a8: .__audit_syscall_entry+0x38/0x130
    lr: c00000000000ed64: .do_syscall_trace_enter+0xc4/0x270
    sp: c000000023883ce0
   msr: 8000000000029032
  current = 0xc000000023800000
  paca    = 0xc00000000f080380   softe: 0        irq_happened: 0x01
    pid   = 1629, comm = start_udev
kernel BUG at kernel/auditsc.c:1769!
enter ? for help
[c000000023883d80] c00000000000ed64 .do_syscall_trace_enter+0xc4/0x270
[c000000023883e30] c000000000009b08 syscall_dotrace+0xc/0x38
 --- Exception: c00 (System Call) at 0000008010ec50dc

Bisecting found the following patch caused it:

commit 44e9309f1f
Author: Haren Myneni <haren@linux.vnet.ibm.com>
powerpc: Implement PPR save/restore

It was found that this patch corrupted r9 when calling
SET_DEFAULT_THREAD_PPR().

Using r10 as a scratch register instead of r9 solved the problem.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
2013-04-15 17:29:45 +10:00
Anton Blanchard 2540334adc powerpc: Remove static branch prediction in 64bit traced syscall path
Some distros enable auditing by default which forces us through the
syscall trace path. Remove the static branch prediction in our 64bit
syscall handler and let the hardware do the prediction.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-10 12:49:20 -04:00
Linus Torvalds 9d3cae26ac Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "So from the depth of frozen Minnesota, here's the powerpc pull request
  for 3.9.  It has a few interesting highlights, in addition to the
  usual bunch of bug fixes, minor updates, embedded device tree updates
  and new boards:

   - Hand tuned asm implementation of SHA1 (by Paulus & Michael
     Ellerman)

   - Support for Doorbell interrupts on Power8 (kind of fast
     thread-thread IPIs) by Ian Munsie

   - Long overdue cleanup of the way we handle relocation of our open
     firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard

   - Support for saving/restoring & context switching the PPR (Processor
     Priority Register) on server processors that support it.  This
     allows the kernel to preserve thread priorities established by
     userspace.  By Haren Myneni.

   - DAWR (new watchpoint facility) support on Power8 by Michael Neuling

   - Ability to change the DSCR (Data Stream Control Register) which
     controls cache prefetching on a running process via ptrace by
     Alexey Kardashevskiy

   - Support for context switching the TAR register on Power8 (new
     branch target register meant to be used by some new specific
     userspace perf event interrupt facility which is yet to be enabled)
     by Ian Munsie.

   - Improve preservation of the CFAR register (which captures the
     origin of a branch) on various exception conditions by Paulus.

   - Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
     it belongs by Philippe De Muyter

   - Support for Transactional Memory on Power8 by Michael Neuling
     (based on original work by Matt Evans).  For those curious about
     the feature, the patch contains a pretty good description."

(See commit db8ff907027b: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
  powerpc/kexec: Disable hard IRQ before kexec
  powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
  powerpc/85xx: bsc9131 - Correct typo in SDHC device node
  powerpc/e500/qemu-e500: enable coreint
  powerpc/mpic: allow coreint to be determined by MPIC version
  powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
  powerpc/85xx: Board support for ppa8548
  powerpc/fsl: remove extraneous DIU platform functions
  arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
  powerpc: Documentation for transactional memory on powerpc
  powerpc: Add transactional memory to pseries and ppc64 defconfigs
  powerpc: Add config option for transactional memory
  powerpc: Add transactional memory to POWER8 cpu features
  powerpc: Add new transactional memory state to the signal context
  powerpc: Hook in new transactional memory code
  powerpc: Routines for FP/VSX/VMX unavailable during a transaction
  powerpc: Add transactional memory unavaliable execption handler
  powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
  powerpc: Add FP/VSX and VMX register load functions for transactional memory
  powerpc: Add helper functions for transactional memory context switching
  ...
2013-02-23 17:09:55 -08:00
Linus Torvalds d652e1eb8e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Main changes:

   - scheduler side full-dynticks (user-space execution is undisturbed
     and receives no timer IRQs) preparation changes that convert the
     cputime accounting code to be full-dynticks ready, from Frederic
     Weisbecker.

   - Initial sched.h split-up changes, by Clark Williams

   - select_idle_sibling() performance improvement by Mike Galbraith:

        " 1 tbench pair (worst case) in a 10 core + SMT package:

          pre   15.22 MB/sec 1 procs
          post 252.01 MB/sec 1 procs "

  - sched_rr_get_interval() ABI fix/change.  We think this detail is not
    used by apps (so it's not an ABI in practice), but lets keep it
    under observation.

  - misc RT scheduling cleanups, optimizations"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
  cputime: Remove irqsave from seqlock readers
  sched, powerpc: Fix sched.h split-up build failure
  cputime: Restore CPU_ACCOUNTING config defaults for PPC64
  sched/rt: Move rt specific bits into new header file
  sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
  sched: Move sched.h sysctl bits into separate header
  sched: Fix signedness bug in yield_to()
  sched: Fix select_idle_sibling() bouncing cow syndrome
  sched/rt: Further simplify pick_rt_task()
  sched/rt: Do not account zero delta_exec in update_curr_rt()
  cputime: Safely read cputime of full dynticks CPUs
  kvm: Prepare to add generic guest entry/exit callbacks
  cputime: Use accessors to read task cputime stats
  cputime: Allow dynamic switch between tick/virtual based cputime accounting
  cputime: Generic on-demand virtual cputime accounting
  cputime: Move default nsecs_to_cputime() to jiffies based cputime file
  cputime: Librarize per nsecs resolution cputime definitions
  cputime: Avoid multiplication overflow on utime scaling
  context_tracking: Export context state for generic vtime
  ...

Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-19 18:19:48 -08:00
Michael Neuling afc07701ce powerpc: Add transactional memory paca scratch register to show_regs
Add transactional memory paca scratch register to show_regs.  This is useful
for debugging.

Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-15 16:58:51 +11:00
Ian Munsie 2468dcf641 powerpc: Add support for context switching the TAR register
This patch adds support for enabling and context switching the Target
Address Register in Power8. The TAR is a new special purpose register
that can be used for computed branches with the bctar[l] (branch
conditional to TAR) instruction in the same manner as the count and link
registers.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-08 14:05:50 +11:00
Benjamin Herrenschmidt dfd0436ad0 Merge branch 'merge' into next
Merge "merge" branch to bring in various bug fixes that are
going into 3.8
2013-01-29 11:33:37 +11:00
Tiejun Chen 572177d7c7 powerpc/book3e: Disable interrupt after preempt_schedule_irq
In the preempt case, the current arch_local_irq_restore() called from
preempt_schedule_irq() may enable hard interrupts, but we really
should disable interrupts when we return from the interrupt, so that
we don't get interrupted after loading SRR0/1.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-29 10:18:15 +11:00
Frederic Weisbecker abf917cd91 cputime: Generic on-demand virtual cputime accounting
If we want to stop the tick beyond idle, we need to be
able to account the cputime without using the tick.

Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.

However implementing CONFIG_VIRT_CPU_ACCOUNTING requires
low level hooks and involves more overhead. But we already
have a generic context tracking subsystem that is required
for RCU needs by archs which plan to shut down the tick
outside idle.

This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.

There are some upsides of doing this:

- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).

- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.

And one downside:

- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:27 +01:00
Haren Myneni 44e9309f1f powerpc: Implement PPR save/restore
When the task enters kernel space, the user defined priority (PPR)
will be saved into the PACA at the beginning of the first level exception
vector and then copied from the PACA to the thread_info in the second level
vector. The PPR will be restored from the thread_info before the task exits
kernel space.

P7/P8 temporarily raise the thread priority to a higher level during an
exception until the program executes HMT_* calls, but they will not modify
the PPR register. So we save the PPR value whenever some register is
available to use and then call HMT_MEDIUM to increase the priority. This
feature is supported on P7 and later processors.

We save/restore the PPR for all exception vectors except system call
entry; glibc saves/restores it for system calls. So the default PPR
value (3) is set on system call exit when the task returns
to user space.
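
For illustration only (not the patch's assembly), a C sketch of the
save/restore pair, assuming SPR number 896 for the PPR:

  /* sketch: stash the user PPR on entry, put it back before returning */
  #define SPRN_PPR 896

  static inline unsigned long save_user_ppr(void)
  {
          unsigned long ppr;
          /* read the user's priority before HMT_MEDIUM bumps it */
          asm volatile("mfspr %0,%1" : "=r" (ppr) : "i" (SPRN_PPR));
          return ppr;
  }

  static inline void restore_user_ppr(unsigned long ppr)
  {
          /* restore the user-defined priority just before the exit */
          asm volatile("mtspr %0,%1" : : "i" (SPRN_PPR), "r" (ppr));
  }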

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 17:01:13 +11:00
Haren Myneni 5d75b26443 powerpc: Move branch instruction from ACCOUNT_CPU_USER_ENTRY to caller
[PATCH 1/6] powerpc: Move branch instruction from ACCOUNT_CPU_USER_ENTRY to caller

The first instruction in ACCOUNT_CPU_USER_ENTRY is a 'beq' which checks for
exceptions coming from kernel mode. The PPR value is saved immediately after
ACCOUNT_CPU_USER_ENTRY and likewise only applies to user level exceptions,
so this branch instruction has been moved into the caller code.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 17:00:59 +11:00
Ian Munsie fe9e1d54e3 powerpc: Add code to handle soft-disabled doorbells on server
This patch adds the logic to properly handle doorbells that come in when
interrupts have been soft disabled and to replay them when interrupts
are re-enabled:

- masked_##_H##interrupt is modified to leave interrupts enabled when a
  doorbell has come in since doorbells are edge sensitive and as such
  won't be automatically re-raised.

- __check_irq_replay now tests if a doorbell happened on book3s, and
  returns either 0xe80 or 0xa00 depending on whether we are the
  hypervisor or not.

- restore_check_irq_replay now tests for the two possible server
  doorbell vector numbers to replay.

- __replay_interrupt also adds tests for the two server doorbell vector
  numbers, and is modified to use a compare instruction rather than an
  andi. on the single bit difference between 0x500 and 0x900.

The last two use a CPU feature section to avoid needlessly testing
against the hypervisor vector if it is not the hypervisor, and vice
versa.
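
A hedged C rendering of the vector selection (the PACA flag value is
illustrative):

  /* sketch: pick the doorbell vector to replay once irqs come back on */
  #define PACA_IRQ_DBELL 0x02  /* value illustrative */

  unsigned int doorbell_replay_vector(unsigned char irq_happened, int is_hv)
  {
          if (!(irq_happened & PACA_IRQ_DBELL))
                  return 0;                /* no doorbell pending */
          return is_hv ? 0xe80 : 0xa00;    /* hypervisor vs. server doorbell */
  }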

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 15:09:07 +11:00
Linus Torvalds 16e024f30c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc update from Benjamin Herrenschmidt:
 "The main highlight is probably some base POWER8 support.  There's more
  to come such as transactional memory support but that will wait for
  the next one.

  Overall it's pretty quiet, or rather I've been pretty poor at picking
  things up from patchwork and reviewing them this time around and Kumar
  no better on the FSL side it seems..."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
  powerpc+of: Rename and fix OF reconfig notifier error inject module
  powerpc: mpc5200: Add a3m071 board support
  powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
  powerpc/mpc52xx: use module_platform_driver macro
  powerpc+of: Export of_reconfig_notifier_[register,unregister]
  powerpc/dma/raidengine: add raidengine device
  powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
  powerpc/mpc85xx: Change spin table to cached memory
  powerpc/fsl-pci: Add PCI controller ATMU PM support
  powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
  drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
  powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
  powerpc: Disable relocation on exceptions when kexecing
  powerpc: Enable relocation on during exceptions at boot
  powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
  powerpc: Add wrappers to enable/disable relocation on exceptions
  powerpc: Add set_mode hcall
  powerpc: Setup relocation on exceptions for bare metal systems
  powerpc: Move initial mfspr LPCR out of __init_LPCR
  powerpc: Add relocation on exception vector handlers
  ...
2012-12-18 09:58:09 -08:00
Li Zhong 12660b1702 powerpc: Fix MAX_STACK_TRACE_ENTRIES too low warning !
This patch tries to fix the following BUG report:

[    0.012313] BUG: MAX_STACK_TRACE_ENTRIES too low!
[    0.012318] turning off the locking correctness validator.
[    0.012321] Call Trace:
[    0.012330] [c00000017666f6d0] [c000000000012128] .show_stack+0x78/0x184 (unreliable)
[    0.012339] [c00000017666f780] [c0000000000b6348] .save_trace+0x12c/0x14c
[    0.012345] [c00000017666f800] [c0000000000b7448] .mark_lock+0x2bc/0x710
[    0.012351] [c00000017666f8b0] [c0000000000bb198] .__lock_acquire+0x748/0xaec
[    0.012357] [c00000017666f9b0] [c0000000000bb684] .lock_acquire+0x148/0x194
[    0.012365] [c00000017666fa80] [c00000000069371c] .mutex_lock_nested+0x84/0x4ec
[    0.012372] [c00000017666fb90] [c000000000096998] .smpboot_register_percpu_thread+0x3c/0x10c
[    0.012380] [c00000017666fc30] [c0000000009ba910] .spawn_ksoftirqd+0x28/0x48
[    0.012386] [c00000017666fcb0] [c00000000000a98c] .do_one_initcall+0xd8/0x1d0
[    0.012392] [c00000017666fd60] [c00000000000b1f8] .kernel_init+0x120/0x398

[    0.012398] [c00000017666fe30] [c000000000009ad4] .ret_from_kernel_thread+0x5c/0x64
[    0.012404] [c00000017666fa00] [c00000017666fb20] 0xc00000017666fb20
[    0.012410] [c00000017666fa80] [c00000000069371c] .mutex_lock_nested+0x84/0x4ec
[    0.012416] [c00000017666fb90] [c000000000096998] .smpboot_register_percpu_thread+0x3c/0x10c
[    0.012422] [c00000017666fc30] [c0000000009ba910] .spawn_ksoftirqd+0x28/0x48
[    0.012427] [c00000017666fcb0] [c00000000000a98c] .do_one_initcall+0xd8/0x1d0
[    0.012433] [c00000017666fd60] [c00000000000b1f8] .kernel_init+0x120/0x398

[    0.012439] [c00000017666fe30] [c000000000009ad4] .ret_from_kernel_thread+0x5c/0x64
.......

The reason is that the back chain of c00000017666fe30
(ret_from_kernel_thread) contains an invalid value, which can form a
loop, as the defensive walk sketched below illustrates.
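
A hedged sketch of a defensive walk that refuses to follow a back chain
which does not move strictly upward (the validation helper is assumed):

  /* sketch: stop tracing when the back chain would loop */
  extern int sp_looks_valid(unsigned long sp);  /* bounds check, assumed */

  void walk_back_chain(unsigned long sp)
  {
          while (sp_looks_valid(sp)) {
                  unsigned long next = *(unsigned long *)sp;  /* back chain */

                  /* a valid chain grows toward the stack base; else it loops */
                  if (next <= sp)
                          break;
                  sp = next;
          }
  }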

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-11-15 13:00:17 +11:00
Al Viro 53b50f9483 powerpc: take dereferencing to ret_from_kernel_thread()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-21 22:25:11 -04:00
Al Viro 40792104b2 powerpc: don't mess with r2 in copy_thread() and friends
kernel_thread() callbacks are *not* in modules and are not going to
be there.  And it's not even read in ppc32 ret_from_kernel_thread(),
so no need to bother with it there either.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-14 19:35:52 -04:00
Al Viro 138d1ce80e powerpc: switch to saner kernel_execve() semantics
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-14 19:35:44 -04:00
Linus Torvalds 8213a2f3ee Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull pile 2 of execve and kernel_thread unification work from Al Viro:
 "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
  several more architectures plus assorted signal fixes and cleanups.

  There'll be more (in particular, real fixes for the alpha
  do_notify_resume() irq mess)..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
  alpha: don't open-code trace_report_syscall_{enter,exit}
  Uninclude linux/freezer.h
  m32r: trim masks
  avr32: trim masks
  tile: don't bother with SIGTRAP in setup_frame
  microblaze: don't bother with SIGTRAP in setup_rt_frame()
  mn10300: don't bother with SIGTRAP in setup_frame()
  frv: no need to raise SIGTRAP in setup_frame()
  x86: get rid of duplicate code in case of CONFIG_VM86
  unicore32: remove pointless test
  h8300: trim _TIF_WORK_MASK
  parisc: decide whether to go to slow path (tracesys) based on thread flags
  parisc: don't bother looping in do_signal()
  parisc: fix double restarts
  bury the rest of TIF_IRET
  sanitize tsk_is_polling()
  bury _TIF_RESTORE_SIGMASK
  unicore32: unobfuscate _TIF_WORK_MASK
  mips: NOTIFY_RESUME is not needed in TIF masks
  mips: merge the identical "return from syscall" per-ABI code
  ...

Conflicts:
	arch/arm/include/asm/thread_info.h
2012-10-12 10:49:08 +09:00
Al Viro be6abfa769 powerpc: switch to generic sys_execve()/kernel_execve()
the only non-obvious part is that current_pt_regs() is really needed
here - task_pt_regs() is NULL for kernel threads; it's OK for ptrace
uses (the thing task_pt_regs() is intended for), but not for us.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-30 23:35:51 -04:00
Al Viro 58254e1002 powerpc: split ret_from_fork
... and get rid of in-kernel syscalls in kernel_thread()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-30 23:31:19 -04:00
Tiejun Chen a9c4e541ea powerpc/kprobe: Complete kprobe and migrate exception frame
We can't emulate stwu in place since that may corrupt the current exception
stack, so we have to do the real store operation in the exception return code.

First we allocate a trampoline exception frame below the kprobed
function's stack and copy the current exception frame to the trampoline.
Then we can do the real store operation to implement 'stwu', and point
r1 at the trampoline frame to complete the exception migration.
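
A hedged C sketch of the migration (frame size and offset decoding
simplified to parameters):

  /* sketch: emulate 'stwu r1,offset(r1)' by migrating the exception frame */
  #include <string.h>

  void migrate_exception_frame(unsigned long *r1, long offset, size_t frame_size)
  {
          unsigned long old_sp = *r1;
          unsigned long new_sp = old_sp + offset;  /* offset is negative */

          /* copy the live exception frame down to the trampoline location */
          memmove((void *)new_sp, (void *)old_sp, frame_size);
          /* the store half of stwu: old r1 goes at the new frame base */
          *(unsigned long *)new_sp = old_sp;
          *r1 = new_sp;  /* reroute r1 to the trampoline frame */
  }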

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-18 15:32:42 +10:00
Anton Blanchard 714332858b powerpc: Restore correct DSCR in context switch
During a context switch we always restore the per thread DSCR value.
If we aren't doing explicit DSCR management
(i.e. thread.dscr_inherit == 0) and the default DSCR changed while
the process was sleeping, we end up with the wrong value.

Check thread.dscr_inherit and select the default DSCR or per thread
DSCR as required.
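
A hedged sketch of the selection (structure and variable names invented):

  /* sketch: pick the DSCR to load when switching to a new thread */
  struct thread_sketch {
          int dscr_inherit;    /* non-zero: thread manages its own DSCR */
          unsigned long dscr;  /* per-thread value */
  };

  unsigned long dscr_default;  /* system-wide default, settable via sysfs */

  unsigned long choose_dscr(const struct thread_sketch *t)
  {
          return t->dscr_inherit ? t->dscr : dscr_default;
  }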

This was found with the following test case, when running with
more threads than CPUs (ie forcing context switching):

http://ozlabs.org/~anton/junkcode/dscr_default_test.c

With the four patches applied I can run a combination of all
test cases successfully at the same time:

http://ozlabs.org/~anton/junkcode/dscr_default_test.c
http://ozlabs.org/~anton/junkcode/dscr_explicit_test.c
http://ozlabs.org/~anton/junkcode/dscr_inherit_test.c

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org> # 3.0+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05 16:05:22 +10:00
Stuart Yoder 9778b696a0 powerpc: Use CURRENT_THREAD_INFO instead of open coded assembly
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-11 14:18:22 +10:00
Anton Blanchard ac1dc36558 powerpc: Clear RI and EE at the same time in system call exit
mtmsrd is an expensive instruction; we save a few cycles by
doing it once instead of twice.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03 14:14:43 +10:00
Tiejun Chen c58ce2b1e3 ppc64: fix missing to check all bits of _TIF_USER_WORK_MASK in preempt
In the entry_64.S version of ret_from_except_lite, you'll notice that
in the !preempt case, after we've checked MSR_PR we test for any
TIF flag in _TIF_USER_WORK_MASK to decide whether to go to do_work
or not. However, in the preempt case, we do a convoluted trick to
test SIGPENDING only if PR was set and always test NEED_RESCHED ...
but we forget to test any other bit of _TIF_USER_WORK_MASK !!! So
that means that with preempt, we completely fail to test for things
like single step, syscall tracing, etc...

This should be fixed with the following flow (rendered in C after the
list):

 - Test PR. If it is not set, go to resume_kernel; else continue.

 - If we went to resume_kernel, do the original do_work.

 - Otherwise, always test _TIF_USER_WORK_MASK to decide whether to do
the original user_work, else restore directly.
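
A hedged C rendering of that flow (the MSR_PR value is shown for
illustration; the actions are reduced to an enum):

  /* sketch: classify the exception return path */
  #define MSR_PR 0x4000UL  /* problem state (user mode) bit */

  enum ret_path { RESUME_KERNEL, DO_USER_WORK, RESTORE_DIRECT };

  enum ret_path classify_return(unsigned long msr, unsigned long ti_flags,
                                unsigned long user_work_mask)
  {
          if (!(msr & MSR_PR))
                  return RESUME_KERNEL;   /* kernel mode: preempt logic */
          if (ti_flags & user_work_mask)
                  return DO_USER_WORK;    /* signals, single step, tracing */
          return RESTORE_DIRECT;          /* nothing pending: straight back */
  }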

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-06-29 14:35:35 +10:00
Benjamin Herrenschmidt 8b6ee04067 Merge branch 'merge' into next
We want the irq fixes from the "merge" branch.
2012-05-14 10:19:22 +10:00
Benjamin Herrenschmidt 7c0482e3d0 powerpc/irq: Fix another case of lazy IRQ state getting out of sync
So we have another case of paca->irq_happened getting out of
sync with the HW irq state. This can happen when a perfmon
interrupt occurs while soft disabled, as it will return to a
soft disabled but hard enabled context while leaving a stale
PACA_IRQ_HARD_DIS flag set.

This patch fixes it, and also adds a test for the condition
of those flags being out of sync in arch_local_irq_restore()
when CONFIG_TRACE_IRQFLAGS is enabled.
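
A hedged sketch of that consistency test (flag value illustrative,
reporting helper assumed):

  /* sketch: catch a PACA_IRQ_HARD_DIS that disagrees with MSR[EE] */
  #define PACA_IRQ_HARD_DIS 0x01  /* value illustrative */

  extern void warn_out_of_sync(void);  /* hypothetical reporting helper */

  void check_irq_state(unsigned char irq_happened, int msr_ee_on)
  {
          /* "hard disabled" claimed while EE is on means stale lazy state */
          if ((irq_happened & PACA_IRQ_HARD_DIS) && msr_ee_on)
                  warn_out_of_sync();
  }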

This helps catch those gremlins faster (and so far I
can't seem to see any anymore, so that's good news).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-05-12 09:40:41 +10:00
Benjamin Herrenschmidt ea4e89afed Merge branch 'merge' into next 2012-05-09 10:57:57 +10:00
Benjamin Herrenschmidt 56dfa7fa19 powerpc/irq: Fix bug with new lazy IRQ handling code
We had a case where we could turn on hard interrupts while
leaving the PACA_IRQ_HARD_DIS bit set in the PACA. This can
in turn cause a BUG_ON() to hit in __check_irq_replay() due
to interrupt state getting out of sync.

The assembly code was also way too convoluted. Instead, we
now leave it to the C code to do the right thing which ends
up being smaller and more readable.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-05-09 09:42:21 +10:00
Anton Blanchard fd6c40f3b0 powerpc: Better scheduling of CR save code in system call path
At the moment system call entry looks like:

crclr	so
...
mfcr	r9
...
std	r9,_CCR(r1)

commit bd19c8994a ([POWERPC] system call micro optimisation) put
some space between the crclr and mfcr in order to avoid a stall.

There is still a stall seen between the mfcr and std. We can avoid
the crclr by doing it in a GPR with rlwinm which gives us more room
to better schedule the sequence.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-04-30 15:37:14 +10:00
Anton Blanchard 82087414c6 powerpc: No need to preserve count register across system call
The count register is volatile so we don't need to preserve it.
Store zero to the entry in the exception frame.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-04-30 15:35:10 +10:00
Anton Blanchard 823df43552 powerpc: No need to save XER in a system call
The XER is a volatile register so there is no need to save and restore
it over a system call - zero it out in the exception stack frame
instead.

This should fix a 5 cycle stall of the mfxer/std seen on POWER7.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-04-30 15:34:44 +10:00
Anton Blanchard d14299dec7 powerpc: Hide some system call labels from profile tools
syscall_dotrace_cont and syscall_error_cont tend to complicate perf
output so make them local.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-04-30 15:34:43 +10:00
Benjamin Herrenschmidt 7230c56441 powerpc: Rework lazy-interrupt handling
The current implementation of lazy interrupt handling has some
issues that this patch tries to address.

We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.

The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.

Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.

The base idea is to replace the "hard_enabled" field with an
"irq_happened" field in which we store a bit mask of which interrupts
occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.

We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).

This removes the need to play with the decrementer to try to create
fake interrupts, among others.
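
A hedged C sketch of the central bookkeeping (bit values illustrative;
vector numbers as in the architecture):

  /* sketch: record lazily masked interrupts, then pick one to replay */
  #define PACA_IRQ_HARD_DIS 0x01  /* values illustrative */
  #define PACA_IRQ_DBELL    0x02
  #define PACA_IRQ_EE       0x04
  #define PACA_IRQ_DEC      0x08

  void note_masked_interrupt(unsigned char *irq_happened, unsigned char bit)
  {
          *irq_happened |= bit;  /* remember it; replay on re-enable */
  }

  unsigned int next_replay_vector(unsigned char *irq_happened)
  {
          if (*irq_happened & PACA_IRQ_EE) {
                  *irq_happened &= ~PACA_IRQ_EE;
                  return 0x500;  /* external interrupt */
          }
          if (*irq_happened & PACA_IRQ_DEC) {
                  *irq_happened &= ~PACA_IRQ_DEC;
                  return 0x900;  /* decrementer */
          }
          if (*irq_happened & PACA_IRQ_DBELL) {
                  *irq_happened &= ~PACA_IRQ_DBELL;
                  return 0xa00;  /* doorbell */
          }
          return 0;  /* nothing left to replay */
  }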

In addition, this adds a few refinements:

 - We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.

 - Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.

 - On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)

 - We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.

Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

v2:

- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
  to retrigger an interrupt without preventing hard-enable

v3:

 - Fix or vs. ori bug on Book3E
 - Fix enabling of interrupts for some exceptions on Book3E

v4:

 - Fix resend of doorbells on return from interrupt on Book3E

v5:

 - Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.

v6:
 - 32-bit compile fix
 - more compile fixes with various .config combos
 - factor out the asm code to soft-disable interrupts
 - remove the C wrapper around preempt_schedule_irq

v7:
 - Fix a bug with hard irq state tracking on native power7
2012-03-09 13:25:06 +11:00
Benjamin Herrenschmidt d9ada91ae2 powerpc: Replace mfmsr instructions with load from PACA kernel_msr field
On 64-bit, the mfmsr instruction can be quite slow, slower
than loading a field from the cache-hot PACA, which happens
to already contain the value we want in most cases.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 10:55:20 +11:00
Benjamin Herrenschmidt 1421ae0b29 powerpc: Improve 64-bit syscall entry/exit
We unconditionally hard enable interrupts. This is unnecessary as
syscalls are expected to always be called with interrupts enabled.

While at it, we add a WARN_ON if that is not the case and
CONFIG_TRACE_IRQFLAGS is enabled (we don't want to add overhead
to the fast path when this is not set though).

Thus let's remove the enabling (and associated irq tracing) from
the syscall entry path. Also on Book3S, replace a few mfmsr
instructions with loads of PACAMSR from the PACA, which should be
faster & schedule better.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 10:55:04 +11:00
Benjamin Herrenschmidt 4f8cf36f48 powerpc: Remove legacy iSeries bits from assembly files
This removes the various bits of assembly in the kernel entry,
exception handling and SLB management code that were specific
to running under the legacy iSeries hypervisor which is no
longer supported.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 10:54:59 +11:00
Benjamin Herrenschmidt 18b246fa60 powerpc: Fix various issues with return to userspace
We have a few problems when returning to userspace. This is a
quick set of fixes for 3.3, I'll look into a more comprehensive
rework for 3.4. This fixes:

 - We kept interrupts soft-disabled when schedule'ing or calling
do_signal when returning to userspace as a result of a hardware
interrupt.

 - Rename do_signal to do_notify_resume like all other archs (and
do_signal_pending back to do_signal, which it was before Roland
changed it).

 - Add the missing call to key_replace_session_keyring() to
do_notify_resume().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
2012-02-22 16:48:53 +11:00
Matt Evans 44ae3ab335 powerpc: Free up some CPU feature bits by moving out MMU-related features
Some of the 64bit PPC CPU features are MMU-related, so this patch moves
them to MMU_FTR_ bits.  All cpu_has_feature()-style tests are moved to
mmu_has_feature(), and seven feature bits are freed as a result.

Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-27 14:18:52 +10:00
Alexey Kardashevskiy efcac6589a powerpc: Per process DSCR + some fixes (try#4)
The DSCR (aka Data Stream Control Register) is supported on some
server PowerPC chips and allows some control over the prefetch
of data streams.

This patch allows the value to be specified per thread by emulating
the corresponding mfspr and mtspr instructions. Children of such
threads inherit the value. Other threads use a default value that
can be specified in sysfs - /sys/devices/system/cpu/dscr_default.

If a thread starts while a non-default value is set in the sysfs entry,
all child threads inherit this non-default value even if
the sysfs value is changed later.
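
A hedged sketch of the mtspr emulation path (instruction decoding elided;
field names invented):

  /* sketch: emulate 'mtspr DSCR,rS' for a user thread */
  struct dscr_state_sketch {
          unsigned long dscr;  /* per-thread value, inherited by children */
          int dscr_inherit;    /* set once the thread writes its own value */
  };

  void emulate_mtspr_dscr(struct dscr_state_sketch *t,
                          unsigned long rs_value, unsigned long *nip)
  {
          t->dscr = rs_value;   /* children forked later inherit this */
          t->dscr_inherit = 1;  /* stop following the sysfs default */
          /* a real implementation would also write the hardware DSCR here */
          *nip += 4;            /* step past the emulated instruction */
  }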

Signed-off-by: Alexey Kardashevskiy <aik@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-27 14:18:19 +10:00
Benjamin Herrenschmidt 2dd60d79e0 powerpc: In HV mode, use HSPRG0 for PACA
When running in Hypervisor mode (arch 2.06 or later), we store the PACA
in HSPRG0 instead of SPRG1. The architecture specifies that SPRGs may be
lost during a "nap" power management operation (though they aren't
currently on POWER7) and this enables use of SPRG1 by KVM guests.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-20 11:03:22 +10:00
Paul Mackerras cf9efce0ce powerpc: Account time using timebase rather than PURR
Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
PURR register for measuring the user and system time used by
processes, as well as other related times such as hardirq and
softirq times.  This turns out to be quite confusing for users
because it means that a program will often be measured as taking
less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
than it does when run on a single-threaded processor (ST mode), even
though the program takes longer to finish.  The discrepancy is
accounted for as stolen time, which is also confusing, particularly
when there are no other partitions running.

This changes the accounting to use the timebase instead, meaning that
the reported user and system times are the actual number of real-time
seconds that the program was executing on the processor thread,
regardless of which SMT mode the processor is in.  Thus a program will
generally show greater user and system times when run on a
multi-threaded processor than on a single-threaded processor.

On pSeries systems on POWER5 or later processors, we measure the
stolen time (time when this partition wasn't running) using the
hypervisor dispatch trace log.  We check for new entries in the
log on every entry from user mode and on every transition from
kernel process context to soft or hard IRQ context (i.e. when
account_system_vtime() gets called).  So that we can correctly
distinguish time stolen from user time and time stolen from system
time, without having to check the log on every exit to user mode,
we store separate timestamps for exit to user mode and entry from
user mode.

On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
in account_system_vtime() (as before), and then apportion the SPURR
ticks since the last time we read it between scaled user time and
scaled system time according to the relative proportions of user
time and system time over the same interval.  This avoids having to
read the SPURR on every kernel entry and exit.  On systems that have
PURR but not SPURR (i.e., POWER5), we do the same using the PURR
rather than the SPURR.
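
A hedged sketch of the apportioning arithmetic described above:

  /* sketch: split SPURR ticks between scaled user and system time */
  void apportion_spurr(unsigned long spurr_delta,
                       unsigned long user_delta, unsigned long sys_delta,
                       unsigned long *scaled_user, unsigned long *scaled_sys)
  {
          unsigned long total = user_delta + sys_delta;

          if (!total)
                  return;  /* nothing elapsed, nothing to apportion */
          *scaled_user += spurr_delta * user_delta / total;
          *scaled_sys  += spurr_delta * sys_delta  / total;
  }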

This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
for now since it conflicts with the use of the dispatch trace log
by the time accounting code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-02 14:07:31 +10:00
Anton Blanchard f89451fbd2 powerpc: Feature nop out reservation clear when stcx checks address
The POWER architecture does not require stcx to check that it is operating
on the same address as the larx. This means it is possible for an
an exception handler to execute a larx, get a reservation, decide
not to do the stcx and then return back with an active reservation. If the
interrupted code was in the middle of a larx/stcx sequence the stcx could
incorrectly succeed.

All recent POWER CPUs check the address before letting the stcx succeed
so we can create a CPU feature and nop it out. As Ben suggested, we can
only do this in our syscall path because there is a remote possibility
some kernel code gets interrupted by an exception that ends up operating
on the same cacheline.

Thanks to Paul Mackerras and Derek Williams for the idea.

To test this I used a very simple null syscall (actually getppid) testcase
at http://ozlabs.org/~anton/junkcode/null_syscall.c

I tested against 2.6.35-git10 with the following changes against the
pseries_defconfig:

CONFIG_VIRT_CPU_ACCOUNTING=n
CONFIG_AUDIT=n
CONFIG_PPC_4K_PAGES=n
CONFIG_PPC_64K_PAGES=y
CONFIG_FORCE_MAX_ZONEORDER=9
CONFIG_PPC_SUBPAGE_PROT=n
CONFIG_FUNCTION_TRACER=n
CONFIG_FUNCTION_GRAPH_TRACER=n
CONFIG_IRQSOFF_TRACER=n
CONFIG_STACK_TRACER=n

to remove the overhead of virtual CPU accounting, syscall auditing and
the ftrace mcount tracers. 64kB pages were enabled to minimise TLB misses.

POWER6: +8.2%
POWER7: +7.0%

Another suggestion was to use a larx to something in the L1 instead of a stcx.
This was almost as fast as removing the larx on POWER6, but only 3.5% faster
on POWER7. We can use this to speed up the reservation clear in our
exception exit code.
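
A hedged sketch combining both ideas (the feature test is reduced to a
plain parameter; a larx to a scratch line replaces any stale reservation):

  /* sketch: clear a stale reservation only on CPUs that need it */
  void maybe_clear_reservation(int stcx_checks_address, unsigned long *scratch)
  {
          unsigned long tmp;

          if (stcx_checks_address)
                  return;  /* nop'ed out via a CPU feature bit in the patch */

          /* taking a new reservation on a scratch line replaces the stale one */
          asm volatile("ldarx %0,0,%1" : "=&r" (tmp) : "r" (scratch) : "memory");
          (void)tmp;
  }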

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-02 14:07:30 +10:00