1
0
Fork 0
alistair23-linux/arch/powerpc/kernel/exceptions-64s.S

1629 lines
45 KiB
ArmAsm
Raw Normal View History

/*
* This file contains the 64-bit "server" PowerPC variant
* of the low level exception handling including exception
* vectors, exception return, part of the slb and stab
* handling and other fixed offset specific things.
*
* This file is meant to be #included from head_64.S due to
* position dependent assembly.
*
* Most of this originates from head_64.S and thus has the same
* copyright history.
*
*/
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
#include <asm/hw_irq.h>
#include <asm/exception-64s.h>
#include <asm/ptrace.h>
#include <asm/cpuidle.h>
#include <asm/head-64.h>
/*
* There are a few constraints to be concerned with.
* - Real mode exceptions code/data must be located at their physical location.
* - Virtual mode exceptions must be mapped at their 0xc000... location.
* - Fixed location code must not call directly beyond the __end_interrupts
* area when built with CONFIG_RELOCATABLE. LOAD_HANDLER / bctr sequence
* must be used.
* - LOAD_HANDLER targets must be within first 64K of physical 0 /
* virtual 0xc00...
* - Conditional branch targets must be within +/-32K of caller.
*
* "Virtual exceptions" run with relocation on (MSR_IR=1, MSR_DR=1), and
* therefore don't have to run in physically located code or rfid to
* virtual mode kernel code. However on relocatable kernels they do have
* to branch to KERNELBASE offset because the rest of the kernel (outside
* the exception vectors) may be located elsewhere.
*
* Virtual exceptions correspond with physical, except their entry points
* are offset by 0xc000000000000000 and also tend to get an added 0x4000
* offset applied. Virtual exceptions are enabled with the Alternate
* Interrupt Location (AIL) bit set in the LPCR. However this does not
* guarantee they will be delivered virtually. Some conditions (see the ISA)
* cause exceptions to be delivered in real mode.
*
* It's impossible to receive interrupts below 0x300 via AIL.
*
* KVM: None of the virtual exceptions are from the guest. Anything that
* escalated to HV=1 from HV=0 is delivered via real mode handlers.
*
*
* We layout physical memory as follows:
* 0x0000 - 0x00ff : Secondary processor spin code
* 0x0100 - 0x18ff : Real mode pSeries interrupt vectors
* 0x1900 - 0x3fff : Real mode trampolines
* 0x4000 - 0x58ff : Relon (IR=1,DR=1) mode pSeries interrupt vectors
* 0x5900 - 0x6fff : Relon mode trampolines
* 0x7000 - 0x7fff : FWNMI data area
* 0x8000 - .... : Common interrupt handlers, remaining early
* setup code, rest of kernel.
*/
OPEN_FIXED_SECTION(real_vectors, 0x0100, 0x1900)
OPEN_FIXED_SECTION(real_trampolines, 0x1900, 0x4000)
OPEN_FIXED_SECTION(virt_vectors, 0x4000, 0x5900)
OPEN_FIXED_SECTION(virt_trampolines, 0x5900, 0x7000)
#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
/*
* Data area reserved for FWNMI option.
* This address (0x7000) is fixed by the RPA.
* pseries and powernv need to keep the whole page from
* 0x7000 to 0x8000 free for use by the firmware
*/
ZERO_FIXED_SECTION(fwnmi_page, 0x7000, 0x8000)
OPEN_TEXT_SECTION(0x8000)
#else
OPEN_TEXT_SECTION(0x7000)
#endif
USE_FIXED_SECTION(real_vectors)
#define LOAD_SYSCALL_HANDLER(reg) \
ld reg,PACAKBASE(r13); \
ori reg,reg,(ABS_ADDR(system_call_common))@l;
/* Syscall routine is used twice, in reloc-off and reloc-on paths */
#define SYSCALL_PSERIES_1 \
BEGIN_FTR_SECTION \
cmpdi r0,0x1ebe ; \
beq- 1f ; \
END_FTR_SECTION_IFSET(CPU_FTR_REAL_LE) \
mr r9,r13 ; \
GET_PACA(r13) ; \
mfspr r11,SPRN_SRR0 ; \
0:
#define SYSCALL_PSERIES_2_RFID \
mfspr r12,SPRN_SRR1 ; \
LOAD_SYSCALL_HANDLER(r10) ; \
mtspr SPRN_SRR0,r10 ; \
ld r10,PACAKMSR(r13) ; \
mtspr SPRN_SRR1,r10 ; \
rfid ; \
b . ; /* prevent speculative execution */
#define SYSCALL_PSERIES_3 \
/* Fast LE/BE switch system call */ \
1: mfspr r12,SPRN_SRR1 ; \
xori r12,r12,MSR_LE ; \
mtspr SPRN_SRR1,r12 ; \
rfid ; /* return to userspace */ \
b . ; /* prevent speculative execution */
#if defined(CONFIG_RELOCATABLE)
/*
* We can't branch directly so we do it via the CTR which
* is volatile across system calls.
*/
#define SYSCALL_PSERIES_2_DIRECT \
LOAD_SYSCALL_HANDLER(r12) ; \
mtctr r12 ; \
mfspr r12,SPRN_SRR1 ; \
li r10,MSR_RI ; \
mtmsrd r10,1 ; \
bctr ;
#else
/* We can branch directly */
#define SYSCALL_PSERIES_2_DIRECT \
mfspr r12,SPRN_SRR1 ; \
li r10,MSR_RI ; \
mtmsrd r10,1 ; /* Set RI (EE=0) */ \
b system_call_common ;
#endif
/*
* This is the start of the interrupt handlers for pSeries
* This code runs with relocation off.
* Code from here to __end_interrupts gets copied down to real
* address 0x100 when we are running a relocatable kernel.
* Therefore any relative branches in this section must only
* branch to labels in this section.
*/
.globl __start_interrupts
__start_interrupts:
EXC_REAL_BEGIN(system_reset, 0x100, 0x200)
SET_SCRATCH0(r13)
#ifdef CONFIG_PPC_P7_NAP
BEGIN_FTR_SECTION
/* Running native on arch 2.06 or later, check if we are
* waking up from nap/sleep/winkle.
*/
mfspr r13,SPRN_SRR1
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:23:08 -06:00
rlwinm. r13,r13,47-31,30,31
beq 9f
cmpwi cr3,r13,2
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:23:08 -06:00
GET_PACA(r13)
bl pnv_restore_hyp_resource
li r0,PNV_THREAD_RUNNING
stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:23:08 -06:00
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
li r0,KVM_HWTHREAD_IN_KERNEL
stb r0,HSTATE_HWTHREAD_STATE(r13)
/* Order setting hwthread_state vs. testing hwthread_req */
sync
lbz r0,HSTATE_HWTHREAD_REQ(r13)
cmpwi r0,0
beq 1f
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:23:08 -06:00
b kvm_start_guest
1:
#endif
powerpc/powernv: Return to cpu offline loop when finished in KVM guest When a secondary hardware thread has finished running a KVM guest, we currently put that thread into nap mode using a nap instruction in the KVM code. This changes the code so that instead of doing a nap instruction directly, we instead cause the call to power7_nap() that put the thread into nap mode to return. The reason for doing this is to avoid having the KVM code having to know what low-power mode to put the thread into. In the case of a secondary thread used to run a KVM guest, the thread will be offline from the point of view of the host kernel, and the relevant power7_nap() call is the one in pnv_smp_cpu_disable(). In this case we don't want to clear pending IPIs in the offline loop in that function, since that might cause us to miss the wakeup for the next time the thread needs to run a guest. To tell whether or not to clear the interrupt, we use the SRR1 value returned from power7_nap(), and check if it indicates an external interrupt. We arrange that the return from power7_nap() when we have finished running a guest returns 0, so pending interrupts don't get flushed in that case. Note that it is important a secondary thread that has finished executing in the guest, or that didn't have a guest to run, should not return to power7_nap's caller while the kvm_hstate.hwthread_req flag in the PACA is non-zero, because the return from power7_nap will reenable the MMU, and the MMU might still be in guest context. In this situation we spin at low priority in real mode waiting for hwthread_req to become zero. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-02 20:48:40 -07:00
/* Return SRR1 from power7_nap() */
mfspr r3,SPRN_SRR1
blt cr3,2f
b pnv_wakeup_loss
2: b pnv_wakeup_noloss
KVM: PPC: Allow book3s_hv guests to use SMT processor modes This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:23:08 -06:00
9:
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
#endif /* CONFIG_PPC_P7_NAP */
EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
NOTEST, 0x100)
EXC_REAL_END(system_reset, 0x100, 0x200)
EXC_VIRT_NONE(0x4100, 0x4200)
EXC_COMMON(system_reset_common, 0x100, system_reset_exception)
#ifdef CONFIG_PPC_PSERIES
/*
* Vectors for the FWNMI option. Share common code.
*/
TRAMP_REAL_BEGIN(system_reset_fwnmi)
SET_SCRATCH0(r13) /* save r13 */
EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
NOTEST, 0x100)
#endif /* CONFIG_PPC_PSERIES */
EXC_REAL_BEGIN(machine_check, 0x200, 0x300)
/* This is moved out of line as it can be patched by FW, but
* some code path might still want to branch into the original
* vector
*/
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
SET_SCRATCH0(r13) /* save r13 */
/*
* Running native on arch 2.06 or later, we may wakeup from winkle
* inside machine check. If yes, then last bit of HSPGR0 would be set
* to 1. Hence clear it unconditionally.
*/
GET_PACA(r13)
clrrdi r13,r13,1
SET_PACA(r13)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
EXCEPTION_PROLOG_0(PACA_EXMC)
powerpc/book3s: handle machine check in Linux host. Move machine check entry point into Linux. So far we were dependent on firmware to decode MCE error details and handover the high level info to OS. This patch introduces early machine check routine that saves the MCE information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate stack frame on emergency stack and set the r1 accordingly. This allows us to be prepared to take another exception without loosing context. One thing to note here that, if we get another machine check while ME bit is off then we risk a checkstop. Hence we restrict ourselves to save only MCE information and register saved on PACA_EXMC save are before we turn the ME bit on. We use paca->in_mce flag to differentiate between first entry and nested machine check entry which helps proper use of emergency stack. We increment paca->in_mce every time we enter in early machine check handler and decrement it while leaving. When we enter machine check early handler first time (paca->in_mce == 0), we are sure nobody is using MC emergency stack and allocate a stack frame at the start of the emergency stack. During subsequent entry (paca->in_mce > 0), we know that r1 points inside emergency stack and we allocate separate stack frame accordingly. This prevents us from clobbering MCE information during nested machine checks. The early machine check handler changes are placed under CPU_FTR_HVMODE section. This makes sure that the early machine check handler will get executed only in hypervisor kernel. This is the code flow: Machine Check Interrupt | V 0x200 vector ME=0, IR=0, DR=0 | V +-----------------------------------------------+ |machine_check_pSeries_early: | ME=0, IR=0, DR=0 | Alloc frame on emergency stack | | Save srr1, srr0, dar and dsisr on stack | +-----------------------------------------------+ | (ME=1, IR=0, DR=0, RFID) | V machine_check_handle_early ME=1, IR=0, DR=0 | V +-----------------------------------------------+ | machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0 | Things to do: (in next patches) | | Flush SLB for SLB errors | | Flush TLB for TLB errors | | Decode and save MCE info | +-----------------------------------------------+ | (Fall through existing exception handler routine.) | V machine_check_pSerie ME=1, IR=0, DR=0 | (ME=1, IR=1, DR=1, RFID) | V machine_check_common ME=1, IR=1, DR=1 . . . Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 08:34:08 -06:00
BEGIN_FTR_SECTION
b machine_check_powernv_early
powerpc/book3s: handle machine check in Linux host. Move machine check entry point into Linux. So far we were dependent on firmware to decode MCE error details and handover the high level info to OS. This patch introduces early machine check routine that saves the MCE information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate stack frame on emergency stack and set the r1 accordingly. This allows us to be prepared to take another exception without loosing context. One thing to note here that, if we get another machine check while ME bit is off then we risk a checkstop. Hence we restrict ourselves to save only MCE information and register saved on PACA_EXMC save are before we turn the ME bit on. We use paca->in_mce flag to differentiate between first entry and nested machine check entry which helps proper use of emergency stack. We increment paca->in_mce every time we enter in early machine check handler and decrement it while leaving. When we enter machine check early handler first time (paca->in_mce == 0), we are sure nobody is using MC emergency stack and allocate a stack frame at the start of the emergency stack. During subsequent entry (paca->in_mce > 0), we know that r1 points inside emergency stack and we allocate separate stack frame accordingly. This prevents us from clobbering MCE information during nested machine checks. The early machine check handler changes are placed under CPU_FTR_HVMODE section. This makes sure that the early machine check handler will get executed only in hypervisor kernel. This is the code flow: Machine Check Interrupt | V 0x200 vector ME=0, IR=0, DR=0 | V +-----------------------------------------------+ |machine_check_pSeries_early: | ME=0, IR=0, DR=0 | Alloc frame on emergency stack | | Save srr1, srr0, dar and dsisr on stack | +-----------------------------------------------+ | (ME=1, IR=0, DR=0, RFID) | V machine_check_handle_early ME=1, IR=0, DR=0 | V +-----------------------------------------------+ | machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0 | Things to do: (in next patches) | | Flush SLB for SLB errors | | Flush TLB for TLB errors | | Decode and save MCE info | +-----------------------------------------------+ | (Fall through existing exception handler routine.) | V machine_check_pSerie ME=1, IR=0, DR=0 | (ME=1, IR=1, DR=1, RFID) | V machine_check_common ME=1, IR=1, DR=1 . . . Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 08:34:08 -06:00
FTR_SECTION_ELSE
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
b machine_check_pSeries_0
powerpc/book3s: handle machine check in Linux host. Move machine check entry point into Linux. So far we were dependent on firmware to decode MCE error details and handover the high level info to OS. This patch introduces early machine check routine that saves the MCE information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate stack frame on emergency stack and set the r1 accordingly. This allows us to be prepared to take another exception without loosing context. One thing to note here that, if we get another machine check while ME bit is off then we risk a checkstop. Hence we restrict ourselves to save only MCE information and register saved on PACA_EXMC save are before we turn the ME bit on. We use paca->in_mce flag to differentiate between first entry and nested machine check entry which helps proper use of emergency stack. We increment paca->in_mce every time we enter in early machine check handler and decrement it while leaving. When we enter machine check early handler first time (paca->in_mce == 0), we are sure nobody is using MC emergency stack and allocate a stack frame at the start of the emergency stack. During subsequent entry (paca->in_mce > 0), we know that r1 points inside emergency stack and we allocate separate stack frame accordingly. This prevents us from clobbering MCE information during nested machine checks. The early machine check handler changes are placed under CPU_FTR_HVMODE section. This makes sure that the early machine check handler will get executed only in hypervisor kernel. This is the code flow: Machine Check Interrupt | V 0x200 vector ME=0, IR=0, DR=0 | V +-----------------------------------------------+ |machine_check_pSeries_early: | ME=0, IR=0, DR=0 | Alloc frame on emergency stack | | Save srr1, srr0, dar and dsisr on stack | +-----------------------------------------------+ | (ME=1, IR=0, DR=0, RFID) | V machine_check_handle_early ME=1, IR=0, DR=0 | V +-----------------------------------------------+ | machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0 | Things to do: (in next patches) | | Flush SLB for SLB errors | | Flush TLB for TLB errors | | Decode and save MCE info | +-----------------------------------------------+ | (Fall through existing exception handler routine.) | V machine_check_pSerie ME=1, IR=0, DR=0 | (ME=1, IR=1, DR=1, RFID) | V machine_check_common ME=1, IR=1, DR=1 . . . Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 08:34:08 -06:00
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
EXC_REAL_END(machine_check, 0x200, 0x300)
EXC_VIRT_NONE(0x4200, 0x4300)
TRAMP_REAL_BEGIN(machine_check_powernv_early)
BEGIN_FTR_SECTION
EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
/*
* Register contents:
* R13 = PACA
* R9 = CR
* Original R9 to R13 is saved on PACA_EXMC
*
* Switch to mc_emergency stack and handle re-entrancy (we limit
* the nested MCE upto level 4 to avoid stack overflow).
* Save MCE registers srr1, srr0, dar and dsisr and then set ME=1
*
* We use paca->in_mce to check whether this is the first entry or
* nested machine check. We increment paca->in_mce to track nested
* machine checks.
*
* If this is the first entry then set stack pointer to
* paca->mc_emergency_sp, otherwise r1 is already pointing to
* stack frame on mc_emergency stack.
*
* NOTE: We are here with MSR_ME=0 (off), which means we risk a
* checkstop if we get another machine check exception before we do
* rfid with MSR_ME=1.
*/
mr r11,r1 /* Save r1 */
lhz r10,PACA_IN_MCE(r13)
cmpwi r10,0 /* Are we in nested machine check */
bne 0f /* Yes, we are. */
/* First machine check entry */
ld r1,PACAMCEMERGSP(r13) /* Use MC emergency stack */
0: subi r1,r1,INT_FRAME_SIZE /* alloc stack frame */
addi r10,r10,1 /* increment paca->in_mce */
sth r10,PACA_IN_MCE(r13)
/* Limit nested MCE to level 4 to avoid stack overflow */
cmpwi r10,4
bgt 2f /* Check if we hit limit of 4 */
std r11,GPR1(r1) /* Save r1 on the stack. */
std r11,0(r1) /* make stack chain pointer */
mfspr r11,SPRN_SRR0 /* Save SRR0 */
std r11,_NIP(r1)
mfspr r11,SPRN_SRR1 /* Save SRR1 */
std r11,_MSR(r1)
mfspr r11,SPRN_DAR /* Save DAR */
std r11,_DAR(r1)
mfspr r11,SPRN_DSISR /* Save DSISR */
std r11,_DSISR(r1)
std r9,_CCR(r1) /* Save CR in stackframe */
/* Save r9 through r13 from EXMC save area to stack frame. */
EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
mfmsr r11 /* get MSR value */
ori r11,r11,MSR_ME /* turn on ME bit */
ori r11,r11,MSR_RI /* turn on RI bit */
LOAD_HANDLER(r12, machine_check_handle_early)
1: mtspr SPRN_SRR0,r12
mtspr SPRN_SRR1,r11
rfid
b . /* prevent speculative execution */
2:
/* Stack overflow. Stay on emergency stack and panic.
* Keep the ME bit off while panic-ing, so that if we hit
* another machine check we checkstop.
*/
addi r1,r1,INT_FRAME_SIZE /* go back to previous stack frame */
ld r11,PACAKMSR(r13)
LOAD_HANDLER(r12, unrecover_mce)
li r10,MSR_ME
andc r11,r11,r10 /* Turn off MSR_ME */
b 1b
b . /* prevent speculative execution */
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
TRAMP_REAL_BEGIN(machine_check_pSeries)
.globl machine_check_fwnmi
machine_check_fwnmi:
SET_SCRATCH0(r13) /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
machine_check_pSeries_0:
EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
/*
* The following is essentially EXCEPTION_PROLOG_PSERIES_1 with the
* difference that MSR_RI is not enabled, because PACA_EXMC is being
* used, so nested machine check corrupts it. machine_check_common
* enables MSR_RI.
*/
ld r10,PACAKMSR(r13)
xori r10,r10,MSR_RI
mfspr r11,SPRN_SRR0
LOAD_HANDLER(r12, machine_check_common)
mtspr SPRN_SRR0,r12
mfspr r12,SPRN_SRR1
mtspr SPRN_SRR1,r10
rfid
b . /* prevent speculative execution */
TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
EXC_COMMON_BEGIN(machine_check_common)
/*
* Machine check is different because we use a different
* save area: PACA_EXMC instead of PACA_EXGEN.
*/
mfspr r10,SPRN_DAR
std r10,PACA_EXMC+EX_DAR(r13)
mfspr r10,SPRN_DSISR
stw r10,PACA_EXMC+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
FINISH_NAP
RECONCILE_IRQ_STATE(r10, r11)
ld r3,PACA_EXMC+EX_DAR(r13)
lwz r4,PACA_EXMC+EX_DSISR(r13)
/* Enable MSR_RI when finished with PACA_EXMC */
li r10,MSR_RI
mtmsrd r10,1
std r3,_DAR(r1)
std r4,_DSISR(r1)
bl save_nvgprs
addi r3,r1,STACK_FRAME_OVERHEAD
bl machine_check_exception
b ret_from_except
#define MACHINE_CHECK_HANDLER_WINDUP \
/* Clear MSR_RI before setting SRR0 and SRR1. */\
li r0,MSR_RI; \
mfmsr r9; /* get MSR value */ \
andc r9,r9,r0; \
mtmsrd r9,1; /* Clear MSR_RI */ \
/* Move original SRR0 and SRR1 into the respective regs */ \
ld r9,_MSR(r1); \
mtspr SPRN_SRR1,r9; \
ld r3,_NIP(r1); \
mtspr SPRN_SRR0,r3; \
ld r9,_CTR(r1); \
mtctr r9; \
ld r9,_XER(r1); \
mtxer r9; \
ld r9,_LINK(r1); \
mtlr r9; \
REST_GPR(0, r1); \
REST_8GPRS(2, r1); \
REST_GPR(10, r1); \
ld r11,_CCR(r1); \
mtcr r11; \
/* Decrement paca->in_mce. */ \
lhz r12,PACA_IN_MCE(r13); \
subi r12,r12,1; \
sth r12,PACA_IN_MCE(r13); \
REST_GPR(11, r1); \
REST_2GPRS(12, r1); \
/* restore original r1. */ \
ld r1,GPR1(r1)
/*
* Handle machine check early in real mode. We come here with
* ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack.
*/
EXC_COMMON_BEGIN(machine_check_handle_early)
std r0,GPR0(r1) /* Save r0 */
EXCEPTION_PROLOG_COMMON_3(0x200)
bl save_nvgprs
addi r3,r1,STACK_FRAME_OVERHEAD
bl machine_check_early
std r3,RESULT(r1) /* Save result */
ld r12,_MSR(r1)
#ifdef CONFIG_PPC_P7_NAP
/*
* Check if thread was in power saving mode. We come here when any
* of the following is true:
* a. thread wasn't in power saving mode
* b. thread was in power saving mode with no state loss,
* supervisor state loss or hypervisor state loss.
*
* Go back to nap/sleep/winkle mode again if (b) is true.
*/
rlwinm. r11,r12,47-31,30,31 /* Was it in power saving mode? */
beq 4f /* No, it wasn;t */
/* Thread was in power saving mode. Go back to nap again. */
cmpwi r11,2
blt 3f
/* Supervisor/Hypervisor state loss */
li r0,1
stb r0,PACA_NAPSTATELOST(r13)
3: bl machine_check_queue_event
MACHINE_CHECK_HANDLER_WINDUP
GET_PACA(r13)
ld r1,PACAR1(r13)
/*
* Check what idle state this CPU was in and go back to same mode
* again.
*/
lbz r3,PACA_THREAD_IDLE_STATE(r13)
cmpwi r3,PNV_THREAD_NAP
bgt 10f
IDLE_STATE_ENTER_SEQ(PPC_NAP)
/* No return */
10:
cmpwi r3,PNV_THREAD_SLEEP
bgt 2f
IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
/* No return */
2:
/*
* Go back to winkle. Please note that this thread was woken up in
* machine check from winkle and have not restored the per-subcore
* state. Hence before going back to winkle, set last bit of HSPGR0
* to 1. This will make sure that if this thread gets woken up
* again at reset vector 0x100 then it will get chance to restore
* the subcore state.
*/
ori r13,r13,1
SET_PACA(r13)
IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
/* No return */
4:
#endif
/*
* Check if we are coming from hypervisor userspace. If yes then we
* continue in host kernel in V mode to deliver the MC event.
*/
rldicl. r11,r12,4,63 /* See if MC hit while in HV mode. */
beq 5f
andi. r11,r12,MSR_PR /* See if coming from user. */
bne 9f /* continue in V mode if we are. */
5:
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
/*
* We are coming from kernel context. Check if we are coming from
* guest. if yes, then we can continue. We will fall through
* do_kvm_200->kvmppc_interrupt to deliver the MC event to guest.
*/
lbz r11,HSTATE_IN_GUEST(r13)
cmpwi r11,0 /* Check if coming from guest */
bne 9f /* continue if we are. */
#endif
/*
* At this point we are not sure about what context we come from.
* Queue up the MCE event and return from the interrupt.
* But before that, check if this is an un-recoverable exception.
* If yes, then stay on emergency stack and panic.
*/
andi. r11,r12,MSR_RI
bne 2f
1: mfspr r11,SPRN_SRR0
LOAD_HANDLER(r10,unrecover_mce)
mtspr SPRN_SRR0,r10
ld r10,PACAKMSR(r13)
/*
* We are going down. But there are chances that we might get hit by
* another MCE during panic path and we may run into unstable state
* with no way out. Hence, turn ME bit off while going down, so that
* when another MCE is hit during panic path, system will checkstop
* and hypervisor will get restarted cleanly by SP.
*/
li r3,MSR_ME
andc r10,r10,r3 /* Turn off MSR_ME */
mtspr SPRN_SRR1,r10
rfid
b .
2:
/*
* Check if we have successfully handled/recovered from error, if not
* then stay on emergency stack and panic.
*/
ld r3,RESULT(r1) /* Load result */
cmpdi r3,0 /* see if we handled MCE successfully */
beq 1b /* if !handled then panic */
/*
* Return from MC interrupt.
* Queue up the MCE event so that we can log it later, while
* returning from kernel or opal call.
*/
bl machine_check_queue_event
MACHINE_CHECK_HANDLER_WINDUP
rfid
9:
/* Deliver the machine check to host kernel in V mode. */
MACHINE_CHECK_HANDLER_WINDUP
b machine_check_pSeries
EXC_COMMON_BEGIN(unrecover_mce)
/* Invoke machine_check_exception to print MCE event and panic. */
addi r3,r1,STACK_FRAME_OVERHEAD
bl machine_check_exception
/*
* We will not reach here. Even if we did, there is no way out. Call
* unrecoverable_exception and die.
*/
1: addi r3,r1,STACK_FRAME_OVERHEAD
bl unrecoverable_exception
b 1b
EXC_REAL(data_access, 0x300, 0x380)
EXC_VIRT(data_access, 0x4300, 0x4380, 0x300)
TRAMP_KVM_SKIP(PACA_EXGEN, 0x300)
EXC_COMMON_BEGIN(data_access_common)
/*
* Here r13 points to the paca, r9 contains the saved CR,
* SRR0 and SRR1 are saved in r11 and r12,
* r9 - r13 are saved in paca->exgen.
*/
mfspr r10,SPRN_DAR
std r10,PACA_EXGEN+EX_DAR(r13)
mfspr r10,SPRN_DSISR
stw r10,PACA_EXGEN+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0x300, PACA_EXGEN)
RECONCILE_IRQ_STATE(r10, r11)
ld r12,_MSR(r1)
ld r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
li r5,0x300
std r3,_DAR(r1)
std r4,_DSISR(r1)
BEGIN_MMU_FTR_SECTION
b do_hash_page /* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
b handle_page_fault
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
EXC_REAL_BEGIN(data_access_slb, 0x380, 0x400)
SET_SCRATCH0(r13)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
EXCEPTION_PROLOG_0(PACA_EXSLB)
EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST_PR, 0x380)
std r3,PACA_EXSLB+EX_R3(r13)
mfspr r3,SPRN_DAR
mfspr r12,SPRN_SRR1
powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address Currently, if userspace or the kernel accesses a completely bogus address, for example with any of bits 46-59 set, we first take an SLB miss interrupt, install a corresponding SLB entry with VSID 0, retry the instruction, then take a DSI/ISI interrupt because there is no HPT entry mapping the address. However, by the time of the second interrupt, the Come-From Address Register (CFAR) has been overwritten by the rfid instruction at the end of the SLB miss interrupt handler. Since bogus accesses can often be caused by a function return after the stack has been overwritten, the CFAR value would be very useful as it could indicate which function it was whose return had led to the bogus address. This patch adds code to create a full exception frame in the SLB miss handler in the case of a bogus address, rather than inserting an SLB entry with a zero VSID field. Then we call a new slb_miss_bad_addr() function in C code, which delivers a signal for a user access or creates an oops for a kernel access. In the latter case the oops message will show the CFAR value at the time of the access. In the case of the radix MMU, a segment miss interrupt indicates an access outside the ranges mapped by the page tables. Previously this was handled by the code for an unrecoverable SLB miss (one with MSR[RI] = 0), which is not really correct. With this patch, we now handle these interrupts with slb_miss_bad_addr(), which is much more consistent. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-02 05:49:21 -06:00
crset 4*cr6+eq
#ifndef CONFIG_RELOCATABLE
b slb_miss_realmode
#else
/*
* We can't just use a direct branch to slb_miss_realmode
* because the distance from here to there depends on where
* the kernel ends up being put.
*/
mfctr r11
LOAD_HANDLER(r10, slb_miss_realmode)
mtctr r10
bctr
#endif
EXC_REAL_END(data_access_slb, 0x380, 0x400)
EXC_VIRT_BEGIN(data_access_slb, 0x4380, 0x4400)
SET_SCRATCH0(r13)
EXCEPTION_PROLOG_0(PACA_EXSLB)
EXCEPTION_PROLOG_1(PACA_EXSLB, NOTEST, 0x380)
std r3,PACA_EXSLB+EX_R3(r13)
mfspr r3,SPRN_DAR
mfspr r12,SPRN_SRR1
crset 4*cr6+eq
#ifndef CONFIG_RELOCATABLE
b slb_miss_realmode
#else
/*
* We can't just use a direct branch to slb_miss_realmode
* because the distance from here to there depends on where
* the kernel ends up being put.
*/
mfctr r11
LOAD_HANDLER(r10, slb_miss_realmode)
mtctr r10
bctr
#endif
EXC_VIRT_END(data_access_slb, 0x4380, 0x4400)
TRAMP_KVM_SKIP(PACA_EXSLB, 0x380)
EXC_REAL(instruction_access, 0x400, 0x480)
EXC_VIRT(instruction_access, 0x4400, 0x4480, 0x400)
TRAMP_KVM(PACA_EXGEN, 0x400)
EXC_COMMON_BEGIN(instruction_access_common)
EXCEPTION_PROLOG_COMMON(0x400, PACA_EXGEN)
RECONCILE_IRQ_STATE(r10, r11)
ld r12,_MSR(r1)
ld r3,_NIP(r1)
andis. r4,r12,0x5820
li r5,0x400
std r3,_DAR(r1)
std r4,_DSISR(r1)
BEGIN_MMU_FTR_SECTION
b do_hash_page /* Try to handle as hpte fault */
MMU_FTR_SECTION_ELSE
b handle_page_fault
ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
EXC_REAL_BEGIN(instruction_access_slb, 0x480, 0x500)
SET_SCRATCH0(r13)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
EXCEPTION_PROLOG_0(PACA_EXSLB)
EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST_PR, 0x480)
std r3,PACA_EXSLB+EX_R3(r13)
mfspr r3,SPRN_SRR0 /* SRR0 is faulting address */
mfspr r12,SPRN_SRR1
powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address Currently, if userspace or the kernel accesses a completely bogus address, for example with any of bits 46-59 set, we first take an SLB miss interrupt, install a corresponding SLB entry with VSID 0, retry the instruction, then take a DSI/ISI interrupt because there is no HPT entry mapping the address. However, by the time of the second interrupt, the Come-From Address Register (CFAR) has been overwritten by the rfid instruction at the end of the SLB miss interrupt handler. Since bogus accesses can often be caused by a function return after the stack has been overwritten, the CFAR value would be very useful as it could indicate which function it was whose return had led to the bogus address. This patch adds code to create a full exception frame in the SLB miss handler in the case of a bogus address, rather than inserting an SLB entry with a zero VSID field. Then we call a new slb_miss_bad_addr() function in C code, which delivers a signal for a user access or creates an oops for a kernel access. In the latter case the oops message will show the CFAR value at the time of the access. In the case of the radix MMU, a segment miss interrupt indicates an access outside the ranges mapped by the page tables. Previously this was handled by the code for an unrecoverable SLB miss (one with MSR[RI] = 0), which is not really correct. With this patch, we now handle these interrupts with slb_miss_bad_addr(), which is much more consistent. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-02 05:49:21 -06:00
crclr 4*cr6+eq
#ifndef CONFIG_RELOCATABLE
b slb_miss_realmode
#else
mfctr r11
LOAD_HANDLER(r10, slb_miss_realmode)
mtctr r10
bctr
#endif
EXC_REAL_END(instruction_access_slb, 0x480, 0x500)
EXC_VIRT_BEGIN(instruction_access_slb, 0x4480, 0x4500)
SET_SCRATCH0(r13)
EXCEPTION_PROLOG_0(PACA_EXSLB)
EXCEPTION_PROLOG_1(PACA_EXSLB, NOTEST, 0x480)
std r3,PACA_EXSLB+EX_R3(r13)
mfspr r3,SPRN_SRR0 /* SRR0 is faulting address */
mfspr r12,SPRN_SRR1
crclr 4*cr6+eq
#ifndef CONFIG_RELOCATABLE
b slb_miss_realmode
#else
mfctr r11
LOAD_HANDLER(r10, slb_miss_realmode)
mtctr r10
bctr
#endif
EXC_VIRT_END(instruction_access_slb, 0x4480, 0x4500)
TRAMP_KVM(PACA_EXSLB, 0x480)
/* This handler is used by both 0x380 and 0x480 slb miss interrupts */
EXC_COMMON_BEGIN(slb_miss_realmode)
/*
* r13 points to the PACA, r9 contains the saved CR,
* r12 contain the saved SRR1, SRR0 is still ready for return
* r3 has the faulting address
* r9 - r13 are saved in paca->exslb.
* r3 is saved in paca->slb_r3
* cr6.eq is set for a D-SLB miss, clear for a I-SLB miss
* We assume we aren't going to take any exceptions during this
* procedure.
*/
mflr r10
#ifdef CONFIG_RELOCATABLE
mtctr r11
#endif
stw r9,PACA_EXSLB+EX_CCR(r13) /* save CR in exc. frame */
std r10,PACA_EXSLB+EX_LR(r13) /* save LR */
std r3,PACA_EXSLB+EX_DAR(r13)
crset 4*cr0+eq
#ifdef CONFIG_PPC_STD_MMU_64
BEGIN_MMU_FTR_SECTION
bl slb_allocate_realmode
END_MMU_FTR_SECTION_IFCLR(MMU_FTR_TYPE_RADIX)
#endif
ld r10,PACA_EXSLB+EX_LR(r13)
ld r3,PACA_EXSLB+EX_R3(r13)
lwz r9,PACA_EXSLB+EX_CCR(r13) /* get saved CR */
mtlr r10
beq 8f /* if bad address, make full stack frame */
andi. r10,r12,MSR_RI /* check for unrecoverable exception */
beq- 2f
/* All done -- return from exception. */
.machine push
.machine "power4"
mtcrf 0x80,r9
mtcrf 0x02,r9 /* I/D indication is in cr6 */
mtcrf 0x01,r9 /* slb_allocate uses cr0 and cr7 */
.machine pop
RESTORE_PPR_PACA(PACA_EXSLB, r9)
ld r9,PACA_EXSLB+EX_R9(r13)
ld r10,PACA_EXSLB+EX_R10(r13)
ld r11,PACA_EXSLB+EX_R11(r13)
ld r12,PACA_EXSLB+EX_R12(r13)
ld r13,PACA_EXSLB+EX_R13(r13)
rfid
b . /* prevent speculative execution */
2: mfspr r11,SPRN_SRR0
LOAD_HANDLER(r10,unrecov_slb)
mtspr SPRN_SRR0,r10
ld r10,PACAKMSR(r13)
mtspr SPRN_SRR1,r10
rfid
b .
8: mfspr r11,SPRN_SRR0
LOAD_HANDLER(r10,bad_addr_slb)
mtspr SPRN_SRR0,r10
ld r10,PACAKMSR(r13)
mtspr SPRN_SRR1,r10
rfid
b .
EXC_COMMON_BEGIN(unrecov_slb)
EXCEPTION_PROLOG_COMMON(0x4100, PACA_EXSLB)
RECONCILE_IRQ_STATE(r10, r11)
bl save_nvgprs
1: addi r3,r1,STACK_FRAME_OVERHEAD
bl unrecoverable_exception
b 1b
EXC_COMMON_BEGIN(bad_addr_slb)
EXCEPTION_PROLOG_COMMON(0x380, PACA_EXSLB)
RECONCILE_IRQ_STATE(r10, r11)
ld r3, PACA_EXSLB+EX_DAR(r13)
std r3, _DAR(r1)
beq cr6, 2f
li r10, 0x480 /* fix trap number for I-SLB miss */
std r10, _TRAP(r1)
2: bl save_nvgprs
addi r3, r1, STACK_FRAME_OVERHEAD
bl slb_miss_bad_addr
b ret_from_except
EXC_REAL_BEGIN(hardware_interrupt, 0x500, 0x600)
.globl hardware_interrupt_hv;
hardware_interrupt_hv:
BEGIN_FTR_SECTION
_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
EXC_HV, SOFTEN_TEST_HV)
do_kvm_H0x500:
KVM_HANDLER(PACA_EXGEN, EXC_HV, 0x502)
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:21:34 -06:00
FTR_SECTION_ELSE
_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt_common,
powerpc/64: Include KVM guest test in all interrupt vectors Currently, if HV KVM is configured but PR KVM isn't, we don't include a test to see whether we were interrupted in KVM guest context for the set of interrupts which get delivered directly to the guest by hardware if they occur in the guest. This includes things like program interrupts. However, the recent bug where userspace could set the MSR for a VCPU to have an illegal value in the TS field, and thus cause a TM Bad Thing type of program interrupt on the hrfid that enters the guest, showed that we can never be completely sure that these interrupts can never occur in the guest entry/exit code. If one of these interrupts does happen and we have HV KVM configured but not PR KVM, then we end up trying to run the handler in the host with the MMU set to the guest MMU context, which generally ends badly. Thus, for robustness it is better to have the test in every interrupt vector, so that if some way is found to trigger some interrupt in the guest entry/exit path, we can handle it without immediately crashing the host. This means that the distinction between KVMTEST and KVMTEST_PR goes away. Thus we delete KVMTEST_PR and associated macros and use KVMTEST everywhere that we previously used either KVMTEST_PR or KVMTEST. It also means that SOFTEN_TEST_HV_201 becomes the same as SOFTEN_TEST_PR, so we deleted SOFTEN_TEST_HV_201 and use SOFTEN_TEST_PR instead. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-11-11 22:44:42 -07:00
EXC_STD, SOFTEN_TEST_PR)
do_kvm_0x500:
KVM: PPC: Add support for Book3S processors in hypervisor mode This adds support for KVM running on 64-bit Book 3S processors, specifically POWER7, in hypervisor mode. Using hypervisor mode means that the guest can use the processor's supervisor mode. That means that the guest can execute privileged instructions and access privileged registers itself without trapping to the host. This gives excellent performance, but does mean that KVM cannot emulate a processor architecture other than the one that the hardware implements. This code assumes that the guest is running paravirtualized using the PAPR (Power Architecture Platform Requirements) interface, which is the interface that IBM's PowerVM hypervisor uses. That means that existing Linux distributions that run on IBM pSeries machines will also run under KVM without modification. In order to communicate the PAPR hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code to include/linux/kvm.h. Currently the choice between book3s_hv support and book3s_pr support (i.e. the existing code, which runs the guest in user mode) has to be made at kernel configuration time, so a given kernel binary can only do one or the other. This new book3s_hv code doesn't support MMIO emulation at present. Since we are running paravirtualized guests, this isn't a serious restriction. With the guest running in supervisor mode, most exceptions go straight to the guest. We will never get data or instruction storage or segment interrupts, alignment interrupts, decrementer interrupts, program interrupts, single-step interrupts, etc., coming to the hypervisor from the guest. Therefore this introduces a new KVMTEST_NONHV macro for the exception entry path so that we don't have to do the KVM test on entry to those exception handlers. We do however get hypervisor decrementer, hypervisor data storage, hypervisor instruction storage, and hypervisor emulation assist interrupts, so we have to handle those. In hypervisor mode, real-mode accesses can access all of RAM, not just a limited amount. Therefore we put all the guest state in the vcpu.arch and use the shadow_vcpu in the PACA only for temporary scratch space. We allocate the vcpu with kzalloc rather than vzalloc, and we don't use anything in the kvmppc_vcpu_book3s struct, so we don't allocate it. We don't have a shared page with the guest, but we still need a kvm_vcpu_arch_shared struct to store the values of various registers, so we include one in the vcpu_arch struct. The POWER7 processor has a restriction that all threads in a core have to be in the same partition. MMU-on kernel code counts as a partition (partition 0), so we have to do a partition switch on every entry to and exit from the guest. At present we require the host and guest to run in single-thread mode because of this hardware restriction. This code allocates a hashed page table for the guest and initializes it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We require that the guest memory is allocated using 16MB huge pages, in order to simplify the low-level memory management. This also means that we can get away without tracking paging activity in the host for now, since huge pages can't be paged or swapped. This also adds a few new exports needed by the book3s_hv code. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-28 18:21:34 -06:00
KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x500)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
EXC_REAL_END(hardware_interrupt, 0x500, 0x600)
EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x4600)
.globl hardware_interrupt_relon_hv;
hardware_interrupt_relon_hv:
BEGIN_FTR_SECTION
_MASKABLE_RELON_EXCEPTION_PSERIES(0x500, hardware_interrupt_common, EXC_HV, SOFTEN_TEST_HV)
FTR_SECTION_ELSE
_MASKABLE_RELON_EXCEPTION_PSERIES(0x500, hardware_interrupt_common, EXC_STD, SOFTEN_TEST_PR)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
EXC_VIRT_END(hardware_interrupt, 0x4500, 0x4600)
EXC_COMMON_ASYNC(hardware_interrupt_common, 0x500, do_IRQ)
EXC_REAL(alignment, 0x600, 0x700)
EXC_VIRT(alignment, 0x4600, 0x4700, 0x600)
TRAMP_KVM(PACA_EXGEN, 0x600)
EXC_COMMON_BEGIN(alignment_common)
mfspr r10,SPRN_DAR
std r10,PACA_EXGEN+EX_DAR(r13)
mfspr r10,SPRN_DSISR
stw r10,PACA_EXGEN+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0x600, PACA_EXGEN)
ld r3,PACA_EXGEN+EX_DAR(r13)
lwz r4,PACA_EXGEN+EX_DSISR(r13)
std r3,_DAR(r1)
std r4,_DSISR(r1)
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl alignment_exception
b ret_from_except
EXC_REAL(program_check, 0x700, 0x800)
EXC_VIRT(program_check, 0x4700, 0x4800, 0x700)
TRAMP_KVM(PACA_EXGEN, 0x700)
EXC_COMMON_BEGIN(program_check_common)
EXCEPTION_PROLOG_COMMON(0x700, PACA_EXGEN)
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl program_check_exception
b ret_from_except
EXC_REAL(fp_unavailable, 0x800, 0x900)
EXC_VIRT(fp_unavailable, 0x4800, 0x4900, 0x800)
TRAMP_KVM(PACA_EXGEN, 0x800)
EXC_COMMON_BEGIN(fp_unavailable_common)
EXCEPTION_PROLOG_COMMON(0x800, PACA_EXGEN)
bne 1f /* if from user, just load it up */
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl kernel_fp_unavailable_exception
BUG_OPCODE
1:
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
BEGIN_FTR_SECTION
/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
* transaction), go do TM stuff
*/
rldicl. r0, r12, (64-MSR_TS_LG), (64-2)
bne- 2f
END_FTR_SECTION_IFSET(CPU_FTR_TM)
#endif
bl load_up_fpu
b fast_exception_return
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2: /* User process was in a transaction */
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl fp_unavailable_tm
b ret_from_except
#endif
EXC_REAL_MASKABLE(decrementer, 0x900, 0x980)
EXC_VIRT_MASKABLE(decrementer, 0x4900, 0x4980, 0x900)
TRAMP_KVM(PACA_EXGEN, 0x900)
EXC_COMMON_ASYNC(decrementer_common, 0x900, timer_interrupt)
powerpc: Fix "attempt to move .org backwards" error Building a 64-bit powerpc kernel with PR KVM enabled currently gives this error: AS arch/powerpc/kernel/head_64.o arch/powerpc/kernel/exceptions-64s.S: Assembler messages: arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1 This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into 33 instructions, but we only have space for 32 at the decrementer interrupt vector (from 0x900 to 0x980). In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we currently have two instances of the HMT_MEDIUM macro, which has the effect of setting the SMT thread priority to medium. One is the first instruction, and is overwritten by a no-op on processors where we save the PPR (processor priority register), that is, POWER7 or later. The other is after we have saved the PPR. In order to reduce the code at 0x900 by one instruction, we omit the first HMT_MEDIUM. On processors without SMT this will have no effect since HMT_MEDIUM is a no-op there. On POWER5 and RS64 machines this will mean that the first few instructions take a little longer in the case where a decrementer interrupt occurs when the hardware thread is running at low SMT priority. On POWER6 and later machines, the hardware automatically boosts the thread priority when a decrementer interrupt is taken if the thread priority was below medium, so this change won't make any difference. The alternative would be to branch out of line after saving the CFAR. However, that would incur an extra overhead on all processors, whereas the approach adopted here only adds overhead on older threaded processors. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-25 11:51:40 -06:00
EXC_REAL_HV(hdecrementer, 0x980, 0xa00)
EXC_VIRT_HV(hdecrementer, 0x4980, 0x4a00, 0x980)
TRAMP_KVM_HV(PACA_EXGEN, 0x980)
EXC_COMMON(hdecrementer_common, 0x980, hdec_interrupt)
EXC_REAL_MASKABLE(doorbell_super, 0xa00, 0xb00)
TRAMP_KVM(PACA_EXGEN, 0xa00)
EXC_REAL(trap_0b, 0xb00, 0xc00)
TRAMP_KVM(PACA_EXGEN, 0xb00)
EXC_REAL_BEGIN(system_call, 0xc00, 0xd00)
/*
* If CONFIG_KVM_BOOK3S_64_HANDLER is set, save the PPR (on systems
* that support it) before changing to HMT_MEDIUM. That allows the KVM
* code to save that value into the guest state (it is the guest's PPR
* value). Otherwise just change to HMT_MEDIUM as userspace has
* already saved the PPR.
*/
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
SET_SCRATCH0(r13)
GET_PACA(r13)
std r9,PACA_EXGEN+EX_R9(r13)
OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR);
HMT_MEDIUM;
std r10,PACA_EXGEN+EX_R10(r13)
OPT_SAVE_REG_TO_PACA(PACA_EXGEN+EX_PPR, r9, CPU_FTR_HAS_PPR);
mfcr r9
KVMTEST_PR(0xc00)
GET_SCRATCH0(r13)
#else
HMT_MEDIUM;
#endif
SYSCALL_PSERIES_1
SYSCALL_PSERIES_2_RFID
SYSCALL_PSERIES_3
EXC_REAL_END(system_call, 0xc00, 0xd00)
TRAMP_KVM(PACA_EXGEN, 0xc00)
EXC_REAL(single_step, 0xd00, 0xe00)
TRAMP_KVM(PACA_EXGEN, 0xd00)
/* At 0xe??? we have a bunch of hypervisor exceptions, we branch
* out of line to handle them
*/
__EXC_REAL_OOL_HV(h_data_storage, 0xe00, 0xe20)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
__EXC_REAL_OOL_HV(h_instr_storage, 0xe20, 0xe40)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
__EXC_REAL_OOL_HV(emulation_assist, 0xe40, 0xe60)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
__EXC_REAL_OOL_HV_DIRECT(hmi_exception, 0xe60, 0xe80, hmi_exception_early)
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
__EXC_REAL_OOL_MASKABLE_HV(h_doorbell, 0xe80, 0xea0)
__EXC_REAL_OOL_MASKABLE_HV(h_virt_irq, 0xea0, 0xec0)
EXC_REAL_NONE(0xec0, 0xf00)
__EXC_REAL_OOL(performance_monitor, 0xf00, 0xf20)
__EXC_REAL_OOL(altivec_unavailable, 0xf20, 0xf40)
__EXC_REAL_OOL(vsx_unavailable, 0xf40, 0xf60)
__EXC_REAL_OOL(facility_unavailable, 0xf60, 0xf80)
__EXC_REAL_OOL_HV(h_facility_unavailable, 0xf80, 0xfa0)
EXC_REAL_NONE(0xfa0, 0x1200)
#ifdef CONFIG_CBE_RAS
EXC_REAL_HV(cbe_system_error, 0x1200, 0x1300)
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1200)
#else /* CONFIG_CBE_RAS */
EXC_REAL_NONE(0x1200, 0x1300)
#endif
EXC_REAL(instruction_breakpoint, 0x1300, 0x1400)
TRAMP_KVM_SKIP(PACA_EXGEN, 0x1300)
EXC_REAL_BEGIN(denorm_exception_hv, 0x1500, 0x1600)
mtspr SPRN_SPRG_HSCRATCH0,r13
powerpc: Save CFAR before branching in interrupt entry paths Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long, which is not enough for the full first-level interrupt handler. For these we currently just have a branch to an out-of-line handler. However, this means that we corrupt the CFAR (come-from address register) on POWER7 and later processors. To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces: EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest. We then put EXCEPTION_PROLOG_0 in the short interrupt vectors before we branch to the out-of-line handler, which contains the rest of the first-level interrupt handler. To facilitate this, we define new _OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc. In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more than 6 instructions, it was necessary to move the stores that move the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and to get rid of one of the two HMT_MEDIUM instructions. Previously there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was nop'd out on processors with the PPR (POWER7 and later), and then another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside __EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR. Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although this leaves it in for the interrupt vectors where there is room for it. Previously we had a handler for hypervisor maintenance interrupts at 0xe50, which doesn't leave enough room for the vector for hypervisor emulation assist interrupts at 0xe40, since we need 8 instructions. The 0xe50 vector was only used on POWER6, as the HMI vector was moved to 0xe60 on POWER7. Since we don't support running in hypervisor mode on POWER6, we just remove the handler at 0xe50. This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0 instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD from the relocation-on vectors (since any CPU that supports relocation-on interrupts also has the PPR). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-02-04 11:10:15 -07:00
EXCEPTION_PROLOG_0(PACA_EXGEN)
EXCEPTION_PROLOG_1(PACA_EXGEN, NOTEST, 0x1500)
#ifdef CONFIG_PPC_DENORMALISATION
mfspr r10,SPRN_HSRR1
mfspr r11,SPRN_HSRR0 /* save HSRR0 */
andis. r10,r10,(HSRR1_DENORM)@h /* denorm? */
addi r11,r11,-4 /* HSRR0 is next instruction */
bne+ denorm_assist
#endif
powerpc/book3s: handle machine check in Linux host. Move machine check entry point into Linux. So far we were dependent on firmware to decode MCE error details and handover the high level info to OS. This patch introduces early machine check routine that saves the MCE information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate stack frame on emergency stack and set the r1 accordingly. This allows us to be prepared to take another exception without loosing context. One thing to note here that, if we get another machine check while ME bit is off then we risk a checkstop. Hence we restrict ourselves to save only MCE information and register saved on PACA_EXMC save are before we turn the ME bit on. We use paca->in_mce flag to differentiate between first entry and nested machine check entry which helps proper use of emergency stack. We increment paca->in_mce every time we enter in early machine check handler and decrement it while leaving. When we enter machine check early handler first time (paca->in_mce == 0), we are sure nobody is using MC emergency stack and allocate a stack frame at the start of the emergency stack. During subsequent entry (paca->in_mce > 0), we know that r1 points inside emergency stack and we allocate separate stack frame accordingly. This prevents us from clobbering MCE information during nested machine checks. The early machine check handler changes are placed under CPU_FTR_HVMODE section. This makes sure that the early machine check handler will get executed only in hypervisor kernel. This is the code flow: Machine Check Interrupt | V 0x200 vector ME=0, IR=0, DR=0 | V +-----------------------------------------------+ |machine_check_pSeries_early: | ME=0, IR=0, DR=0 | Alloc frame on emergency stack | | Save srr1, srr0, dar and dsisr on stack | +-----------------------------------------------+ | (ME=1, IR=0, DR=0, RFID) | V machine_check_handle_early ME=1, IR=0, DR=0 | V +-----------------------------------------------+ | machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0 | Things to do: (in next patches) | | Flush SLB for SLB errors | | Flush TLB for TLB errors | | Decode and save MCE info | +-----------------------------------------------+ | (Fall through existing exception handler routine.) | V machine_check_pSerie ME=1, IR=0, DR=0 | (ME=1, IR=1, DR=1, RFID) | V machine_check_common ME=1, IR=1, DR=1 . . . Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-30 08:34:08 -06:00
KVMTEST_PR(0x1500)
EXCEPTION_PROLOG_PSERIES_1(denorm_common, EXC_HV)
EXC_REAL_END(denorm_exception_hv, 0x1500, 0x1600)
TRAMP_KVM_SKIP(PACA_EXGEN, 0x1500)
#ifdef CONFIG_CBE_RAS
EXC_REAL_HV(cbe_maintenance, 0x1600, 0x1700)
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1600)
#else /* CONFIG_CBE_RAS */
EXC_REAL_NONE(0x1600, 0x1700)
#endif
EXC_REAL(altivec_assist, 0x1700, 0x1800)
TRAMP_KVM(PACA_EXGEN, 0x1700)
#ifdef CONFIG_CBE_RAS
EXC_REAL_HV(cbe_thermal, 0x1800, 0x1900)
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0x1800)
#else /* CONFIG_CBE_RAS */
EXC_REAL_NONE(0x1800, 0x1900)
#endif
/*** Out of line interrupts support ***/
/* moved from 0x200 */
#ifdef CONFIG_PPC_DENORMALISATION
TRAMP_REAL_BEGIN(denorm_assist)
BEGIN_FTR_SECTION
/*
* To denormalise we need to move a copy of the register to itself.
* For POWER6 do that here for all FP regs.
*/
mfmsr r10
ori r10,r10,(MSR_FP|MSR_FE0|MSR_FE1)
xori r10,r10,(MSR_FE0|MSR_FE1)
mtmsrd r10
sync
#define FMR2(n) fmr (n), (n) ; fmr n+1, n+1
#define FMR4(n) FMR2(n) ; FMR2(n+2)
#define FMR8(n) FMR4(n) ; FMR4(n+4)
#define FMR16(n) FMR8(n) ; FMR8(n+8)
#define FMR32(n) FMR16(n) ; FMR16(n+16)
FMR32(0)
FTR_SECTION_ELSE
/*
* To denormalise we need to move a copy of the register to itself.
* For POWER7 do that here for the first 32 VSX registers only.
*/
mfmsr r10
oris r10,r10,MSR_VSX@h
mtmsrd r10
sync
#define XVCPSGNDP2(n) XVCPSGNDP(n,n,n) ; XVCPSGNDP(n+1,n+1,n+1)
#define XVCPSGNDP4(n) XVCPSGNDP2(n) ; XVCPSGNDP2(n+2)
#define XVCPSGNDP8(n) XVCPSGNDP4(n) ; XVCPSGNDP4(n+4)
#define XVCPSGNDP16(n) XVCPSGNDP8(n) ; XVCPSGNDP8(n+8)
#define XVCPSGNDP32(n) XVCPSGNDP16(n) ; XVCPSGNDP16(n+16)
XVCPSGNDP32(0)
ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_206)
BEGIN_FTR_SECTION
b denorm_done
END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
/*
* To denormalise we need to move a copy of the register to itself.
* For POWER8 we need to do that for all 64 VSX registers
*/
XVCPSGNDP32(32)
denorm_done:
mtspr SPRN_HSRR0,r11
mtcrf 0x80,r9
ld r9,PACA_EXGEN+EX_R9(r13)
RESTORE_PPR_PACA(PACA_EXGEN, r10)
BEGIN_FTR_SECTION
ld r10,PACA_EXGEN+EX_CFAR(r13)
mtspr SPRN_CFAR,r10
END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
ld r10,PACA_EXGEN+EX_R10(r13)
ld r11,PACA_EXGEN+EX_R11(r13)
ld r12,PACA_EXGEN+EX_R12(r13)
ld r13,PACA_EXGEN+EX_R13(r13)
HRFID
b .
#endif
/* moved from 0xe00 */
__TRAMP_REAL_REAL_OOL_HV(h_data_storage, 0xe00)
TRAMP_KVM_HV_SKIP(PACA_EXGEN, 0xe00)
__TRAMP_REAL_REAL_OOL_HV(h_instr_storage, 0xe20)
TRAMP_KVM_HV(PACA_EXGEN, 0xe20)
__TRAMP_REAL_REAL_OOL_HV(emulation_assist, 0xe40)
TRAMP_KVM_HV(PACA_EXGEN, 0xe40)
__TRAMP_REAL_REAL_OOL_MASKABLE_HV(hmi_exception, 0xe60)
TRAMP_KVM_HV(PACA_EXGEN, 0xe60)
__TRAMP_REAL_REAL_OOL_MASKABLE_HV(h_doorbell, 0xe80)
TRAMP_KVM_HV(PACA_EXGEN, 0xe80)
__TRAMP_REAL_REAL_OOL_MASKABLE_HV(h_virt_irq, 0xea0)
TRAMP_KVM_HV(PACA_EXGEN, 0xea0)
/* moved from 0xf00 */
__TRAMP_REAL_REAL_OOL(performance_monitor, 0xf00)
TRAMP_KVM(PACA_EXGEN, 0xf00)
__TRAMP_REAL_REAL_OOL(altivec_unavailable, 0xf20)
TRAMP_KVM(PACA_EXGEN, 0xf20)
__TRAMP_REAL_REAL_OOL(vsx_unavailable, 0xf40)
TRAMP_KVM(PACA_EXGEN, 0xf40)
__TRAMP_REAL_REAL_OOL(facility_unavailable, 0xf60)
TRAMP_KVM(PACA_EXGEN, 0xf60)
__TRAMP_REAL_REAL_OOL_HV(h_facility_unavailable, 0xf80)
TRAMP_KVM_HV(PACA_EXGEN, 0xf80)
/*
* An interrupt came in while soft-disabled. We set paca->irq_happened, then:
* - If it was a decrementer interrupt, we bump the dec to max and and return.
* - If it was a doorbell we return immediately since doorbells are edge
* triggered and won't automatically refire.
* - If it was a HMI we return immediately since we handled it in realmode
* and it won't refire.
* - else we hard disable and return.
* This is called with r10 containing the value to OR to the paca field.
*/
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
#define MASKED_INTERRUPT(_H) \
masked_##_H##interrupt: \
std r11,PACA_EXGEN+EX_R11(r13); \
lbz r11,PACAIRQHAPPENED(r13); \
or r11,r11,r10; \
stb r11,PACAIRQHAPPENED(r13); \
cmpwi r10,PACA_IRQ_DEC; \
bne 1f; \
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
lis r10,0x7fff; \
ori r10,r10,0xffff; \
mtspr SPRN_DEC,r10; \
b 2f; \
1: cmpwi r10,PACA_IRQ_DBELL; \
beq 2f; \
cmpwi r10,PACA_IRQ_HMI; \
beq 2f; \
mfspr r10,SPRN_##_H##SRR1; \
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
rldicl r10,r10,48,1; /* clear MSR_EE */ \
rotldi r10,r10,16; \
mtspr SPRN_##_H##SRR1,r10; \
2: mtcrf 0x80,r9; \
ld r9,PACA_EXGEN+EX_R9(r13); \
ld r10,PACA_EXGEN+EX_R10(r13); \
ld r11,PACA_EXGEN+EX_R11(r13); \
GET_SCRATCH0(r13); \
##_H##rfid; \
b .
/*
* Real mode exceptions actually use this too, but alternate
* instruction code patches (which end up in the common .text area)
* cannot reach these if they are put there.
*/
USE_FIXED_SECTION(virt_trampolines)
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
MASKED_INTERRUPT()
MASKED_INTERRUPT(H)
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
/*
* Called from arch_local_irq_enable when an interrupt needs
* to be resent. r3 contains 0x500, 0x900, 0xa00 or 0xe80 to indicate
* which kind of interrupt. MSR:EE is already off. We generate a
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
* stackframe like if a real interrupt had happened.
*
* Note: While MSR:EE is off, we need to make sure that _MSR
* in the generated frame has EE set to 1 or the exception
* handler will not properly re-enable them.
*/
USE_TEXT_SECTION()
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
_GLOBAL(__replay_interrupt)
/* We are going to jump to the exception common code which
* will retrieve various register values from the PACA which
* we don't give a damn about, so we don't bother storing them.
*/
mfmsr r12
mflr r11
mfcr r9
ori r12,r12,MSR_EE
cmpwi r3,0x900
beq decrementer_common
cmpwi r3,0x500
beq hardware_interrupt_common
BEGIN_FTR_SECTION
cmpwi r3,0xe80
beq h_doorbell_common
cmpwi r3,0xea0
beq h_virt_irq_common
KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt When a guest is assigned to a core it converts the host Timebase (TB) into guest TB by adding guest timebase offset before entering into guest. During guest exit it restores the guest TB to host TB. This means under certain conditions (Guest migration) host TB and guest TB can differ. When we get an HMI for TB related issues the opal HMI handler would try fixing errors and restore the correct host TB value. With no guest running, we don't have any issues. But with guest running on the core we run into TB corruption issues. If we get an HMI while in the guest, the current HMI handler invokes opal hmi handler before forcing guest to exit. The guest exit path subtracts the guest TB offset from the current TB value which may have already been restored with host value by opal hmi handler. This leads to incorrect host and guest TB values. With split-core, things become more complex. With split-core, TB also gets split and each subcore gets its own TB register. When a hmi handler fixes a TB error and restores the TB value, it affects all the TB values of sibling subcores on the same core. On TB errors all the thread in the core gets HMI. With existing code, the individual threads call opal hmi handle independently which can easily throw TB out of sync if we have guest running on subcores. Hence we will need to co-ordinate with all the threads before making opal hmi handler call followed by TB resync. This patch introduces a sibling subcore state structure (shared by all threads in the core) in paca which holds information about whether sibling subcores are in Guest mode or host mode. An array in_guest[] of size MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore. The subcore id is used as index into in_guest[] array. Only primary thread entering/exiting the guest is responsible to set/unset its designated array element. On TB error, we get HMI interrupt on every thread on the core. Upon HMI, this patch will now force guest to vacate the core/subcore. Primary thread from each subcore will then turn off its respective bit from the above bitmap during the guest exit path just after the guest->host partition switch is complete. All other threads that have just exited the guest OR were already in host will wait until all other subcores clears their respective bit. Once all the subcores turn off their respective bit, all threads will will make call to opal hmi handler. It is not necessary that opal hmi handler would resync the TB value for every HMI interrupts. It would do so only for the HMI caused due to TB errors. For rest, it would not touch TB value. Hence to make things simpler, primary thread would call TB resync explicitly once for each core immediately after opal hmi handler instead of subtracting guest offset from TB. TB resync call will restore the TB with host value. Thus we can be sure about the TB state. One of the primary threads exiting the guest will take up the responsibility of calling TB resync. It will use one of the top bits (bit 63) from subcore state flags bitmap to make the decision. The first primary thread (among the subcores) that is able to set the bit will have to call the TB resync. Rest all other threads will wait until TB resync is complete. Once TB resync is complete all threads will then proceed. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-05-14 22:14:26 -06:00
cmpwi r3,0xe60
beq hmi_exception_common
FTR_SECTION_ELSE
cmpwi r3,0xa00
beq doorbell_super_common
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
blr
#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
TRAMP_REAL_BEGIN(kvmppc_skip_interrupt)
/*
* Here all GPRs are unchanged from when the interrupt happened
* except for r13, which is saved in SPRG_SCRATCH0.
*/
mfspr r13, SPRN_SRR0
addi r13, r13, 4
mtspr SPRN_SRR0, r13
GET_SCRATCH0(r13)
rfid
b .
TRAMP_REAL_BEGIN(kvmppc_skip_Hinterrupt)
/*
* Here all GPRs are unchanged from when the interrupt happened
* except for r13, which is saved in SPRG_SCRATCH0.
*/
mfspr r13, SPRN_HSRR0
addi r13, r13, 4
mtspr SPRN_HSRR0, r13
GET_SCRATCH0(r13)
hrfid
b .
#endif
/*
* Ensure that any handlers that get invoked from the exception prologs
* above are below the first 64KB (0x10000) of the kernel image because
* the prologs assemble the addresses of these handlers using the
* LOAD_HANDLER macro, which uses an ori instruction.
*/
/*** Common interrupt handlers ***/
#ifdef CONFIG_PPC_DOORBELL
EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, doorbell_exception)
#else
EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, unknown_exception)
#endif
EXC_COMMON(trap_0b_common, 0xb00, unknown_exception)
EXC_COMMON(single_step_common, 0xd00, single_step_exception)
EXC_COMMON(trap_0e_common, 0xe00, unknown_exception)
EXC_COMMON(emulation_assist_common, 0xe40, emulation_assist_interrupt)
EXC_COMMON_ASYNC(hmi_exception_common, 0xe60, handle_hmi_exception)
#ifdef CONFIG_PPC_DOORBELL
EXC_COMMON_ASYNC(h_doorbell_common, 0xe80, doorbell_exception)
#else
EXC_COMMON_ASYNC(h_doorbell_common, 0xe80, unknown_exception)
#endif
EXC_COMMON_ASYNC(h_virt_irq_common, 0xea0, do_IRQ)
EXC_COMMON_ASYNC(performance_monitor_common, 0xf00, performance_monitor_exception)
EXC_COMMON(instruction_breakpoint_common, 0x1300, instruction_breakpoint_exception)
EXC_COMMON_HV(denorm_common, 0x1500, unknown_exception)
#ifdef CONFIG_ALTIVEC
EXC_COMMON(altivec_assist_common, 0x1700, altivec_assist_exception)
#else
EXC_COMMON(altivec_assist_common, 0x1700, unknown_exception)
#endif
/*
* Relocation-on interrupts: A subset of the interrupts can be delivered
* with IR=1/DR=1, if AIL==2 and MSR.HV won't be changed by delivering
* it. Addresses are the same as the original interrupt addresses, but
* offset by 0xc000000000004000.
* It's impossible to receive interrupts below 0x300 via this mechanism.
* KVM: None of these traps are from the guest ; anything that escalated
* to HV=1 from HV=0 is delivered via real mode handlers.
*/
/*
* This uses the standard macro, since the original 0x300 vector
* only has extra guff for STAB-based processors -- which never
* come here.
*/
EXC_VIRT_MASKABLE(doorbell_super, 0x4a00, 0x4b00, 0xa00)
EXC_VIRT(trap_0b, 0x4b00, 0x4c00, 0xb00)
EXC_VIRT_BEGIN(system_call, 0x4c00, 0x4d00)
HMT_MEDIUM
SYSCALL_PSERIES_1
SYSCALL_PSERIES_2_DIRECT
SYSCALL_PSERIES_3
EXC_VIRT_END(system_call, 0x4c00, 0x4d00)
EXC_VIRT(single_step, 0x4d00, 0x4e00, 0xd00)
EXC_VIRT_BEGIN(unused, 0x4e00, 0x4e20)
b . /* Can't happen, see v2.07 Book III-S section 6.5 */
EXC_VIRT_END(unused, 0x4e00, 0x4e20)
EXC_VIRT_BEGIN(unused, 0x4e20, 0x4e40)
b . /* Can't happen, see v2.07 Book III-S section 6.5 */
EXC_VIRT_END(unused, 0x4e20, 0x4e40)
__EXC_VIRT_OOL_HV(emulation_assist, 0x4e40, 0x4e60)
EXC_VIRT_BEGIN(unused, 0x4e60, 0x4e80)
b . /* Can't happen, see v2.07 Book III-S section 6.5 */
EXC_VIRT_END(unused, 0x4e60, 0x4e80)
__EXC_VIRT_OOL_MASKABLE_HV(h_doorbell, 0x4e80, 0x4ea0)
__EXC_VIRT_OOL_MASKABLE_HV(h_virt_irq, 0x4ea0, 0x4ec0)
EXC_VIRT_NONE(0x4ec0, 0x4f00)
__EXC_VIRT_OOL(performance_monitor, 0x4f00, 0x4f20)
__EXC_VIRT_OOL(altivec_unavailable, 0x4f20, 0x4f40)
__EXC_VIRT_OOL(vsx_unavailable, 0x4f40, 0x4f60)
__EXC_VIRT_OOL(facility_unavailable, 0x4f60, 0x4f80)
__EXC_VIRT_OOL_HV(h_facility_unavailable, 0x4f80, 0x4fa0)
EXC_VIRT_NONE(0x4fa0, 0x5200)
EXC_VIRT_NONE(0x5200, 0x5300)
EXC_VIRT(instruction_breakpoint, 0x5300, 0x5400, 0x1300)
#ifdef CONFIG_PPC_DENORMALISATION
EXC_VIRT_BEGIN(denorm_exception, 0x5500, 0x5600)
b exc_real_0x1500_denorm_exception_hv
EXC_VIRT_END(denorm_exception, 0x5500, 0x5600)
#else
EXC_VIRT_NONE(0x5500, 0x5600)
#endif
EXC_VIRT_NONE(0x5600, 0x5700)
EXC_VIRT(altivec_assist, 0x5700, 0x5800, 0x1700)
EXC_VIRT_NONE(0x5800, 0x5900)
EXC_COMMON_BEGIN(ppc64_runlatch_on_trampoline)
b __ppc64_runlatch_on
EXC_COMMON_BEGIN(h_data_storage_common)
mfspr r10,SPRN_HDAR
std r10,PACA_EXGEN+EX_DAR(r13)
mfspr r10,SPRN_HDSISR
stw r10,PACA_EXGEN+EX_DSISR(r13)
EXCEPTION_PROLOG_COMMON(0xe00, PACA_EXGEN)
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl unknown_exception
b ret_from_except
EXC_COMMON(h_instr_storage_common, 0xe20, unknown_exception)
EXC_COMMON_BEGIN(altivec_unavailable_common)
EXCEPTION_PROLOG_COMMON(0xf20, PACA_EXGEN)
#ifdef CONFIG_ALTIVEC
BEGIN_FTR_SECTION
beq 1f
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
BEGIN_FTR_SECTION_NESTED(69)
/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
* transaction), go do TM stuff
*/
rldicl. r0, r12, (64-MSR_TS_LG), (64-2)
bne- 2f
END_FTR_SECTION_NESTED(CPU_FTR_TM, CPU_FTR_TM, 69)
#endif
bl load_up_altivec
b fast_exception_return
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2: /* User process was in a transaction */
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl altivec_unavailable_tm
b ret_from_except
#endif
1:
END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
#endif
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl altivec_unavailable_exception
b ret_from_except
EXC_COMMON_BEGIN(vsx_unavailable_common)
EXCEPTION_PROLOG_COMMON(0xf40, PACA_EXGEN)
#ifdef CONFIG_VSX
BEGIN_FTR_SECTION
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
beq 1f
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
BEGIN_FTR_SECTION_NESTED(69)
/* Test if 2 TM state bits are zero. If non-zero (ie. userspace was in
* transaction), go do TM stuff
*/
rldicl. r0, r12, (64-MSR_TS_LG), (64-2)
bne- 2f
END_FTR_SECTION_NESTED(CPU_FTR_TM, CPU_FTR_TM, 69)
#endif
b load_up_vsx
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
2: /* User process was in a transaction */
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl vsx_unavailable_tm
b ret_from_except
#endif
1:
END_FTR_SECTION_IFSET(CPU_FTR_VSX)
#endif
bl save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addi r3,r1,STACK_FRAME_OVERHEAD
bl vsx_unavailable_exception
b ret_from_except
/* Equivalents to the above handlers for relocation-on interrupt vectors */
__TRAMP_REAL_VIRT_OOL_HV(emulation_assist, 0xe40)
__TRAMP_REAL_VIRT_OOL_MASKABLE_HV(h_doorbell, 0xe80)
__TRAMP_REAL_VIRT_OOL_MASKABLE_HV(h_virt_irq, 0xea0)
__TRAMP_REAL_VIRT_OOL(performance_monitor, 0xf00)
__TRAMP_REAL_VIRT_OOL(altivec_unavailable, 0xf20)
__TRAMP_REAL_VIRT_OOL(vsx_unavailable, 0xf40)
__TRAMP_REAL_VIRT_OOL(facility_unavailable, 0xf60)
__TRAMP_REAL_VIRT_OOL_HV(h_facility_unavailable, 0xf80)
USE_FIXED_SECTION(virt_trampolines)
powerpc/book3s64: Fix branching to OOL handlers in relocatable kernel Some of the interrupt vectors on 64-bit POWER server processors are only 32 bytes long (8 instructions), which is not enough for the full first-level interrupt handler. For these we need to branch to an out-of-line (OOL) handler. But when we are running a relocatable kernel, interrupt vectors till __end_interrupts marker are copied down to real address 0x100. So, branching to labels (ie. OOL handlers) outside this section must be handled differently (see LOAD_HANDLER()), considering relocatable kernel, which would need at least 4 instructions. However, branching from interrupt vector means that we corrupt the CFAR (come-from address register) on POWER7 and later processors as mentioned in commit 1707dd16. So, EXCEPTION_PROLOG_0 (6 instructions) that contains the part up to the point where the CFAR is saved in the PACA should be part of the short interrupt vectors before we branch out to OOL handlers. But as mentioned already, there are interrupt vectors on 64-bit POWER server processors that are only 32 bytes long (like vectors 0x4f00, 0x4f20, etc.), which cannot accomodate the above two cases at the same time owing to space constraint. Currently, in these interrupt vectors, we simply branch out to OOL handlers, without using LOAD_HANDLER(), which leaves us vulnerable when running a relocatable kernel (eg. kdump case). While this has been the case for sometime now and kdump is used widely, we were fortunate not to see any problems so far, for three reasons: 1. In almost all cases, production kernel (relocatable) is used for kdump as well, which would mean that crashed kernel's OOL handler would be at the same place where we end up branching to, from short interrupt vector of kdump kernel. 2. Also, OOL handler was unlikely the reason for crash in almost all the kdump scenarios, which meant we had a sane OOL handler from crashed kernel that we branched to. 3. On most 64-bit POWER server processors, page size is large enough that marking interrupt vector code as executable (see commit 429d2e83) leads to marking OOL handler code from crashed kernel, that sits right below interrupt vector code from kdump kernel, as executable as well. Let us fix this by moving the __end_interrupts marker down past OOL handlers to make sure that we also copy OOL handlers to real address 0x100 when running a relocatable kernel. This fix has been tested successfully in kdump scenario, on an LPAR with 4K page size by using different default/production kernel and kdump kernel. Also tested by manually corrupting the OOL handlers in the first kernel and then kdump'ing, and then causing the OOL handlers to fire - mpe. Fixes: c1fb6816fb1b ("powerpc: Add relocation on exception vector handlers") Cc: stable@vger.kernel.org Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-04-15 06:48:02 -06:00
/*
* The __end_interrupts marker must be past the out-of-line (OOL)
* handlers, so that they are copied to real address 0x100 when running
* a relocatable kernel. This ensures they can be reached from the short
* trampoline handlers (like 0x4f00, 0x4f20, etc.) which branch
* directly, without using LOAD_HANDLER().
*/
.align 7
.globl __end_interrupts
__end_interrupts:
DEFINE_FIXED_SYMBOL(__end_interrupts)
EXC_COMMON(facility_unavailable_common, 0xf60, facility_unavailable_exception)
EXC_COMMON(h_facility_unavailable_common, 0xf80, facility_unavailable_exception)
#ifdef CONFIG_CBE_RAS
EXC_COMMON(cbe_system_error_common, 0x1200, cbe_system_error_exception)
EXC_COMMON(cbe_maintenance_common, 0x1600, cbe_maintenance_exception)
EXC_COMMON(cbe_thermal_common, 0x1800, cbe_thermal_exception)
#endif /* CONFIG_CBE_RAS */
TRAMP_REAL_BEGIN(hmi_exception_early)
EXCEPTION_PROLOG_1(PACA_EXGEN, KVMTEST_HV, 0xe60)
mr r10,r1 /* Save r1 */
ld r1,PACAEMERGSP(r13) /* Use emergency stack */
subi r1,r1,INT_FRAME_SIZE /* alloc stack frame */
std r9,_CCR(r1) /* save CR in stackframe */
mfspr r11,SPRN_HSRR0 /* Save HSRR0 */
std r11,_NIP(r1) /* save HSRR0 in stackframe */
mfspr r12,SPRN_HSRR1 /* Save SRR1 */
std r12,_MSR(r1) /* save SRR1 in stackframe */
std r10,0(r1) /* make stack chain pointer */
std r0,GPR0(r1) /* save r0 in stackframe */
std r10,GPR1(r1) /* save r1 in stackframe */
EXCEPTION_PROLOG_COMMON_2(PACA_EXGEN)
EXCEPTION_PROLOG_COMMON_3(0xe60)
addi r3,r1,STACK_FRAME_OVERHEAD
bl hmi_exception_realmode
/* Windup the stack. */
/* Move original HSRR0 and HSRR1 into the respective regs */
ld r9,_MSR(r1)
mtspr SPRN_HSRR1,r9
ld r3,_NIP(r1)
mtspr SPRN_HSRR0,r3
ld r9,_CTR(r1)
mtctr r9
ld r9,_XER(r1)
mtxer r9
ld r9,_LINK(r1)
mtlr r9
REST_GPR(0, r1)
REST_8GPRS(2, r1)
REST_GPR(10, r1)
ld r11,_CCR(r1)
mtcr r11
REST_GPR(11, r1)
REST_2GPRS(12, r1)
/* restore original r1. */
ld r1,GPR1(r1)
/*
* Go to virtual mode and pull the HMI event information from
* firmware.
*/
.globl hmi_exception_after_realmode
hmi_exception_after_realmode:
SET_SCRATCH0(r13)
EXCEPTION_PROLOG_0(PACA_EXGEN)
b tramp_real_hmi_exception
#ifdef CONFIG_PPC_970_NAP
TRAMP_REAL_BEGIN(power4_fixup_nap)
andc r9,r9,r10
std r9,TI_LOCAL_FLAGS(r11)
ld r10,_LINK(r1) /* make idle task do the */
std r10,_NIP(r1) /* equivalent of a blr */
blr
#endif
CLOSE_FIXED_SECTION(real_vectors);
CLOSE_FIXED_SECTION(real_trampolines);
CLOSE_FIXED_SECTION(virt_vectors);
CLOSE_FIXED_SECTION(virt_trampolines);
USE_TEXT_SECTION()
/*
* Hash table stuff
*/
.align 7
do_hash_page:
#ifdef CONFIG_PPC_STD_MMU_64
andis. r0,r4,0xa410 /* weird error? */
bne- handle_page_fault /* if not, try to insert a HPTE */
andis. r0,r4,DSISR_DABRMATCH@h
bne- handle_dabr_fault
CURRENT_THREAD_INFO(r11, r1)
powerpc: Allow perf_counters to access user memory at interrupt time This provides a mechanism to allow the perf_counters code to access user memory in a PMU interrupt routine. Such an access can cause various kinds of interrupt: SLB miss, MMU hash table miss, segment table miss, or TLB miss, depending on the processor. This commit only deals with 64-bit classic/server processors, which use an MMU hash table. 32-bit processors are already able to access user memory at interrupt time. Since we don't soft-disable on 32-bit, we avoid the possibility of reentering hash_page or the TLB miss handlers, since they run with interrupts disabled. On 64-bit processors, an SLB miss interrupt on a user address will update the slb_cache and slb_cache_ptr fields in the paca. This is OK except in the case where a PMU interrupt occurs in switch_slb, which also accesses those fields. To prevent this, we hard-disable interrupts in switch_slb. Interrupts are already soft-disabled at this point, and will get hard-enabled when they get soft-enabled later. This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice, and to make sure that it clears the slb_cache_ptr when called from other callers than switch_slb, the existing routine is renamed to __slb_flush_and_rebolt, which is called by switch_slb and the new version of slb_flush_and_rebolt. Similarly, switch_stab (used on POWER3 and RS64 processors) gets a hard_irq_disable() to protect the per-cpu variables used there and in ste_allocate. If a MMU hashtable miss interrupt occurs, normally we would call hash_page to look up the Linux PTE for the address and create a HPTE. However, hash_page is fairly complex and takes some locks, so to avoid the possibility of deadlock, we check the preemption count to see if we are in a (pseudo-)NMI handler, and if so, we don't call hash_page but instead treat it like a bad access that will get reported up through the exception table mechanism. An interrupt whose handler runs even though the interrupt occurred when soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI handler, which should use nmi_enter()/nmi_exit() rather than irq_enter()/irq_exit(). Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-16 23:17:54 -06:00
lwz r0,TI_PREEMPT(r11) /* If we're in an "NMI" */
andis. r0,r0,NMI_MASK@h /* (i.e. an irq when soft-disabled) */
bne 77f /* then don't call hash_page now */
/*
* r3 contains the faulting address
* r4 msr
* r5 contains the trap number
powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault upatepp can get called for a nohpte fault when we find from the linux page table that the translation was hashed before. In that case we are sure that there is no existing translation, hence we could avoid doing tlbie. We could possibly race with a parallel fault filling the TLB. But that should be ok because updatepp is only ever relaxing permissions. We also look at linux pte permission bits when filling hash pte permission bits. We also hold the linux pte busy bits while inserting/updating a hashpte entry, hence a paralle update of linux pte is not possible. On the other hand mprotect involves ptep_modify_prot_start which cause a hpte invalidate and not updatepp. Performance number: We use randbox_access_bench written by Anton. Kernel with THP disabled and smaller hash page table size. 86.60% random_access_b [kernel.kallsyms] [k] .native_hpte_updatepp 2.10% random_access_b random_access_bench [.] doit 1.99% random_access_b [kernel.kallsyms] [k] .do_raw_spin_lock 1.85% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 1.26% random_access_b [kernel.kallsyms] [k] .native_flush_hash_range 1.18% random_access_b [kernel.kallsyms] [k] .__delay 0.69% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 0.37% random_access_b [kernel.kallsyms] [k] .clear_user_page 0.34% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 0.32% random_access_b [kernel.kallsyms] [k] fast_exception_return 0.30% random_access_b [kernel.kallsyms] [k] .hash_page_mm With Fix: 27.54% random_access_b random_access_bench [.] doit 22.90% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 5.76% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 5.20% random_access_b [kernel.kallsyms] [k] fast_exception_return 5.12% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 4.80% random_access_b [kernel.kallsyms] [k] .hash_page_mm 3.31% random_access_b [kernel.kallsyms] [k] data_access_common 1.84% random_access_b [kernel.kallsyms] [k] .trace_hardirqs_on_caller Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 22:30:14 -07:00
* r6 contains dsisr
*
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
* at return r3 = 0 for success, 1 for page fault, negative for error
*/
mr r4,r12
powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault upatepp can get called for a nohpte fault when we find from the linux page table that the translation was hashed before. In that case we are sure that there is no existing translation, hence we could avoid doing tlbie. We could possibly race with a parallel fault filling the TLB. But that should be ok because updatepp is only ever relaxing permissions. We also look at linux pte permission bits when filling hash pte permission bits. We also hold the linux pte busy bits while inserting/updating a hashpte entry, hence a paralle update of linux pte is not possible. On the other hand mprotect involves ptep_modify_prot_start which cause a hpte invalidate and not updatepp. Performance number: We use randbox_access_bench written by Anton. Kernel with THP disabled and smaller hash page table size. 86.60% random_access_b [kernel.kallsyms] [k] .native_hpte_updatepp 2.10% random_access_b random_access_bench [.] doit 1.99% random_access_b [kernel.kallsyms] [k] .do_raw_spin_lock 1.85% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 1.26% random_access_b [kernel.kallsyms] [k] .native_flush_hash_range 1.18% random_access_b [kernel.kallsyms] [k] .__delay 0.69% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 0.37% random_access_b [kernel.kallsyms] [k] .clear_user_page 0.34% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 0.32% random_access_b [kernel.kallsyms] [k] fast_exception_return 0.30% random_access_b [kernel.kallsyms] [k] .hash_page_mm With Fix: 27.54% random_access_b random_access_bench [.] doit 22.90% random_access_b [kernel.kallsyms] [k] .native_hpte_insert 5.76% random_access_b [kernel.kallsyms] [k] .native_hpte_remove 5.20% random_access_b [kernel.kallsyms] [k] fast_exception_return 5.12% random_access_b [kernel.kallsyms] [k] .__hash_page_64K 4.80% random_access_b [kernel.kallsyms] [k] .hash_page_mm 3.31% random_access_b [kernel.kallsyms] [k] data_access_common 1.84% random_access_b [kernel.kallsyms] [k] .trace_hardirqs_on_caller Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 22:30:14 -07:00
ld r6,_DSISR(r1)
bl __hash_page /* build HPTE if possible */
cmpdi r3,0 /* see if __hash_page succeeded */
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
/* Success */
beq fast_exc_return_irq /* Return from exception on success */
powerpc: Rework lazy-interrupt handling The current implementation of lazy interrupts handling has some issues that this tries to address. We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts. The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor. Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal. This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up. The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled. When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field. We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case). This removes the need to play with the decrementer to try to create fake interrupts, among others. In addition, this adds a few refinements: - We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts. - Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality. - On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops) - We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality. Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- v2: - Add hard-enable to decrementer, timer and doorbells - Fix CR clobber in masked irq handling on BookE - Make embedded perf interrupt act as an NMI - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable v3: - Fix or vs. ori bug on Book3E - Fix enabling of interrupts for some exceptions on Book3E v4: - Fix resend of doorbells on return from interrupt on Book3E v5: - Rebased on top of my latest series, which involves some significant rework of some aspects of the patch. v6: - 32-bit compile fix - more compile fixes with various .config combos - factor out the asm code to soft-disable interrupts - remove the C wrapper around preempt_schedule_irq v7: - Fix a bug with hard irq state tracking on native power7
2012-03-06 00:27:59 -07:00
/* Error */
blt- 13f
#endif /* CONFIG_PPC_STD_MMU_64 */
/* Here we have a page fault that hash_page can't handle. */
handle_page_fault:
11: ld r4,_DAR(r1)
ld r5,_DSISR(r1)
addi r3,r1,STACK_FRAME_OVERHEAD
bl do_page_fault
cmpdi r3,0
beq+ 12f
bl save_nvgprs
mr r5,r3
addi r3,r1,STACK_FRAME_OVERHEAD
lwz r4,_DAR(r1)
bl bad_page_fault
b ret_from_except
/* We have a data breakpoint exception - handle it */
handle_dabr_fault:
bl save_nvgprs
ld r4,_DAR(r1)
ld r5,_DSISR(r1)
addi r3,r1,STACK_FRAME_OVERHEAD
bl do_break
12: b ret_from_except_lite
#ifdef CONFIG_PPC_STD_MMU_64
/* We have a page fault that hash_page could handle but HV refused
* the PTE insertion
*/
13: bl save_nvgprs
mr r5,r3
addi r3,r1,STACK_FRAME_OVERHEAD
ld r4,_DAR(r1)
bl low_hash_fault
b ret_from_except
#endif
powerpc: Allow perf_counters to access user memory at interrupt time This provides a mechanism to allow the perf_counters code to access user memory in a PMU interrupt routine. Such an access can cause various kinds of interrupt: SLB miss, MMU hash table miss, segment table miss, or TLB miss, depending on the processor. This commit only deals with 64-bit classic/server processors, which use an MMU hash table. 32-bit processors are already able to access user memory at interrupt time. Since we don't soft-disable on 32-bit, we avoid the possibility of reentering hash_page or the TLB miss handlers, since they run with interrupts disabled. On 64-bit processors, an SLB miss interrupt on a user address will update the slb_cache and slb_cache_ptr fields in the paca. This is OK except in the case where a PMU interrupt occurs in switch_slb, which also accesses those fields. To prevent this, we hard-disable interrupts in switch_slb. Interrupts are already soft-disabled at this point, and will get hard-enabled when they get soft-enabled later. This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice, and to make sure that it clears the slb_cache_ptr when called from other callers than switch_slb, the existing routine is renamed to __slb_flush_and_rebolt, which is called by switch_slb and the new version of slb_flush_and_rebolt. Similarly, switch_stab (used on POWER3 and RS64 processors) gets a hard_irq_disable() to protect the per-cpu variables used there and in ste_allocate. If a MMU hashtable miss interrupt occurs, normally we would call hash_page to look up the Linux PTE for the address and create a HPTE. However, hash_page is fairly complex and takes some locks, so to avoid the possibility of deadlock, we check the preemption count to see if we are in a (pseudo-)NMI handler, and if so, we don't call hash_page but instead treat it like a bad access that will get reported up through the exception table mechanism. An interrupt whose handler runs even though the interrupt occurred when soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI handler, which should use nmi_enter()/nmi_exit() rather than irq_enter()/irq_exit(). Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-16 23:17:54 -06:00
/*
* We come here as a result of a DSI at a point where we don't want
* to call hash_page, such as when we are accessing memory (possibly
* user memory) inside a PMU interrupt that occurred while interrupts
* were soft-disabled. We want to invoke the exception handler for
* the access, or panic if there isn't a handler.
*/
77: bl save_nvgprs
powerpc: Allow perf_counters to access user memory at interrupt time This provides a mechanism to allow the perf_counters code to access user memory in a PMU interrupt routine. Such an access can cause various kinds of interrupt: SLB miss, MMU hash table miss, segment table miss, or TLB miss, depending on the processor. This commit only deals with 64-bit classic/server processors, which use an MMU hash table. 32-bit processors are already able to access user memory at interrupt time. Since we don't soft-disable on 32-bit, we avoid the possibility of reentering hash_page or the TLB miss handlers, since they run with interrupts disabled. On 64-bit processors, an SLB miss interrupt on a user address will update the slb_cache and slb_cache_ptr fields in the paca. This is OK except in the case where a PMU interrupt occurs in switch_slb, which also accesses those fields. To prevent this, we hard-disable interrupts in switch_slb. Interrupts are already soft-disabled at this point, and will get hard-enabled when they get soft-enabled later. This also reworks slb_flush_and_rebolt: to avoid hard-disabling twice, and to make sure that it clears the slb_cache_ptr when called from other callers than switch_slb, the existing routine is renamed to __slb_flush_and_rebolt, which is called by switch_slb and the new version of slb_flush_and_rebolt. Similarly, switch_stab (used on POWER3 and RS64 processors) gets a hard_irq_disable() to protect the per-cpu variables used there and in ste_allocate. If a MMU hashtable miss interrupt occurs, normally we would call hash_page to look up the Linux PTE for the address and create a HPTE. However, hash_page is fairly complex and takes some locks, so to avoid the possibility of deadlock, we check the preemption count to see if we are in a (pseudo-)NMI handler, and if so, we don't call hash_page but instead treat it like a bad access that will get reported up through the exception table mechanism. An interrupt whose handler runs even though the interrupt occurred when soft-disabled (such as the PMU interrupt) is considered a pseudo-NMI handler, which should use nmi_enter()/nmi_exit() rather than irq_enter()/irq_exit(). Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-08-16 23:17:54 -06:00
mr r4,r3
addi r3,r1,STACK_FRAME_OVERHEAD
li r5,SIGSEGV
bl bad_page_fault
b ret_from_except
/*
* Here we have detected that the kernel stack pointer is bad.
* R9 contains the saved CR, r13 points to the paca,
* r10 contains the (bad) kernel stack pointer,
* r11 and r12 contain the saved SRR0 and SRR1.
* We switch to using an emergency stack, save the registers there,
* and call kernel_bad_stack(), which panics.
*/
bad_stack:
ld r1,PACAEMERGSP(r13)
subi r1,r1,64+INT_FRAME_SIZE
std r9,_CCR(r1)
std r10,GPR1(r1)
std r11,_NIP(r1)
std r12,_MSR(r1)
mfspr r11,SPRN_DAR
mfspr r12,SPRN_DSISR
std r11,_DAR(r1)
std r12,_DSISR(r1)
mflr r10
mfctr r11
mfxer r12
std r10,_LINK(r1)
std r11,_CTR(r1)
std r12,_XER(r1)
SAVE_GPR(0,r1)
SAVE_GPR(2,r1)
ld r10,EX_R3(r3)
std r10,GPR3(r1)
SAVE_GPR(4,r1)
SAVE_4GPRS(5,r1)
ld r9,EX_R9(r3)
ld r10,EX_R10(r3)
SAVE_2GPRS(9,r1)
ld r9,EX_R11(r3)
ld r10,EX_R12(r3)
ld r11,EX_R13(r3)
std r9,GPR11(r1)
std r10,GPR12(r1)
std r11,GPR13(r1)
BEGIN_FTR_SECTION
ld r10,EX_CFAR(r3)
std r10,ORIG_GPR3(r1)
END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
SAVE_8GPRS(14,r1)
SAVE_10GPRS(22,r1)
lhz r12,PACA_TRAP_SAVE(r13)
std r12,_TRAP(r1)
addi r11,r1,INT_FRAME_SIZE
std r11,0(r1)
li r12,0
std r12,0(r11)
ld r2,PACATOC(r13)
ld r11,exception_marker@toc(r2)
std r12,RESULT(r1)
std r11,STACK_FRAME_OVERHEAD-16(r1)
1: addi r3,r1,STACK_FRAME_OVERHEAD
bl kernel_bad_stack
b 1b