Commit graph

4630 commits

Author SHA1 Message Date
Linus Torvalds d8312a3f61 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:
   - VHE optimizations

   - EL2 address space randomization

   - speculative execution mitigations ("variant 3a", aka execution past
     invalid privilege register access)

   - bugfixes and cleanups

  PPC:
   - improvements for the radix page fault handler for HV KVM on POWER9

  s390:
   - more kvm stat counters

   - virtio gpu plumbing

   - documentation

   - facilities improvements

  x86:
   - support for VMware magic I/O port and pseudo-PMCs

   - AMD pause loop exiting

   - support for AMD core performance extensions

   - support for synchronous register access

   - expose nVMX capabilities to userspace

   - support for Hyper-V signaling via eventfd

   - use Enlightened VMCS when running on Hyper-V

   - allow userspace to disable MWAIT/HLT/PAUSE vmexits

   - usual roundup of optimizations and nested virtualization bugfixes

  Generic:
   - API selftest infrastructure (though the only tests are for x86 as
     of now)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
  kvm: x86: fix a prototype warning
  kvm: selftests: add sync_regs_test
  kvm: selftests: add API testing infrastructure
  kvm: x86: fix a compile warning
  KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
  KVM: X86: Introduce handle_ud()
  KVM: vmx: unify adjacent #ifdefs
  x86: kvm: hide the unused 'cpu' variable
  KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
  Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
  kvm: Add emulation for movups/movupd
  KVM: VMX: raise internal error for exception during invalid protected mode state
  KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
  KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
  KVM: x86: Fix misleading comments on handling pending exceptions
  KVM: x86: Rename interrupt.pending to interrupt.injected
  KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
  x86/kvm: use Enlightened VMCS when running on Hyper-V
  x86/hyper-v: detect nested features
  x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
  ...
2018-04-09 11:42:31 -07:00
Peng Hao e01bca2fc6 kvm: x86: fix a prototype warning
Make the function static to avoid a

    warning: no previous prototype for ‘vmx_enable_tdp’

Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-06 18:20:31 +02:00
Peng Hao 3140c156e9 kvm: x86: fix a compile warning
Fix a "warning: no previous prototype".

Cc: stable@vger.kernel.org
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 19:10:29 +02:00
Wanpeng Li 6c86eedc20 KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
There is no easy way to force KVM to run an instruction through the emulator
(by design, as that would expose the x86 emulator as a significant attack surface).
However, we do wish to expose the x86 emulator when we are testing it
(e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix"
that is designed to raise #UD, which KVM traps; its #UD exit handler then matches
the "force emulation prefix" and runs the instruction after the prefix through the
x86 emulator. To avoid exposing the x86 emulator by default, we add a module
parameter that is off by default.

A simple testcase here:

    #include <stdio.h>
    #include <string.h>

    #define HYPERVISOR_INFO 0x40000000

    #define CPUID(idx, eax, ebx, ecx, edx) \
        asm volatile (\
        "ud2a; .ascii \"kvm\"; cpuid" \
        :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
            :"1"(idx) );    /* the leaf goes in eax, i.e. operand 1 */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char string[13];

        /* ud2a + "kvm" is the force emulation prefix; cpuid is then emulated */
        CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
        *(unsigned int *)(string + 0) = ebx;
        *(unsigned int *)(string + 4) = ecx;
        *(unsigned int *)(string + 8) = edx;

        string[12] = 0;
        if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
            printf("kvm guest\n");
        else
            printf("bare hardware\n");
        return 0;
    }

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Correctly handle usermode exits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 19:09:40 +02:00
Wanpeng Li 082d06edab KVM: X86: Introduce handle_ud()
Introduce handle_ud() to handle invalid opcodes; this function will be
used by later patches.

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 19:03:58 +02:00
Paolo Bonzini 4fde8d57cf KVM: vmx: unify adjacent #ifdefs
vmx_save_host_state has multiple ifdefs for CONFIG_X86_64 that have
no other code between them.  Simplify by reducing them to a single
conditional.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 18:58:59 +02:00
Arnd Bergmann 51e8a8cc2f x86: kvm: hide the unused 'cpu' variable
The local variable was newly introduced but is only accessed in one
place on x86_64, but not on 32-bit:

arch/x86/kvm/vmx.c: In function 'vmx_save_host_state':
arch/x86/kvm/vmx.c:2175:6: error: unused variable 'cpu' [-Werror=unused-variable]

This puts it into another #ifdef.

Fixes: 35060ed6a1 ("x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASE")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 18:57:40 +02:00
Sean Christopherson c75d0edc8e KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Remove the WARN_ON in handle_ept_misconfig() as it is unnecessary
and causes false positives.  Return the unmodified result of
kvm_mmu_page_fault() instead of converting a system error code to
KVM_EXIT_UNKNOWN so that userspace sees the error code of the
actual failure, not a generic "we don't know what went wrong".

  * kvm_mmu_page_fault() will WARN if reserved bits are set in the
    SPTEs, i.e. it covers the case where an EPT misconfig occurred
    because of a KVM bug.

  * The WARN_ON will fire on any system error code that is hit while
    handling the fault, e.g. -ENOMEM from mmu_topup_memory_caches()
    while handling a legitimate MMIO EPT misconfig or -EFAULT from
    kvm_handle_bad_page() if the corresponding HVA is invalid.  In
    either case, userspace should receive the original error code
    and firing a warning is incorrect behavior as KVM is operating
    as designed.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 18:00:40 +02:00
Sean Christopherson 2c151b2544 Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
The bug that led to commit 95e057e258
was a benign warning (no adverse effects other than the warning
itself) that was detected by syzkaller.  Further inspection shows
that the WARN_ON in question, in handle_ept_misconfig(), is
unnecessary and flawed (this was also briefly discussed in the
original patch: https://patchwork.kernel.org/patch/10204649).

  * The WARN_ON is unnecessary as kvm_mmu_page_fault() will WARN
    if reserved bits are set in the SPTEs, i.e. it covers the case
    where an EPT misconfig occurred because of a KVM bug.

  * The WARN_ON is flawed because it will fire on any system error
    code that is hit while handling the fault, e.g. -ENOMEM can be
    returned by mmu_topup_memory_caches() while handling a legitimate
    MMIO EPT misconfig.

The original behavior of returning -EFAULT when userspace munmaps
an HVA without first removing the memslot is correct and desirable,
i.e. KVM is letting userspace know it has generated a bad address.
Returning RET_PF_EMULATE masks the WARN_ON in the EPT misconfig path,
but does not fix the underlying bug, i.e. the WARN_ON is bogus.

Furthermore, returning RET_PF_EMULATE has the unwanted side effect of
causing KVM to attempt to emulate an instruction on any page fault
with an invalid HVA translation, e.g. a not-present EPT violation
on a VM_PFNMAP VMA whose fault handler failed to insert a PFN.

  * There is no guarantee that the fault is directly related to the
    instruction, i.e. the fault could have been triggered by a side
    effect memory access in the guest, e.g. while vectoring a #DB or
    writing a tracing record.  This could cause KVM to effectively
    mask the fault if KVM doesn't model the behavior leading to the
    fault, i.e. emulation could succeed and resume the guest.

  * If emulation does fail, KVM will return EMULATION_FAILED instead
    of -EFAULT, which is a red herring as the user will either debug
    a bogus emulation attempt or scratch their head wondering why we
    were attempting emulation in the first place.

TL;DR: revert to returning -EFAULT and remove the bogus WARN_ON in
handle_ept_misconfig in a future patch.

This reverts commit 95e057e258.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 18:00:36 +02:00
Stefan Fritsch 29916968c4 kvm: Add emulation for movups/movupd
This is very similar to the aligned versions movaps/movapd.

We have seen the corresponding emulation failures with OpenBSD as guest
and with Windows 10 with Intel HD graphics passthrough.

Signed-off-by: Christian Ehrhardt <christian_ehrhardt@genua.de>
Signed-off-by: Stefan Fritsch <sf@sfritsch.de>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04 17:52:46 +02:00
Sean Christopherson add5ff7a21 KVM: VMX: raise internal error for exception during invalid protected mode state
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter
an exception in Protected Mode while emulating guest due to invalid
guest state.  Unlike Big RM, KVM doesn't support emulating exceptions
in PM, i.e. PM exceptions are always injected via the VMCS.  Because
we will never do VMRESUME due to emulation_required, the exception is
never realized and we'll keep emulating the faulting instruction over
and over until we receive a signal.

Exit to userspace iff there is a pending exception, i.e. don't exit
simply on a requested event. The purpose of this check and exit is to
aid in debugging a guest that is in all likelihood already doomed.
Invalid guest state in PM is extremely limited in normal operation,
e.g. it generally only occurs for a few instructions early in BIOS,
and any exception at this time is all but guaranteed to be fatal.
Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly
handled/emulated, while checking for vectored interrupts, e.g. INTR
and NMI, without hitting false positives would add a fair amount of
complexity for almost no benefit (getting hit by lightning seems
more likely than encountering this specific scenario).

Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an
exception via the VMCS and emulation_required is true.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-04-04 17:51:55 +02:00
Linus Torvalds 986b37c0ae Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups and msr updates from Ingo Molnar:
 "The main change is a performance/latency improvement to /dev/msr
  access. The rest are misc cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msr: Make rdmsrl_safe_on_cpu() scheduling safe as well
  x86/cpuid: Allow cpuid_read() to schedule
  x86/msr: Allow rdmsr_safe_on_cpu() to schedule
  x86/rtc: Stop using deprecated functions
  x86/dumpstack: Unify show_regs()
  x86/fault: Do not print IP in show_fault_oops()
  x86/MSR: Move native_* variants to msr.h
2018-04-02 15:16:43 -07:00
Linus Torvalds 72573481eb KVM fixes for v4.16-rc8

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "PPC:
   - Fix a bug causing occasional machine check exceptions on POWER8
     hosts (introduced in 4.16-rc1)

  x86:
   - Fix a guest crashing regression with nested VMX and restricted
     guest (introduced in 4.16-rc1)

   - Fix dependency check for pv tlb flush (the wrong dependency that
     effectively disabled the feature was added in 4.16-rc4, the
     original feature in 4.16-rc1, so it got decent testing)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Fix pv tlb flush dependencies
  KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
  KVM: PPC: Book3S HV: Fix duplication of host SLB entries
2018-03-30 07:24:14 -10:00
Liran Alon f497b6c25d KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
When a vCPU runs L2 and there is a pending event that requires an exit
from L2 to L1 and nested_run_pending=1, vcpu_enter_guest() will request
an immediate-exit from L2 (see req_immediate_exit).

Since now handling of req_immediate_exit also makes sure to set
KVM_REQ_EVENT, there is no need to also set it on vmx_vcpu_run() when
nested_run_pending=1.

This optimizes cases where VMRESUME was executed by L1 to enter L2 and
there are no pending events that require an exit from L2 to L1. Previously,
this would have set KVM_REQ_EVENT unnecessarily.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon 1a680e355c KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
When L2 exits to L0 during event delivery, VMCS02 is filled with
IDT-vectoring-info, which vmx_complete_interrupts() makes sure to
reinject before the next resume of L2.

While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
to the L1 vCPU which currently runs L2 and exited to L0.

When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
it notes that a previous event was re-injected to L2 (by
IDT-vectoring-info) and therefore does not check whether there are pending
L1 events which require an exit from L2 to L1. Thus, L0 enters L2 without
an immediate VMExit even though there are pending L1 events!

This commit fixes the issue by making sure to check for L1 pending
events even if a previous event was reinjected to L2 and bailing out
from inject_pending_event() before evaluating a new pending event in
case an event was already reinjected.

The bug was observed with the following setup:
* L0 is a 64CPU machine which runs KVM.
* L1 is a 16CPU machine which runs KVM.
* L0 & L1 run with APICv disabled.
(Also reproduced with APICv enabled, but the info below is easier to
analyze with APICv disabled.)
* L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely and analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
that it has sent to another L1 vCPU, while all other L1 vCPUs are
currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel-preemption disabled).

Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15

Which can be analyzed as follows:
(1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
a pending-interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
(3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(), which
notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without immediate-exit even though there is a pending
L1 event (The IPI pending in LAPIC's IRR).

We have already reached the buggy scenario, but the events can be
further analyzed:
(5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon a042c26fd8 KVM: x86: Fix misleading comments on handling pending exceptions
The reason that exception.pending should block re-injection of
NMI/interrupt is not described correctly in comment in code.
Instead, it describes why a pending exception should be injected
before a pending NMI/interrupt.

Therefore, move currently present comment to code-block evaluating
a new pending event which explains why exception.pending is evaluated
first.
In addition, create a new comment describing that exception.pending
blocks re-injection of NMI/interrupt because the exception was
queued by handling vmexit which was due to NMI/interrupt delivery.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
[Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon 04140b4144 KVM: x86: Rename interrupt.pending to interrupt.injected
For exception & NMI events, KVM code uses the following
coding convention:
*) "pending" represents an event that should be injected to the guest at
some point but whose side-effects have not yet occurred.
*) "injected" represents an event whose side-effects have already
occurred.

However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side-effects
have already taken place (for example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.

This change follows logic of previous commit 664f8e26b0 ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.

It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still is incorrect.
In this case, interrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or to dispatch a new interrupt
from QEMU's emulated LAPIC, which resets the bit in IRR and sets the bit
in ISR before sending the ioctl to KVM. So it seems that "interrupt.pending"
in this case is indeed also supposed to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
misuse (now named) interrupt.injected in order to report whether
there is a pending interrupt.
This leads to nVMX/nSVM not being able to distinguish whether it should
exit from L2 to L1 on EXTERNAL_INTERRUPT on a pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.

This patch introduces no semantic change.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon 7c5a6a5970 KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
kvm_inject_realmode_interrupt() is called from one of the injection
functions which writes event-injection to VMCS: vmx_queue_exception(),
vmx_inject_irq() and vmx_inject_nmi().

All these functions are called just to cause an event-injection to
guest. They are not responsible for manipulating the event-pending
flag. The only purpose of kvm_inject_realmode_interrupt() should be
to emulate real-mode interrupt-injection.

This was also incorrect when called from vmx_queue_exception().

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov 773e8a0425 x86/kvm: use Enlightened VMCS when running on Hyper-V
Enlightened VMCS is just a structure in memory. The main benefit, besides
avoiding the somewhat slower VMREAD/VMWRITE, is the clean field
mask: we tell the underlying hypervisor which fields were modified
since VMEXIT so there's no need to inspect them all.
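
The clean field idea can be sketched with a toy model. This is not the
Hyper-V structure layout: the group names, struct toy_evmcs, toy_write()
and toy_sync() are all hypothetical, and letting the consumer mark the
groups clean again is a simplification made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field groups; the real layout is defined by the Hyper-V TLFS. */
enum { GRP_GUEST_RIP = 0, GRP_CONTROL = 1, GRP_HOST = 2, GRP_COUNT = 3 };
#define CLEAN_ALL ((1u << GRP_COUNT) - 1)

struct toy_evmcs {
    uint32_t clean_fields;       /* bit set = group unchanged since last sync */
    uint64_t group[GRP_COUNT];   /* one value stands in for a group of fields */
};

/* Writer side: a plain memory store plus clearing the group's clean bit. */
static void toy_write(struct toy_evmcs *v, unsigned int grp, uint64_t val)
{
    assert(grp < GRP_COUNT);
    v->group[grp] = val;
    v->clean_fields &= ~(1u << grp);
}

/* Consumer side: copy only the dirty groups; returns how many were copied. */
static unsigned int toy_sync(struct toy_evmcs *v, uint64_t *cache)
{
    unsigned int copied = 0;

    for (unsigned int g = 0; g < GRP_COUNT; g++) {
        if (v->clean_fields & (1u << g))
            continue;            /* clean: no need to inspect this group */
        cache[g] = v->group[g];
        copied++;
    }
    v->clean_fields = CLEAN_ALL; /* simplification: everything is clean now */
    return copied;
}
```

In this model a write touches only memory, and a sync after a single
write copies one group instead of all of them.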

Tight CPUID loop test shows significant speedup:
Before: 18890 cycles
After: 8304 cycles

Static key is being used to avoid performance penalty for non-Hyper-V
deployments.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Ladi Prosek d4abc577bb x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGE
The assist page has been used only for the paravirtual EOI so far, hence
the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's
called "Virtual VP Assist MSR".

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Babu Moger 8566ac8b8e KVM: SVM: Implement pause loop exit logic in SVM
Bring the PLE (pause loop exit) logic to the AMD SVM driver.

While testing, we found this helps in situations where numerous
pauses are generated. Without these patches we could see continuous
VMEXITS due to pause interceptions. Tested on an AMD EPYC server with
boot parameter idle=poll on a VM with 32 vcpus to simulate extensive
pause behaviour. Here are the VMEXITS in a 10-second interval.

Pauses                  810199                  504
Total                   882184                  325415

Signed-off-by: Babu Moger <babu.moger@amd.com>
[Prevented the window from dropping below the initial value. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Babu Moger 1d8fb44a72 KVM: SVM: Add pause filter threshold
This patch adds the support for pause filtering threshold. This feature
support is indicated by CPUID Fn8000_000A_EDX. See AMD APM Vol 2 Section
15.14.4 Pause Intercept Filtering for more details.

In this mode, a 16-bit pause filter threshold field is added in VMCB.
The threshold value is a cycle count that is used to reset the pause
counter.  As with simple pause filtering, VMRUN loads the pause count
value from VMCB into an internal counter. Then, on each pause instruction
the hardware checks the elapsed number of cycles since the most recent
pause instruction against the pause Filter Threshold. If the elapsed cycle
count is greater than the pause filter threshold, then the internal pause
count is reloaded from VMCB and execution continues. If the elapsed cycle
count is less than the pause filter threshold, then the internal pause
count is decremented. If the count value is less than zero and pause
intercept is enabled, a #VMEXIT is triggered. If advanced pause filtering
is supported and pause filter threshold field is set to zero, the filter
will operate in the simpler, count only mode.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Babu Moger c8e88717cf KVM: VMX: Bring the common code to header file
This patch brings some of the code from vmx to the x86.h header file. Now, we
can share this code between vmx and svm. Modified a couple of functions to
make them common.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Babu Moger 18abdc3425 KVM: VMX: Remove ple_window_actual_max
Get rid of ple_window_actual_max, because its benefits are really
minuscule and the logic is complicated.

The overflows (and underflows) are controlled in __ple_window_grow
and _ple_window_shrink respectively.
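
As a rough illustration of clamping in the grow and shrink helpers
instead of tracking a separate "actual max", here is a minimal sketch.
toy_grow_window() and toy_shrink_window() are hypothetical stand-ins,
not the kernel functions, and the multiply/divide policy is an
assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Grow by a factor, clamping to 'max' so the 32-bit result cannot wrap. */
static unsigned int toy_grow_window(unsigned int val, unsigned int factor,
                                    unsigned int max)
{
    uint64_t wide;

    assert(factor >= 1);            /* a factor of 0 would collapse the window */
    wide = (uint64_t)val * factor;  /* 32x32 bits always fits in 64 bits */
    return wide > max ? max : (unsigned int)wide;
}

/* Shrink by a divisor, never dropping below 'min'. */
static unsigned int toy_shrink_window(unsigned int val, unsigned int divisor,
                                      unsigned int min)
{
    unsigned int next;

    if (divisor == 0)
        return min;                 /* assumption: 0 means reset to the floor */
    next = val / divisor;
    return next < min ? min : next;
}
```

Doing the multiplication in 64 bits is what makes a maximum of UINT_MAX
safe: the comparison happens before the value is narrowed again.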

Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
[Fixed potential wraparound and change the max to UINT_MAX. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Babu Moger 7fbc85a5fb KVM: VMX: Fix the module parameters for vmx
The vmx module parameters are supposed to be unsigned variants.

Also fixed the checkpatch errors like the one below.

WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
+module_param(ple_gap, uint, S_IRUGO);

Signed-off-by: Babu Moger <babu.moger@amd.com>
[Expanded uint to unsigned int in code. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Andi Kleen dd60d21706 KVM: x86: Fix perf timer mode IP reporting
KVM and perf have a special backdoor mechanism to report the IP for interrupts
re-executed after vm exit. This works for the NMIs that perf normally uses.

However when perf is in timer mode it doesn't work because the timer interrupt
doesn't get this special treatment. This is common when KVM is running
nested in another hypervisor which may not implement the PMU, so only
timer mode is available.

Call the functions to set up the backdoor IP also for non NMI interrupts.

I renamed the functions to set up the backdoor IP reporting to be more
appropriate for their new use.  The SVM change is only compile tested.

v2: Moved the functions inline.
For the normal interrupt case the before/after functions are now
called from x86.c, not arch specific code.
For the NMI case we still need to call it in the architecture
specific code, because it's already needed in the low level *_run
functions.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
[Removed unnecessary calls from arch handle_external_intr. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 16:12:59 +02:00
Dan Carpenter d32ef547fd kvm: x86: hyperv: delete dead code in kvm_hv_hypercall()
"rep_done" is always zero so the "(((u64)rep_done & 0xfff) << 32)"
expression is just zero.  We can remove the "res" temporary variable as
well and just use "ret" directly.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 20:11:01 +01:00
Sean Christopherson 81811c162d KVM: SVM: add struct kvm_svm to hold SVM specific KVM vars
Add struct kvm_svm, which is analogous to struct vcpu_svm, along with
a helper to_kvm_svm() to retrieve kvm_svm from a struct kvm *.  Move
the SVM specific variables and struct definitions out of kvm_arch
and into kvm_svm.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:32:19 +01:00
Sean Christopherson 40bbb9d03f KVM: VMX: add struct kvm_vmx to hold VMX specific KVM vars
Add struct kvm_vmx, which wraps struct kvm, and a helper to_kvm_vmx()
that retrieves 'struct kvm_vmx *' from 'struct kvm *'.  Move the VMX
specific variables out of kvm_arch and into kvm_vmx.
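
The wrapping pattern can be sketched in userspace as follows. The toy_
names are hypothetical and the structs are reduced to one field each;
only the general container_of technique is shown, not the actual kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic VM structure. */
struct toy_kvm {
    int nr_vcpus;
};

/* Vendor wrapper: embeds the generic struct and adds vendor-only state. */
struct toy_kvm_vmx {
    struct toy_kvm kvm;
    unsigned long ept_identity_map_addr;
};

/* Userspace equivalent of container_of, built on offsetof. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the wrapper from a pointer to the embedded generic struct. */
static struct toy_kvm_vmx *to_toy_kvm_vmx(struct toy_kvm *kvm)
{
    assert(kvm != NULL);
    return toy_container_of(kvm, struct toy_kvm_vmx, kvm);
}
```

Generic code keeps passing around the embedded struct pointer, and
vendor code converts it back only where it needs its private fields.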

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:32:03 +01:00
Sean Christopherson 2ac52ab861 KVM: x86: move setting of ept_identity_map_addr to vmx.c
Add kvm_x86_ops->set_identity_map_addr and set ept_identity_map_addr
in VMX specific code so that ept_identity_map_addr can be moved out
of 'struct kvm_arch' in a future patch.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:30:47 +01:00
Sean Christopherson 434a1e9446 KVM: x86: define SVM/VMX specific kvm_arch_[alloc|free]_vm
Define kvm_arch_[alloc|free]_vm in x86 as pass through functions
to new kvm_x86_ops vm_alloc and vm_free, and move the current
allocation logic as-is to SVM and VMX.  Vendor specific alloc/free
functions set the stage for SVM/VMX wrappers of 'struct kvm',
which will allow us to move the growing number of SVM/VMX specific
member variables out of 'struct kvm_arch'.
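
A minimal sketch of the pass-through arrangement, assuming a cut-down ops table with only the two new hooks. The struct name kvm_x86_ops_sketch, the vendor_* functions, and the use of calloc()/free() are illustrative stand-ins, not the kernel's allocation logic.

```c
#include <assert.h>
#include <stdlib.h>

struct kvm {
    int placeholder;    /* stand-in for the real contents */
};

/* Cut-down ops table holding only the two hooks discussed here. */
struct kvm_x86_ops_sketch {
    struct kvm *(*vm_alloc)(void);
    void (*vm_free)(struct kvm *kvm);
};

/* Vendor side: would allocate the SVM or VMX wrapper in practice. */
static inline struct kvm *vendor_vm_alloc(void)
{
    return calloc(1, sizeof(struct kvm));
}

static inline void vendor_vm_free(struct kvm *kvm)
{
    free(kvm);
}

static const struct kvm_x86_ops_sketch vendor_ops = {
    .vm_alloc = vendor_vm_alloc,
    .vm_free  = vendor_vm_free,
};

static const struct kvm_x86_ops_sketch *kvm_x86_ops = &vendor_ops;

/* Arch side: pure pass-through to whichever vendor module is loaded. */
static inline struct kvm *kvm_arch_alloc_vm(void)
{
    assert(kvm_x86_ops && kvm_x86_ops->vm_alloc);
    return kvm_x86_ops->vm_alloc();
}

static inline void kvm_arch_free_vm(struct kvm *kvm)
{
    kvm_x86_ops->vm_free(kvm);
}
```

Since the vendor code owns the allocation, it is free to allocate a larger wrapper structure later without the arch code noticing.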

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:30:44 +01:00
Sean Christopherson 9d1887ef32 KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
Segment registers must be synchronized prior to any code that may
trigger a call to emulation_required()/guest_state_valid(), e.g.
vmx_set_cr0().  Because preparing vmcs02 writes segmentation fields
directly, i.e. doesn't use vmx_set_segment(), emulation_required
will not be re-evaluated when synchronizing the segment registers,
which can result in L0 incorrectly starting emulation of L2.

Fixes: 8665c3f973 ("KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Move all of prepare_vmcs02_full earlier, not just segment registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:26:21 +01:00
Paolo Bonzini 3184a995f7 KVM: nVMX: fix vmentry failure code when L2 state would require emulation
Commit 2bb8cafea8 ("KVM: vVMX: signal failure for nested VMEntry if
emulation_required", 2018-03-12) introduces a new error path which does
not set *entry_failure_code.  Fix that to avoid a leak of L0 stack to L1.

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-21 14:20:33 +01:00
Liran Alon e40ff1d660 KVM: nVMX: Do not load EOI-exitmap while running L2
When the L1 IOAPIC redirection table is written, a
KVM_REQ_SCAN_IOAPIC request is set on all vCPUs, so that each
vCPU recalculates its IOAPIC-handled vectors and loads them
into its EOI-exitmap.

However, it could be that one of the vCPUs is currently running
L2. In this case, load_eoi_exitmap() will be called which would
write to vmcs02->eoi_exit_bitmap, which is wrong because
vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, at this point
KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will
never update vmcs01->eoi_exit_bitmap. This could lead to remote_irr
of some IOAPIC level-triggered entry remaining set forever.

Fix this issue by delaying the load of EOI-exitmap to when vCPU
is running L1.

One may wonder why not just delay the entire KVM_REQ_SCAN_IOAPIC
processing until the vCPU is running L1. The reason is to correctly
handle the case where the LAPIC & IO-APIC of L1 are passed through to
L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In
the current nVMX implementation, that results in
vmcs02->virtual_interrupt_delivery also being 0. Thus,
vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI causes
a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or
APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page).
In order for such an L2 EOI to be broadcast, if needed, from LAPIC
to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
while L2 is running. Therefore, this patch delays only the
loading of the EOI-exitmap but not the update of
vcpu->arch.ioapic_handled_vectors.

Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-21 14:16:44 +01:00
Linus Torvalds 32d43cd391 kvm/x86: fix icebp instruction handling
The undocumented 'icebp' instruction (aka 'int1') works pretty much like
'int3' in the absence of in-circuit probing equipment (except,
obviously, that it raises #DB instead of raising #BP), and is used by
some validation test-suites as such.

But Andy Lutomirski noticed that his test suite acted differently in kvm
than on bare hardware.

The reason is that kvm used an inexact test for the icebp instruction:
it just assumed that an all-zero VM exit qualification value meant that
the VM exit was due to icebp.

That is not unlike the guess that do_debug() does for the actual
exception handling case, but it's purely a heuristic, not an absolute
rule.  do_debug() does it because it wants to ascribe _some_ reasons to
the #DB that happened, and an empty %dr6 value means that 'icebp' is the
most likely cause and we have no better information.

But kvm can just do it right, because unlike the do_debug() case, kvm
actually sees the real reason for the #DB in the VM-exit interruption
information field.

So instead of relying on an inexact heuristic, just use the actual VM
exit information that says "it was 'icebp'".
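
The exact test can be sketched roughly as below. The bit layout follows the Intel SDM description of the VM-exit interruption information field (vector in bits 7:0, type in bits 10:8, valid in bit 31, type 5 for a privileged software exception). The macro and function names are modeled on kernel conventions but should be read as illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INTR_INFO_VECTOR_MASK        0x000000ffu
#define INTR_INFO_TYPE_MASK          0x00000700u
#define INTR_INFO_VALID_MASK         0x80000000u

#define INTR_TYPE_HARD_EXCEPTION     (3u << 8)  /* ordinary hardware exception */
#define INTR_TYPE_PRIV_SW_EXCEPTION  (5u << 8)  /* privileged software exception */

#define DB_VECTOR                    1u

/* True only when the exit was reported as a privileged software exception. */
static inline bool is_icebp(uint32_t intr_info)
{
    return (intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_TYPE_MASK)) ==
           (INTR_INFO_VALID_MASK | INTR_TYPE_PRIV_SW_EXCEPTION);
}

/* Quick self-check: same #DB vector, different reported type. */
static inline void icebp_selfcheck(void)
{
    assert(is_icebp(INTR_INFO_VALID_MASK | INTR_TYPE_PRIV_SW_EXCEPTION | DB_VECTOR));
    assert(!is_icebp(INTR_INFO_VALID_MASK | INTR_TYPE_HARD_EXCEPTION | DB_VECTOR));
}
```

Both cases carry the #DB vector, so only the type field distinguishes 'icebp' from a debug exception raised for other reasons.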

Right now the 'icebp' instruction isn't technically documented by Intel,
but that will hopefully change.  The special "privileged software
exception" information _is_ actually mentioned in the Intel SDM, even
though the cause of it isn't enumerated.

Reported-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-20 14:58:34 -07:00
Vitaly Kuznetsov 35060ed6a1 x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASE
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined and as we're past 'swapgs' MSR_GS_BASE
should contain kernel's GS base which we point to irq_stack_union.

Add a new kernelmode_gs_base() API; irq_stack_union needs to be exported
as KVM can be built as a module.

Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:03:54 +01:00
Vitaly Kuznetsov 42b933b597 x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->thread
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined. Read MSR_{FS,KERNEL_GS}_BASE from
current->thread after calling save_fsgs() which takes care of
X86_BUG_NULL_SEG case now and will do RD[FS,GS]BASE when FSGSBASE
extensions are exposed to userspace (currently they are not).

Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:03:53 +01:00
Wanpeng Li b31c114b82 KVM: X86: Provide a capability to disable PAUSE intercepts
Allow disabling pause loop exiting/pause filtering on a per-VM basis.

If some VMs have dedicated host CPUs, they won't be negatively affected
due to needlessly intercepted PAUSE instructions.

Thanks to Jan H. Schönherr's initial patch.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:03:53 +01:00
Wanpeng Li caa057a2ca KVM: X86: Provide a capability to disable HLT intercepts
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
This patch adds the per-VM capability to disable them.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:03:52 +01:00
Wanpeng Li 4d5422cea3 KVM: X86: Provide a capability to disable MWAIT intercepts
Allowing a guest to execute MWAIT without interception enables a guest
to put a (physical) CPU into a power saving state, where it takes
longer to return from than what may be desired by the host.

Don't give a guest that power over a host by default. (Especially,
since nothing prevents a guest from using MWAIT even when it is not
advertised via CPUID.)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:03:51 +01:00
Arbel Moshe 2d7921c499 KVM: x86: Add support for VMware backdoor Pseudo-PMCs
VMware exposes the following Pseudo PMCs:
0x10000: Physical host TSC
0x10001: Elapsed real time in ns
0x10002: Elapsed apparent time in ns

For more info refer to:
https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf

VMware allows access to these Pseudo-PMCs even when read via RDPMC
in Ring3 and CR4.PCE=0. Therefore, this commit modifies the x86 emulator
to allow access to these PMCs in this situation. In addition,
emulation of these PMCs was added to kvm_pmu_rdpmc().
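
The permission rule can be sketched as below. The three indices come from the list above; the enum and function names are hypothetical, and gating on a backdoor_enabled flag is an assumption. The fallback branch reflects the architectural rule that RDPMC outside ring 0 requires CR4.PCE=1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pseudo-PMC indices as listed in the commit message. */
enum vmware_pseudo_pmc {
    VMWARE_PMC_HOST_TSC      = 0x10000,   /* physical host TSC */
    VMWARE_PMC_REAL_TIME     = 0x10001,   /* elapsed real time in ns */
    VMWARE_PMC_APPARENT_TIME = 0x10002,   /* elapsed apparent time in ns */
};

static inline bool is_vmware_pseudo_pmc(uint32_t idx)
{
    switch (idx) {
    case VMWARE_PMC_HOST_TSC:
    case VMWARE_PMC_REAL_TIME:
    case VMWARE_PMC_APPARENT_TIME:
        return true;
    default:
        return false;
    }
}

/*
 * Decide whether RDPMC may proceed. Pseudo-PMCs bypass the usual
 * privilege check; everything else needs CPL 0 or CR4.PCE=1.
 */
static inline bool rdpmc_allowed(uint32_t idx, unsigned int cpl,
                                 bool cr4_pce, bool backdoor_enabled)
{
    assert(cpl <= 3);
    if (backdoor_enabled && is_vmware_pseudo_pmc(idx))
        return true;
    return cpl == 0 || cr4_pce;
}
```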

Signed-off-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:02:01 +01:00
Liran Alon 9718420e9f KVM: x86: SVM: Intercept #GP to support access to VMware backdoor ports
If the KVM enable_vmware_backdoor module parameter is set,
this commit changes SVM to intercept #GP instead of having it
delivered directly from the CPU to the guest.

It is done to support access to VMware Backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.

Note that the above change introduces a slight performance hit, as #GPs
are no longer delivered directly from the CPU to the guest but instead
cause a #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:43 +01:00
Liran Alon 9e86948041 KVM: x86: VMX: Intercept #GP to support access to VMware backdoor ports
If the KVM enable_vmware_backdoor module parameter is set,
this commit changes VMX to intercept #GP instead of having it
delivered directly from the CPU to the guest.

It is done to support access to VMware backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.

Note that the above change introduces a slight performance hit, as #GPs
are no longer delivered directly from the CPU to the guest but instead
cause a #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:42 +01:00
Liran Alon 04789b6664 KVM: x86: Emulate only IN/OUT instructions when accessing VMware backdoor
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS
instructions. These ports must be allowed access even if TSS I/O
permission bitmap doesn't allow it.

To handle this, VMX/SVM will be changed in future commits
to intercept #GP which was raised by such access and
handle it by calling x86 emulator to emulate instruction.
If it was one of these instructions, the x86 emulator already handles
it correctly (Since commit "KVM: x86: Always allow access to VMware
backdoor I/O ports") by not checking these ports against TSS I/O
permission bitmap.

One may wonder why checking for specific instructions is necessary
as we can just forward all #GPs to the x86 emulator.
There are multiple reasons for doing so:

1. We don't want the x86 emulator to be reached easily
by guest by just executing an instruction that raises #GP as that
exposes the x86 emulator as a bigger attack surface.

2. The x86 emulator is incomplete and therefore certain instructions
that can cause #GP cannot be emulated. Such an example is "INT x"
(opcode 0xcd) which reaches emulate_int() which can only emulate
the instruction if vCPU is in real-mode.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:42 +01:00
Liran Alon e236617120 KVM: x86: Add emulation_type to not raise #UD on emulation failure
The next commits are going to introduce support for accessing VMware
backdoor ports even though the guest's TSS I/O permission bitmap doesn't
allow access. This mimics VMware hypervisor behavior.

In order to support this, next commits will change VMX/SVM to
intercept #GP which was raised by such access and handle it by calling
the x86 emulator to emulate instruction. Since commit "KVM: x86:
Always allow access to VMware backdoor I/O ports", the x86 emulator
handles access to these I/O ports by not checking these ports against
the TSS I/O permission bitmap.

However, there could be cases where the CPU raises a #GP on an instruction
that fails to be disassembled by the x86 emulator (because of an
incomplete implementation, for example).

In those cases, we would like the #GP intercept to just forward #GP
as-is to guest as if there was no intercept to begin with.
However, current emulator code always queues #UD exception in case
emulator fails (including disassembly failures) which is not what is
wanted in this flow.

This commit addresses this issue by adding a new emulation_type flag
that will allow the #GP intercept handler to specify that it wishes
to be aware when instruction emulation fails and doesn't want #UD
exception to be queued.
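
The intended control flow can be sketched as follows. The flag name, its bit position, and the vcpu_sketch structure are illustrative assumptions; the real emulator entry point takes different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum emul_result { EMULATE_DONE, EMULATE_FAIL };

/* Hypothetical flag name and bit position, chosen for illustration. */
#define EMULTYPE_NO_UD_ON_FAIL (1u << 4)

struct vcpu_sketch {
    bool ud_queued;    /* a #UD was queued for the guest */
    bool gp_queued;    /* a #GP was queued for the guest */
};

/* Stand-in for the emulator; can_emulate models decode/emulation success. */
static inline enum emul_result emulate_instruction(struct vcpu_sketch *vcpu,
                                                   unsigned int emulation_type,
                                                   bool can_emulate)
{
    assert(vcpu != NULL);
    if (can_emulate)
        return EMULATE_DONE;
    /* Default behavior: failure queues #UD unless the caller opted out. */
    if (!(emulation_type & EMULTYPE_NO_UD_ON_FAIL))
        vcpu->ud_queued = true;
    return EMULATE_FAIL;
}

/* #GP intercept: on failure, forward the original #GP instead of a #UD. */
static inline void handle_gp_intercept(struct vcpu_sketch *vcpu, bool can_emulate)
{
    if (emulate_instruction(vcpu, EMULTYPE_NO_UD_ON_FAIL, can_emulate) ==
        EMULATE_FAIL)
        vcpu->gp_queued = true;
}
```

Callers that do not pass the flag keep the existing behavior, so the change is opt-in for the #GP intercept path only.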

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:41 +01:00
Liran Alon 9a29d449e3 KVM: x86: Always allow access to VMware backdoor I/O ports
VMware allows access to these ports even if denied
by TSS I/O permission bitmap. Mimic behavior.
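
A rough sketch of the resulting permission check is shown below. The port numbers 0x5658 and 0x5659 are the commonly documented VMware backdoor ports and are not stated in this message. The function names, the backdoor_enabled parameter, and the flat bitmap layout are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Commonly documented VMware backdoor ports (not named in this commit). */
#define VMWARE_BACKDOOR_PORT     0x5658
#define VMWARE_BACKDOOR_HB_PORT  0x5659

static inline bool is_vmware_backdoor_port(uint16_t port)
{
    return port == VMWARE_BACKDOOR_PORT || port == VMWARE_BACKDOOR_HB_PORT;
}

/*
 * TSS-style check: access is allowed only if every bit covering
 * [port, port + len) is clear. The bitmap must hold at least 8193 bytes
 * so that an access starting near port 0xffff stays in bounds.
 */
static inline bool tss_bitmap_allows(const uint8_t *bitmap, uint16_t port,
                                     unsigned int len)
{
    assert(bitmap != 0);
    for (unsigned int i = 0; i < len; i++) {
        unsigned int p = (unsigned int)port + i;

        if (bitmap[p / 8] & (1u << (p % 8)))
            return false;
    }
    return true;
}

/* Backdoor ports skip the bitmap check; all other ports go through it. */
static inline bool io_access_allowed(const uint8_t *bitmap, uint16_t port,
                                     unsigned int len, bool backdoor_enabled)
{
    if (backdoor_enabled && is_vmware_backdoor_port(port))
        return true;
    return tss_bitmap_allows(bitmap, port, len);
}
```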

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:40 +01:00
Liran Alon c4ae60e4bb KVM: x86: Add module parameter for supporting VMware backdoor
Supporting access to the VMware backdoor requires KVM to intercept #GP
exceptions from the guest, which introduces a slight performance hit.
Therefore, control this support with a module parameter.

Note that module parameter is exported as it should be consumed by
kvm_intel & kvm_amd to determine if they should intercept #GP or not.

This commit doesn't change semantics.
It is done as a preparation for future commits.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:40 +01:00
Sean Christopherson dca7f1284f KVM: x86: add kvm_fast_pio() to consolidate fast PIO code
Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM.
Unexport kvm_fast_pio_in() and kvm_fast_pio_out().

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:39 +01:00
Sean Christopherson 432baf60ee KVM: VMX: use kvm_fast_pio_in for handling IN I/O
Fast emulation of processor I/O for IN was disabled on x86 (both VMX
and SVM) some years ago due to a buggy implementation.  The addition
of kvm_fast_pio_in(), used by SVM, re-introduced (functional!) fast
emulation of IN.  Piggyback SVM's work and use kvm_fast_pio_in() on
VMX instead of performing full emulation of IN.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:38 +01:00
Sean Christopherson 2bb8cafea8 KVM: vVMX: signal failure for nested VMEntry if emulation_required
Fail a nested VMEntry with EXIT_REASON_INVALID_STATE if L2 guest state
is invalid, i.e. vmcs12 contained invalid guest state, and unrestricted
guest is disabled in L0 (and by extension disabled in L1).

WARN_ON_ONCE in handle_invalid_guest_state() if we're attempting to
emulate L2, i.e. nested_run_pending is true, to aid debug in the
(hopefully unlikely) scenario that we somehow skip the nested VMEntry
consistency check, e.g. due to a L0 bug.

Note: KVM relies on hardware to detect the scenario where unrestricted
guest is enabled in L0 but disabled in L1 and vmcs12 contains invalid
guest state, i.e. checking emulation_required in prepare_vmcs02 is
required only to handle the case where unrestricted guest is disabled
in L0 since L0 never actually attempts VMLAUNCH/VMRESUME with vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 22:01:38 +01:00