1
0
Fork 0
Commit Graph

4266 Commits (e911eb3b3414e62cbd9853e0a91c124f4a545c0f)

Author SHA1 Message Date
Yu Zhang e911eb3b34 KVM: x86: Add return value to kvm_cpuid().
Return false in kvm_cpuid() when it fails to find the cpuid
entry. Also, this routine(and its caller) is optimized with
a new argument - check_limit, so that the check_cpuid_limit()
fall back can be avoided.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:15 +02:00
Paolo Bonzini 3db134805c kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS
A guest may not be configured to support XSAVES/XRSTORS, even when the host
does. If the guest does not support XSAVES/XRSTORS, clear the secondary
execution control so that the processor will raise #UD.

Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
the VMCS02.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 18:09:13 +02:00
Jim Mattson 75f4fc8da9 kvm: vmx: Raise #UD on unsupported RDSEED
A guest may not be configured to support RDSEED, even when the host
does. If the guest does not support RDSEED, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 15:35:46 +02:00
Jim Mattson 45ec368c9a kvm: vmx: Raise #UD on unsupported RDRAND
A guest may not be configured to support RDRAND, even when the host
does. If the guest does not support RDRAND, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 15:35:37 +02:00
Paolo Bonzini 80154d77c9 KVM: VMX: cache secondary exec controls
Currently, secondary execution controls are divided in three groups:

- static, depending mostly on the module arguments or the processor
  (vmx_secondary_exec_control)

- static, depending on CPUID (vmx_cpuid_update)

- dynamic, depending on nested VMX or local APIC state

Because walking CPUID is expensive, prepare_vmcs02 is using only
the first group.  This however is unnecessarily complicated.  Just
cache the static secondary execution controls, and then prepare_vmcs02
does not need to compute them every time.  Computation of all static
secondary execution controls is now kept in a single function,
vmx_compute_secondary_exec_control.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24 15:35:14 +02:00
Janakarajan Natarajan 640bd6e575 KVM: SVM: Enable Virtual GIF feature
Enable the Virtual GIF feature. This is done by setting bit 25 at position
60h in the vmcb.

With this feature enabled, the processor uses bit 9 at position 60h as the
virtual GIF when executing STGI/CLGI instructions.

Since the execution of STGI by the L1 hypervisor does not cause a return to
the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window
are modified.

The IRQ window will be opened even if GIF is not set, under the assumption
that on resuming the L1 hypervisor the IRQ will be held pending until the
processor executes the STGI instruction.

For the NMI window, the STGI intercept is set. This will assist in opening
the window only when GIF=1.

Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-23 18:37:37 +02:00
David Hildenbrand 42aa53b4e1 KVM: VMX: always require WB memory type for EPT
We already always set that type but don't check if it is supported. Also
for nVMX, we only support WB for now. Let's just require it.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 17:38:01 +02:00
David Hildenbrand bb97a01693 KVM: VMX: cleanup EPTP definitions
Don't use shifts, tag them correctly as EPTP and use better matching
names (PWL vs. GAW).

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 17:38:01 +02:00
Denys Vlasenko 3f0d4db757 KVM: SVM: delete avic_vm_id_bitmap (2 megabyte static array)
With lightly tweaked defconfig:

    text    data     bss      dec     hex filename
11259661 5109408 2981888 19350957 12745ad vmlinux.before
11259661 5109408  884736 17253805 10745ad vmlinux.after

Only compile-tested.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: pbonzini@redhat.com
Cc: rkrcmar@redhat.com
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:50 +02:00
Paolo Bonzini 9034e6e895 KVM: x86: fix use of L1 MMIO areas in nested guests
There is currently some confusion between nested and L1 GPAs.  The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough.  What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache.  This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).

The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Changed comment to say "GPAs" instead of "L1's physical addresses", as
 per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:49 +02:00
Brijesh Singh 618232e219 KVM: x86: Avoid guest page table walk when gpa_available is set
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.

Currently, emulator_read_write_onepage() makes use of gpa_available flag
to avoid a guest page walk for a known MMIO regions. Lets not limit
the gpa_available optimization to just MMIO region. The patch extends
the check to avoid page walk whenever gpa_available flag is set.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
 new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:49 +02:00
Paolo Bonzini e08d26f071 KVM: x86: simplify ept_misconfig
Calling handle_mmio_page_fault() has been unnecessary since commit
e9ee956e31 ("KVM: x86: MMU: Move handle_mmio_page_fault() call to
kvm_mmu_page_fault()", 2016-02-22).

handle_mmio_page_fault() can now be made static.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18 14:37:48 +02:00
Jim Mattson d3802286fa kvm: x86: Disallow illegal IA32_APIC_BASE MSR values
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow
local APIC state transition constraints, but the value written must be
valid.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11 18:59:30 +02:00
Wanpeng Li 26eeb53cf0 KVM: MMU: Bail out immediately if there is no available mmu page
Bailing out immediately if there is no available mmu page to alloc.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11 18:59:29 +02:00
Wanpeng Li 42bcbebf11 KVM: MMU: Fix softlockup due to mmu_lock is held too long
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
 irq event stamp: 20532
 hardirqs last  enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
 hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
 softirqs last  enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
 softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
 CPU: 5 PID: 3089 Comm: warn_test Tainted: G           OE   4.13.0-rc3+ #8
 RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
 Call Trace:
  make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
  kvm_mmu_load+0x1cf/0x410 [kvm]
  kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
  kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? __fget+0xfc/0x210
  do_vfs_ioctl+0xa4/0x6a0
  ? __fget+0x11d/0x210
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x23/0xc2
  ? __this_cpu_preempt_check+0x13/0x20

This can be reproduced readily by ept=N and running syzkaller tests since
many syzkaller testcases don't setup any memory regions. However, if ept=Y
rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
which just hide the issue.

I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
softlockup.

This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
according to whether or not there is mmu page zapped.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11 18:59:28 +02:00
David Hildenbrand a057e0e22c KVM: nVMX: validate eptp pointer
Let's reuse the function introduced with eptp switching.

We don't explicitly have to check against enable_ept_ad_bits, as this
is implicitly done when checking against nested_vmx_ept_caps in
valid_ept_address().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11 18:53:22 +02:00
Paolo Bonzini eebed24389 kvm: nVMX: Add support for fast unprotection of nested guest page tables
This is the same as commit 147277540b ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:44:04 +02:00
Brijesh Singh 64531a3b70 KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guest
Commit 147277540b ("kvm: svm: Add support for additional SVM NPF error
codes", 2016-11-23) added a new error code to aid nested page fault
handling.  The commit unprotects (kvm_mmu_unprotect_page) the page when
we get a NPF due to guest page table walk where the page was marked RO.

However, if an L0->L2 shadow nested page table can also be marked read-only
when a page is read only in L1's nested page table.  If such a page
is accessed by L2 while walking page tables it can cause a nested
page fault (page table walks are write accesses).  However, after
kvm_mmu_unprotect_page we may get another page fault, and again in an
endless stream.

To cover this use case, we qualify the new error_code check with
vcpu->arch.mmu_direct_map so that the error_code check would run on L1
guest, and not the L2 guest.  This avoids hitting the above scenario.

Fixes: 147277540b
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:44:04 +02:00
Wanpeng Li bbeac2830f KVM: X86: Fix residual mmio emulation request to userspace
Reported by syzkaller:

The kvm-intel.unrestricted_guest=0

   WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
   CPU: 5 PID: 1014 Comm: warn_test Tainted: G        W  OE   4.13.0-rc3+ #8
   RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
   Call Trace:
    ? put_pid+0x3a/0x50
    ? rcu_read_lock_sched_held+0x79/0x80
    ? kmem_cache_free+0x2f2/0x350
    kvm_vcpu_ioctl+0x340/0x700 [kvm]
    ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
    ? __fget+0xfc/0x210
    do_vfs_ioctl+0xa4/0x6a0
    ? __fget+0x11d/0x210
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x23/0xc2
    ? __this_cpu_preempt_check+0x13/0x20

The syszkaller folks reported a residual mmio emulation request to userspace
due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
several threads to launch the same vCPU, the thread which lauch this vCPU after
the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
trigger the warning.

   #define _GNU_SOURCE
   #include <pthread.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/wait.h>
   #include <sys/types.h>
   #include <sys/stat.h>
   #include <sys/mman.h>
   #include <fcntl.h>
   #include <unistd.h>
   #include <linux/kvm.h>
   #include <stdio.h>

   int kvmcpu;
   struct kvm_run *run;

   void* thr(void* arg)
   {
     int res;
     res = ioctl(kvmcpu, KVM_RUN, 0);
     printf("ret1=%d exit_reason=%d suberror=%d\n",
         res, run->exit_reason, run->internal.suberror);
     return 0;
   }

   void test()
   {
     int i, kvm, kvmvm;
     pthread_t th[4];

     kvm = open("/dev/kvm", O_RDWR);
     kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
     kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
     run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
     srand(getpid());
     for (i = 0; i < 4; i++) {
       pthread_create(&th[i], 0, thr, 0);
       usleep(rand() % 10000);
     }
     for (i = 0; i < 4; i++)
       pthread_join(th[i], 0);
   }

   int main()
   {
     for (;;) {
       int pid = fork();
       if (pid < 0)
         exit(1);
       if (pid == 0) {
         test();
         exit(0);
       }
       int status;
       while (waitpid(pid, &status, __WALL) != pid) {}
     }
     return 0;
   }

This patch fixes it by resetting the vcpu->mmio_needed once we receive
the triple fault to avoid the residue.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 16:43:55 +02:00
Longpeng(Mike) de63ad4cf4 KVM: X86: implement the logic for spinlock optimization
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Longpeng(Mike) 199b5763d3 KVM: add spinlock optimization framework
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).

But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.

This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Radim Krčmář 1b4d56b86a KVM: x86: use general helpers for some cpuid manipulation
Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry().
Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has().

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 16:16:30 +02:00
Radim Krčmář d6321d4933 KVM: x86: generalize guest_cpuid_has_ helpers
This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid,
X86_FEATURE_XYZ), which gets rid of many very similar helpers.

When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but
this information isn't in common code, so we recreate it for KVM.

Add some BUILD_BUG_ONs to make sure that it runs nicely.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 16:11:50 +02:00
Radim Krčmář c6bd18011f KVM: x86: X86_FEATURE_NRIPS is not scattered anymore
bit(X86_FEATURE_NRIPS) is 3 since 2ccd71f1b2 ("x86/cpufeature: Move
some of the scattered feature bits to x86_capability"), so we can
simplify the code.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 16:09:38 +02:00
Bandan Das 41ab937274 KVM: nVMX: Emulate EPTP switching for the L1 hypervisor
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:22 +02:00
Bandan Das 27c42a1bb8 KVM: nVMX: Enable VMFUNC for the L1 hypervisor
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:21 +02:00
Bandan Das 2a499e49c2 KVM: vmx: Enable VMFUNCs
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:20 +02:00
David Hildenbrand 53a70daf3c KVM: nVMX: get rid of nested_release_page*
Let's also just use the underlying functions directly here.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
[Rebased on top of 9f744c5974 ("KVM: nVMX: do not pin the VMCS12")]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-07 15:29:20 +02:00
David Hildenbrand 5e2f30b756 KVM: nVMX: get rid of nested_get_page()
nested_get_page() just sounds confusing. All we want is a page from G1.
This is even unrelated to nested.

Let's introduce kvm_vcpu_gpa_to_page() so we don't get too lengthy
lines.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Squash pasto fix from Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 15:27:00 +02:00
Paolo Bonzini 90a2db6d86 KVM: nVMX: INVPCID support
Expose the "Enable INVPCID" secondary execution control to the guest
and properly reflect the exit reason.

In addition, before this patch the guest was always running with
INVPCID enabled, causing pcid.flat's "Test on INVPCID when disabled"
test to fail.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 15:26:06 +02:00
Ladi Prosek 72c139bacf KVM: hyperv: support HV_X64_MSR_TSC_FREQUENCY and HV_X64_MSR_APIC_FREQUENCY
It has been experimentally confirmed that supporting these two MSRs is one
of the necessary conditions for nested Hyper-V to use the TSC page. Modern
Windows guests are noticeably slower when they fall back to reading
timestamps from the HV_X64_MSR_TIME_REF_COUNT MSR instead of using the TSC
page.

The newly supported MSRs are advertised with the AccessFrequencyRegs
partition privilege flag and CPUID.40000003H:EDX[8] "Support for
determining timer frequencies is available" (both outside of the scope of
this KVM patch).

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 15:26:06 +02:00
Wanpeng Li 6550c4df7e KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"
------------[ cut here ]------------
 WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
 CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
 RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
Call Trace:
  vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
  ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
  ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
  ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
  kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? __fget+0xfc/0x210
  do_vfs_ioctl+0xa4/0x6a0
  ? __fget+0x11d/0x210
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x8f/0x750
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL64_slow_path+0x25/0x25

This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
means that tells the kernel to not make use of any IOAPICs that may be present
in the system.

Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
variable passed from vcpu_enter_guest() which means that the L0's userspace
requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
L0's userspace reqeusts an irq window) is true, so there is no interrupt which
L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
interrupt on exit" for the irq window requirement in this scenario.

This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
if there is no L1 requirement to inject an interrupt to L2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Added code comment to make it obvious that the behavior is not correct.
 We should do a userspace exit with open interrupt window instead of the
 nested VM exit.  This patch still improves the behavior, so it was
 accepted as a (temporary) workaround.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-03 15:38:11 +02:00
David Matlack c9f04407f2 KVM: nVMX: mark vmcs12 pages dirty on L2 exit
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.

Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-02 22:41:04 +02:00
David Matlack 8ca44e88c3 kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.

24.11.1 Software Use of Virtual-Machine Control Structures
...
  If a logical processor leaves VMX operation, any VMCSs active on
  that logical processor may be corrupted (see below). To prevent
  such corruption of a VMCS that may be used either after a return
  to VMX operation or on another logical processor, software should
  execute VMCLEAR for that VMCS before executing the VMXOFF instruction
  or removing power from the processor (e.g., as part of a transition
  to the S3 and S4 power states).
...

This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-02 22:41:03 +02:00
Paolo Bonzini 9f744c5974 KVM: nVMX: do not pin the VMCS12
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore.  current_vmptr is enough to read and write the VMCS12.

And David Matlack noted:

  This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
  VMCS12 page by using kvm_write_guest. nested_release_page() only marks
  the struct page dirty.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Added David Matlack's note and nested_release_page_clean() fix.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-02 22:41:03 +02:00
Longpeng(Mike) ebd28fcb55 KVM: X86: init irq->level in kvm_pv_kick_cpu_op
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' is random, it doesn't matter but
makes UBSAN unhappy:

UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
 [<ffffffff81f030b6>] dump_stack+0x1e/0x20
 [<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
 [<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
 [<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
 [<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
 [<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
 [<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
 [<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
 [<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-02 22:41:01 +02:00
Wanpeng Li f4ef191086 KVM: X86: Fix loss of pending INIT due to race
When SMP VM start, AP may lost INIT because of receiving INIT between
kvm_vcpu_ioctl_x86_get/set_vcpu_events.

       vcpu 0                             vcpu 1
                                   kvm_vcpu_ioctl_x86_get_vcpu_events
                                     events->smi.latched_init = 0
  send INIT to vcpu1
    set vcpu1's pending_events
                                   kvm_vcpu_ioctl_x86_set_vcpu_events
                                      if (events->smi.latched_init == 0)
                                        clear INIT in pending_events

This patch fixes it by just update SMM related flags if we are in SMM.

Thanks Peng Hao for the report and original commit message.

Reported-by: Peng Hao <peng.hao2@zte.com.cn>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-02 22:41:01 +02:00
Paolo Bonzini b96fb43977 KVM: nVMX: fixes to nested virt interrupt injection
There are three issues in nested_vmx_check_exception:

1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
by Wanpeng Li;

2) it should rebuild the interruption info and exit qualification fields
from scratch, as reported by Jim Mattson, because the values from the
L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
a page fault, the EPT misconfig's exit qualification is incorrect).

3) CR2 and DR6 should not be written for exception intercept vmexits
(CR2 only for AMD).

This patch fixes the first two and adds a comment about the last,
outlining the fix.

Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-01 22:24:17 +02:00
Paolo Bonzini 7313c69805 KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12
Do this in the caller of nested_vmx_vmexit instead.

nested_vmx_check_exception was doing a vmwrite to the vmcs02's
VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
the field to vmcs12->vm_exit_intr_error_code.  However that isn't
possible on pre-Haswell machines.  Moving the vmcs12 write to the
callers fixes it.

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
 thanks to fengguang.wu@intel.com]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-01 22:23:25 +02:00
Wanpeng Li 1d518c6820 KVM: LAPIC: Fix reentrancy issues with preempt notifiers
Preempt can occur in the preemption timer expiration handler:

          CPU0                    CPU1

  preemption timer vmexit
  handle_preemption_timer(vCPU0)
    kvm_lapic_expired_hv_timer
      hv_timer_is_use == true
  sched_out
                           sched_in
                           kvm_arch_vcpu_load
                             kvm_lapic_restart_hv_timer
                               restart_apic_timer
                                 start_hv_timer
                                   already-expired timer or sw timer triggerd in the window
                                 start_sw_timer
                                   cancel_hv_timer
                           /* back in kvm_lapic_expired_hv_timer */
                           cancel_hv_timer
                             WARN_ON(!apic->lapic_timer.hv_timer_in_use);  ==> Oops

This can be reproduced if CONFIG_PREEMPT is enabled.

------------[ cut here ]------------
 WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1563 kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm]
 CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G           OE   4.13.0-rc2+ #16
 RIP: 0010:kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm]
Call Trace:
  handle_preemption_timer+0xe/0x20 [kvm_intel]
  vmx_handle_exit+0xb8/0xd70 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm]
  ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
  ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
  kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? __fget+0xfc/0x210
  do_vfs_ioctl+0xa4/0x6a0
  ? __fget+0x11d/0x210
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x81/0x220
  entry_SYSCALL64_slow_path+0x25/0x25
 ------------[ cut here ]------------
 WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1498 cancel_hv_timer.isra.40+0x4f/0x60 [kvm]
 CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc2+ #16
 RIP: 0010:cancel_hv_timer.isra.40+0x4f/0x60 [kvm]
Call Trace:
  kvm_lapic_expired_hv_timer+0x3e/0xb0 [kvm]
  handle_preemption_timer+0xe/0x20 [kvm_intel]
  vmx_handle_exit+0xb8/0xd70 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm]
  ? kvm_arch_vcpu_load+0x47/0x230 [kvm]
  ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
  kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
  ? __fget+0xfc/0x210
  do_vfs_ioctl+0xa4/0x6a0
  ? __fget+0x11d/0x210
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x81/0x220
  entry_SYSCALL64_slow_path+0x25/0x25

This patch fixes it by making the caller of cancel_hv_timer, start_hv_timer
and start_sw_timer be in preemption-disabled regions, which trivially
avoid any reentrancy issue with preempt notifier.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Add more WARNs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-26 19:04:53 +02:00
Wanpeng Li 2d6144e366 KVM: nVMX: Fix loss of L2's NMI blocking state
Run kvm-unit-tests/eventinj.flat in L1 w/ ept=0 on both L0 and L1:

Before NMI IRET test
Sending NMI to self
NMI isr running stack 0x461000
Sending nested NMI to self
After nested NMI to self
Nested NMI isr running rip=40038e
After iret
After NMI to self
FAIL: NMI

Commit 4c4a6f790e (KVM: nVMX: track NMI blocking state separately
for each VMCS) tracks NMI blocking state separately for vmcs01 and
vmcs02. However it is not enough:

 - The L2 (kvm-unit-tests/eventinj.flat) generates NMI that will fault
   on IRET, so the L2 can generate #PF which can be intercepted by L0.
 - L0 walks L1's guest page table and sees the mapping is invalid, it
   resumes the L1 guest and injects the #PF into L1.  At this point the
   vmcs02 has nmi_known_unmasked=true.
 - L1 sets set bit 3 (blocking by NMI) in the interruptibility-state field
   of vmcs12 (and fixes the shadow page table) before resuming L2 guest.
 - L1 executes VMRESUME to resume L2, causing a vmexit to L0
 - during VMRESUME emulation, prepare_vmcs02 sets bit 3 in the
   interruptibility-state field of vmcs02, but nmi_known_unmasked is
   still true.
 - L2 immediately exits to L0 with another page fault, because L0 still has
   not updated the NGVA->HPA page tables.  However, nmi_known_unmasked is
   true so vmx_recover_nmi_blocking does not do anything.

The fix is to update nmi_known_unmasked when preparing vmcs02 from vmcs12.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-26 18:57:46 +02:00
Wincy Van 06a5524f09 KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode
The PI vector for L0 and L1 must be different. If dest vcpu0
is in guest mode while vcpu1 is delivering a non-nested PI to
vcpu0, there wont't be any vmexit so that the non-nested interrupt
will be delayed.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-26 18:57:46 +02:00
Paolo Bonzini a512177ef3 KVM: x86: do mask out upper bits of PAE CR3
This reverts the change of commit f85c758dbe,
as the behavior it modified was intended.

The VM is running in 32-bit PAE mode, and Table 4-7 of the Intel manual
says:

Table 4-7. Use of CR3 with PAE Paging
Bit Position(s)	Contents
4:0		Ignored
31:5		Physical address of the 32-Byte aligned
		page-directory-pointer table used for linear-address
		translation
63:32		Ignored (these bits exist only on processors supporting
		the Intel-64 architecture)

To placate the static checker, write the mask explicitly as an
unsigned long constant instead of using a 32-bit unsigned constant.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: f85c758dbe
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-26 18:57:45 +02:00
Paolo Bonzini fa19871a16 KVM: VMX: remove unused field
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-07-24 10:55:22 +02:00
Roman Kagan f1ff89ec44 kvm: x86: hyperv: avoid livelock in oneshot SynIC timers
If the SynIC timer message delivery fails due to SINT message slot being
busy, there's no point to attempt starting the timer again until we're
notified of the slot being released by the guest (via EOM or EOI).

Even worse, when a oneshot timer fails to deliver its message, its
re-arming with an expiration time in the past leads to immediate retry
of the delivery, and so on, without ever letting the guest vcpu to run
and release the slot, which results in a livelock.

To avoid that, only start the timer when there's no timer message
pending delivery.  When there is, meaning the slot is busy, the
processing will be restarted upon notification from the guest that the
slot is released.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-20 17:00:00 +02:00
Wanpeng Li f244deed7a KVM: VMX: Fix invalid guest state detection after task-switch emulation
This can be reproduced by EPT=1, unrestricted_guest=N, emulate_invalid_state=Y
or EPT=0, the trace of kvm-unit-tests/taskswitch2.flat is like below, it tries
to emulate invalid guest state task-switch:

kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
kvm_emulate_insn: 42000:0:0f 0b (0x2)
kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
kvm_inj_exception: #UD (0x0)
kvm_entry: vcpu 0
kvm_exit: reason TASK_SWITCH rip 0x0 info 40000058 0
kvm_emulate_insn: 42000:0:0f 0b (0x2)
kvm_emulate_insn: 42000:0:0f 0b (0x2) failed
kvm_inj_exception: #UD (0x0)
......................

It appears that the task-switch emulation updates rflags (and vm86
flag) only after the segments are loaded, causing vmx->emulation_required
to be set, when in fact invalid guest state emulation is not needed.

This patch fixes it by updating vmx->emulation_required after the
rflags (and vm86 flag) is updated in task-switch emulation.

Thanks Radim for moving the update to vmx__set_flags and adding Paolo's
suggestion for the check.

Suggested-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-20 17:00:00 +02:00
Arnd Bergmann c2ce3f5d89 x86: add MULTIUSER dependency for KVM
KVM tries to select 'TASKSTATS', which had additional dependencies:

warning: (KVM) selects TASKSTATS which has unmet direct dependencies (NET && MULTIUSER)

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-19 16:19:14 +02:00
Jim Mattson b3f1dfb6e8 KVM: nVMX: Disallow VM-entry in MOV-SS shadow
Immediately following MOV-to-SS/POP-to-SS, VM-entry is
disallowed. This check comes after the check for a valid VMCS. When
this check fails, the instruction pointer should fall through to the
next instruction, the ALU flags should be set to indicate VMfailValid,
and the VM-instruction error should be set to 26 ("VM entry with
events blocked by MOV SS").

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-19 16:19:13 +02:00
Paolo Bonzini 4c4a6f790e KVM: nVMX: track NMI blocking state separately for each VMCS
vmx_recover_nmi_blocking is using a cached value of the guest
interruptibility info, which is stored in vmx->nmi_known_unmasked.
vmx_recover_nmi_blocking is run for both normal and nested guests,
so the cached value must be per-VMCS.

This fixes eventinj.flat in a nested non-EPT environment.  With EPT it
works, because the EPT violation handler doesn't have the
vmx->nmi_known_unmasked optimization (it is unnecessary because, unlike
vmx_recover_nmi_blocking, it can just look at the exit qualification).

Thanks to Wanpeng Li for debugging the testcase and providing an initial
patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-19 16:05:41 +02:00
Dan Carpenter f85c758dbe KVM: x86: masking out upper bits
kvm_read_cr3() returns an unsigned long and gfn is a u64.  We intended
to mask out the bottom 5 bits but because of the type issue we mask the
top 32 bits as well.  I don't know if this is a real problem, but it
causes static checker warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-19 13:35:12 +02:00