1
0
Fork 0
Commit Graph

333 Commits (761cab513d5e9079ec8ace48ab05d84b0083653b)

Author SHA1 Message Date
Jim Mattson b119019847 kvm: nVMX: Remove unnecessary sync_roots from handle_invept
When L0 is executing handle_invept(), the TDP MMU is active. Emulating
an L1 INVEPT does require synchronizing the appropriate shadow EPT
root(s), but a call to kvm_mmu_sync_roots in this context won't do
that. Similarly, the hardware TLB and paging-structure-cache entries
associated with the appropriate shadow EPT root(s) must be flushed,
but requesting a TLB_FLUSH from this context won't do that either.

How did this ever work? KVM always does a sync_roots and TLB flush (in
the correct context) when transitioning from L1 to L2. That isn't the
best choice for nested VM performance, but it effectively papers over
the mistakes here.

Remove the unnecessary operations and leave a comment to try to do
better in the future.

Reported-by: Junaid Shahid <junaids@google.com>
Fixes: bfd0a56b90 ("nEPT: Nested INVEPT")
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Nadav Har'El <nyh@il.ibm.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Xinhao Xu <xinhao.xu@intel.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by Peter Shier <pshier@google.com>
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 19:01:56 +02:00
Vitaly Kuznetsov 11e349143e x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use
When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
according to TLFS, this should "transition an enlightened VMCS from the
active to the non-active state". It is, however, wrong to assume that
it is only valid to do VMCLEAR for the eVMCS which is currently active
on the vCPU performing VMCLEAR.

Currently, the logic in handle_vmclear() is broken: in case, there is no
active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal'
VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly
corrupts the memory area.

So, in case the VMCLEAR argument is not the current active eVMCS on the
vCPU, how can we know if the area it is pointing to is a normal or an
enlightened VMCS?
Thanks to the bug in Hyper-V (see commit 72aeb60c52 ("KVM: nVMX: Verify
eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can
not, the revision can't be used to distinguish between them. So let's
assume it is always enlightened in case enlightened vmentry is enabled in
the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to
minimize the impact for 'unenlightened' workloads.

Fixes: b8bbab928f ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 18:56:00 +02:00
Vitaly Kuznetsov a21a39c206 x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH
Apparently, Windows doesn't maintain clean fields data after it does
VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME.
The issue went unnoticed because currently we do nested_release_evmcs()
in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates
clean fields when a new eVMCS is mapped but we're going to change the
logic.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 18:56:00 +02:00
Paolo Bonzini 95c5c7c77c KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST
This allows userspace to know which MSRs are supported by the hypervisor.
Unfortunately userspace must resort to tricks for everything except
MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
One possibility is to use the feature control MSR, which is tied to nested
VMX as well and is present on all KVM versions that support feature MSRs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 17:36:18 +02:00
Paolo Bonzini e8a70bd4e9 KVM: nVMX: allow setting the VMFUNC controls MSR
Allow userspace to set a custom value for the VMFUNC controls MSR, as long
as the capabilities it advertises do not exceed those of the host.

Fixes: 27c42a1bb ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03)
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 17:36:12 +02:00
Paolo Bonzini 6defc59184 KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRS
Some secondary controls are automatically enabled/disabled based on the CPUID
values that are set for the guest.  However, they are still available at a
global level and therefore should be present when KVM_GET_MSRS is sent to
/dev/kvm.

Fixes: 1389309c81 ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26)
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02 17:35:57 +02:00
Linus Torvalds c884d8ac7f SPDX update for 5.2-rc6
Another round of SPDX updates for 5.2-rc6
 
 Here is what I am guessing is going to be the last "big" SPDX update for
 5.2.  It contains all of the remaining GPLv2 and GPLv2+ updates that
 were "easy" to determine by pattern matching.  The ones after this are
 going to be a bit more difficult and the people on the spdx list will be
 discussing them on a case-by-case basis now.
 
 Another 5000+ files are fixed up, so our overall totals are:
 	Files checked:            64545
 	Files with SPDX:          45529
 
 Compared to the 5.1 kernel which was:
 	Files checked:            63848
 	Files with SPDX:          22576
 This is a huge improvement.
 
 Also, we deleted another 20000 lines of boilerplate license crud, always
 nice to see in a diffstat.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
 kbsjBGEJpLbMRB2krnaU
 =RMcT
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull still more SPDX updates from Greg KH:
 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update
  for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
  that were "easy" to determine by pattern matching. The ones after this
  are going to be a bit more difficult and the people on the spdx list
  will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
	Files checked:            64545
	Files with SPDX:          45529

  Compared to the 5.1 kernel which was:
	Files checked:            63848
	Files with SPDX:          22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud,
  always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-21 09:58:42 -07:00
Paolo Bonzini 9fd5887726 KVM: nVMX: reorganize initial steps of vmx_set_nested_state
Commit 332d079735 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS
state before setting new state", 2019-05-02) broke evmcs_test because the
eVMCS setup must be performed even if there is no VMXON region defined,
as long as the eVMCS bit is set in the assist page.

While the simplest possible fix would be to add a check on
kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that
covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly.

Instead, this patch moves checks earlier in the function and
conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that
vmx_set_nested_state always goes through vmx_leave_nested
and nested_enable_evmcs.

Fixes: 332d079735 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state")
Cc: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-20 18:05:13 +02:00
Wanpeng Li 4d763b168e KVM: VMX: check CPUID before allowing read/write of IA32_XSS
Raise #GP when guest read/write IA32_XSS, but the CPUID bits
say that it shouldn't exist.

Fixes: 203000993d (kvm: vmx: add MSR logic for XSAVES)
Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Reported-by: Tao Xu <tao3.xu@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-20 14:21:51 +02:00
Thomas Gleixner 20c8ccb197 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2 see
  the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:53 +02:00
Liran Alon 6ca00dfafd KVM: x86: Modify struct kvm_nested_state to have explicit fields for data
Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
of VMX nested state data in a struct.

In order to avoid changing the ioctl values of
KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
sizeof(struct kvm_nested_state). This is done by defining the data
struct as "data.vmx[0]". It was the most elegant way I found to
preserve struct size while still keeping struct readable and easy to
maintain. It does have a misfortunate side-effect that now it has to be
accessed as "data.vmx[0]" rather than just "data.vmx".

Because we are already modifying these structs, I also modified the
following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
[Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-19 16:11:52 +02:00
Paolo Bonzini eceb9973d9 KVM: nVMX: shadow pin based execution controls
The VMX_PREEMPTION_TIMER flag may be toggled frequently, though not
*very* frequently.  Since it does not affect KVM's dirty logic, e.g.
the preemption timer value is loaded from vmcs12 even if vmcs12 is
"clean", there is no need to mark vmcs12 dirty when L1 writes pin
controls, and shadowing the field achieves that.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 17:10:50 +02:00
Sean Christopherson 804939ea20 KVM: VMX: Leave preemption timer running when it's disabled
VMWRITEs to the major VMCS controls, pin controls included, are
deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
optimize away consistency checks on VM-Entry, i.e. skip consistency
checks if the relevant fields have not changed since the last successful
VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
uCode's dirty VMCS field tracking isn't as precise as software would
prefer.  Notably, writing any of the major VMCS fields effectively marks
the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
consistency checks, which consumes several hundred cycles.

As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
doubles the latency of the next VM-Entry (and again when/if the flag is
toggled back).  In a non-nested scenario, running a "standard" guest
with the preemption timer enabled, toggling the timer flag is uncommon
but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
timer can change these numbers due to its use for "immediate exits",
even when explicitly disabled by userspace.

Nested virtualization in particular is painful, as the timer flag is set
for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
pin controls to *clear* the flag since its the timer's final state isn't
known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
up unnecessarily writing pin controls *twice*.

Rather than toggle the timer flag in pin controls, set the timer value
itself to the largest allowed value to put it into a "soft disabled"
state, and ignore any spurious preemption timer exits.

Sadly, the timer is a 32-bit value and so theoretically it can fire
before the head death of the universe, i.e. spurious exits are possible.
But because KVM does *not* save the timer value on VM-Exit and because
the timer runs at a slower rate than the TSC, the maximuma timer value
is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
with a timer that runs at 1/32 the frequency of a 2.4ghz constant-rate
TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
execution.  In other words, spurious VM-Exits are effectively only
possible if the host is completely tickless on the logical CPU, the
guest is not using the preemption timer, and the guest is not generating
VM-Exits for any other reason.

To be safe from bad/weird hardware, disable the preemption timer if its
maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
and was selected in no small part because it's a nice round number.
For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
if the preemption timer is disabled by KVM or userspace.  Previously
KVM continued to use the preemption timer to force immediate exits even
when the timer was disabled by userspace.  Now that KVM leaves the timer
running instead of truly disabling it, allow userspace to kill it
entirely in the unlikely event the timer (or KVM) malfunctions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 17:10:46 +02:00
Sean Christopherson 9d99cc49a4 KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs'
... now that it is fully redundant with the pin controls shadow.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:46 +02:00
Sean Christopherson 469debdb8b KVM: nVMX: Preset *DT exiting in vmcs02 when emulating UMIP
KVM dynamically toggles SECONDARY_EXEC_DESC to intercept (a subset of)
instructions that are subject to User-Mode Instruction Prevention, i.e.
VMCS.SECONDARY_EXEC_DESC == CR4.UMIP when emulating UMIP.  Preset the
VMCS control when preparing vmcs02 to avoid unnecessarily VMWRITEs,
e.g. KVM will clear VMCS.SECONDARY_EXEC_DESC in prepare_vmcs02_early()
and then set it in vmx_set_cr4().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:45 +02:00
Sean Christopherson de0286b788 KVM: nVMX: Preserve last USE_MSR_BITMAPS when preparing vmcs02
KVM dynamically toggles the CPU_BASED_USE_MSR_BITMAPS execution control
for nested guests based on whether or not both L0 and L1 want to pass
through the same MSRs to L2.  Preserve the last used value from vmcs02
so as to avoid multiple VMWRITEs to (re)set/(re)clear the bit on nested
VM-Entry.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:45 +02:00
Sean Christopherson 3af80fec6e KVM: VMX: Explicitly initialize controls shadow at VMCS allocation
Or: Don't re-initialize vmcs02's controls on every nested VM-Entry.

VMWRITEs to the major VMCS controls are deceptively expensive.  Intel
CPUs with VMCS caching (Westmere and later) also optimize away
consistency checks on VM-Entry, i.e. skip consistency checks if the
relevant fields have not changed since the last successful VM-Entry (of
the cached VMCS).  Because uops are a precious commodity, uCode's dirty
VMCS field tracking isn't as precise as software would prefer.  Notably,
writing any of the major VMCS fields effectively marks the entire VMCS
dirty, i.e. causes the next VM-Entry to perform all consistency checks,
which consumes several hundred cycles.

Zero out the controls' shadow copies during VMCS allocation and use the
optimized setter when "initializing" controls.  While this technically
affects both non-nested and nested virtualization, nested virtualization
is the primary beneficiary as avoid VMWRITEs when prepare vmcs02 allows
hardware to optimizie away consistency checks.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:44 +02:00
Sean Christopherson ae81d08993 KVM: nVMX: Don't reset VMCS controls shadow on VMCS switch
... now that the shadow copies are per-VMCS.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:44 +02:00
Sean Christopherson 09e226cf07 KVM: nVMX: Shadow VMCS controls on a per-VMCS basis
... to pave the way for not preserving the shadow copies across switches
between vmcs01 and vmcs02, and eventually to avoid VMWRITEs to vmcs02
when the desired value is unchanged across nested VM-Enters.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:43 +02:00
Sean Christopherson fe7f895dae KVM: VMX: Shadow VMCS secondary execution controls
Prepare to shadow all major control fields on a per-VMCS basis, which
allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:42 +02:00
Sean Christopherson 2183f5645a KVM: VMX: Shadow VMCS primary execution controls
Prepare to shadow all major control fields on a per-VMCS basis, which
allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02,
and more importantly can eliminate costly VMWRITEs to controls when
preparing vmcs02.

Shadowing exec controls also saves a VMREAD when opening virtual
INTR/NMI windows, yay...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:42 +02:00
Sean Christopherson c5f2c76643 KVM: VMX: Shadow VMCS pin controls
Prepare to shadow all major control fields on a per-VMCS basis, which
allows KVM to avoid costly VMWRITEs when switching between vmcs01 and
vmcs02.

Shadowing pin controls also allows a future patch to remove the per-VMCS
'hv_timer_armed' flag, as the shadow copy is a superset of said flag.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:41 +02:00
Sean Christopherson 70f932ecdf KVM: VMX: Add builder macros for shadowing controls
... to pave the way for shadowing all (five) major VMCS control fields
without massive amounts of error prone copy+paste+modify.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:40 +02:00
Sean Christopherson c075c3e49d KVM: nVMX: Use adjusted pin controls for vmcs02
KVM provides a module parameter to allow disabling virtual NMI support
to simplify testing (hardware *without* virtual NMI support is hard to
come by but it does have users).  When preparing vmcs02, use the accessor
for pin controls to ensure that the module param is respected for nested
guests.

Opportunistically swap the order of applying L0's and L1's pin controls
to better align with other controls and to prepare for a future patche
that will ignore L1's, but not L0's, preemption timer flag.

Fixes: d02fcf5077 ("kvm: vmx: Allow disabling virtual NMI support")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:40 +02:00
Sean Christopherson c7554efc83 KVM: nVMX: Copy PDPTRs to/from vmcs12 only when necessary
Per Intel's SDM:

  ... the logical processor uses PAE paging if CR0.PG=1, CR4.PAE=1 and
  IA32_EFER.LME=0.  A VM entry to a guest that uses PAE paging loads the
  PDPTEs into internal, non-architectural registers based on the setting
  of the "enable EPT" VM-execution control.

and:

  [GUEST_PDPTR] values are saved into the four PDPTE fields as follows:

    - If the "enable EPT" VM-execution control is 0 or the logical
      processor was not using PAE paging at the time of the VM exit,
      the values saved are undefined.

In other words, if EPT is disabled or the guest isn't using PAE paging,
then the PDPTRS aren't consumed by hardware on VM-Entry and are loaded
with junk on VM-Exit.  From a nesting perspective, all of the above hold
true, i.e. KVM can effectively ignore the VMCS PDPTRs.  E.g. KVM already
loads the PDPTRs from memory when nested EPT is disabled (see
nested_vmx_load_cr3()).

Because KVM intercepts setting CR4.PAE, there is no danger of consuming
a stale value or crushing L1's VMWRITEs regardless of whether L1
intercepts CR4.PAE. The vmcs12's values are unchanged up until the
VM-Exit where L2 sets CR4.PAE, i.e. L0 will see the new PAE state on the
subsequent VM-Entry and propagate the PDPTRs from vmcs12 to vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:39 +02:00
Paolo Bonzini bf03d4f933 KVM: x86: introduce is_pae_paging
Checking for 32-bit PAE is quite common around code that fiddles with
the PDPTRs.  Add a function to compress all checks into a single
invocation.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:38 +02:00
Sean Christopherson c27e5b0d13 KVM: nVMX: Don't update GUEST_BNDCFGS if it's clean in HV eVMCS
L1 is responsible for dirtying GUEST_GRP1 if it writes GUEST_BNDCFGS.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:38 +02:00
Sean Christopherson 699a1ac214 KVM: nVMX: Update vmcs12 for MSR_IA32_DEBUGCTLMSR when it's written
KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR.  In the
unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR, but
but saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as
to eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:37 +02:00
Sean Christopherson de70d27970 KVM: nVMX: Update vmcs12 for SYSENTER MSRs when they're written
For L2, KVM always intercepts WRMSR to SYSENTER MSRs.  Update vmcs12 in
the WRMSR handler so that they don't need to be (re)read from vmcs02 on
every nested VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:37 +02:00
Sean Christopherson 142e4be77b KVM: nVMX: Update vmcs12 for MSR_IA32_CR_PAT when it's written
As alluded to by the TODO comment, KVM unconditionally intercepts writes
to the PAT MSR.  In the unlikely event that L1 allows L2 to write L1's
PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes
the PAT.  This eliminates the need to VMREAD the value from vmcs02 on
VM-Exit as vmcs12 is already up to date in all situations.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:36 +02:00
Sean Christopherson a49700b66e KVM: nVMX: Don't speculatively write APIC-access page address
If nested_get_vmcs12_pages() fails to map L1's APIC_ACCESS_ADDR into
L2, then it disables SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES in vmcs02.
In other words, the APIC_ACCESS_ADDR in vmcs02 is guaranteed to be
written with the correct value before being consumed by hardware, drop
the unneessary VMWRITE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:35 +02:00
Sean Christopherson ca2f5466f8 KVM: nVMX: Don't speculatively write virtual-APIC page address
The VIRTUAL_APIC_PAGE_ADDR in vmcs02 is guaranteed to be updated before
it is consumed by hardware, either in nested_vmx_enter_non_root_mode()
or via the KVM_REQ_GET_VMCS12_PAGES callback.  Avoid an extra VMWRITE
and only stuff a bad value into vmcs02 when mapping vmcs12's address
fails.  This also eliminates the need for extra comments to connect the
dots between prepare_vmcs02_early() and nested_get_vmcs12_pages().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:35 +02:00
Sean Christopherson 73cb855684 KVM: nVMX: Don't dump VMCS if virtual APIC page can't be mapped
... as a malicious userspace can run a toy guest to generate invalid
virtual-APIC page addresses in L1, i.e. flood the kernel log with error
messages.

Fixes: 690908104e ("KVM: nVMX: allow tests to use bad virtual-APIC page address")
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:21 +02:00
Sean Christopherson 8ef863e67a KVM: nVMX: Don't reread VMCS-agnostic state when switching VMCS
When switching between vmcs01 and vmcs02, there is no need to update
state tracking for values that aren't tied to any particular VMCS as
the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only
be called when the vCPU is loaded).

Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and
posted-interrupt update (cmpxchg64() and more).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:47:06 +02:00
Sean Christopherson 13b964a29d KVM: nVMX: Don't "put" vCPU or host state when switching VMCS
When switching between vmcs01 and vmcs02, KVM isn't actually switching
between guest and host.  If guest state is already loaded (the likely,
if not guaranteed, case), keep the guest state loaded and manually swap
the loaded_cpu_state pointer after propagating saved host state to the
new vmcs0{1,2}.

Avoiding the switch between guest and host reduces the latency of
switching between vmcs01 and vmcs02 by several hundred cycles, and
reduces the roundtrip time of a nested VM by upwards of 1000 cycles.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:55 +02:00
Paolo Bonzini b464f57e13 KVM: VMX: simplify vmx_prepare_switch_to_{guest,host}
vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs,
so change it to a bool.  Because the direction of the bool is
now the opposite of vmx->guest_msrs_dirty, change the direction of
vmx->guest_msrs_dirty so that they match.

Finally, do not imply that MSRs have to be reloaded when
vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready
to false explicitly in vmx_prepare_switch_to_host.

Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:54 +02:00
Sean Christopherson 4d6c989284 KVM: nVMX: Don't rewrite GUEST_PML_INDEX during nested VM-Entry
Emulation of GUEST_PML_INDEX for a nested VMM is a bit weird.  Because
L0 flushes the PML on every VM-Exit, the value in vmcs02 at the time of
VM-Enter is a constant -1, regardless of what L1 thinks/wants.

Fixes: 09abe32002 ("KVM: nVMX: split pieces of prepare_vmcs02() to prepare_vmcs02_early()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:53 +02:00
Sean Christopherson c538d57f67 KVM: nVMX: Write ENCLS-exiting bitmap once per vmcs02
KVM doesn't yet support SGX virtualization, i.e. writes a constant value
to ENCLS_EXITING_BITMAP so that it can intercept ENCLS and inject a #UD.

Fixes: 0b665d3040 ("KVM: vmx: Inject #UD for SGX ENCLS instruction in guest")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:53 +02:00
Sean Christopherson 3b013a2972 KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01
If L1 does not set VM_ENTRY_LOAD_BNDCFGS, then L1's BNDCFGS value must
be propagated to vmcs02 since KVM always runs with VM_ENTRY_LOAD_BNDCFGS
when MPX is supported.  Because the value effectively comes from vmcs01,
vmcs02 must be updated even if vmcs12 is clean.

Fixes: 62cf9bd811 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:52 +02:00
Sean Christopherson d28f4290b5 KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value
The behavior of WRMSR is in no way dependent on whether or not KVM
consumes the value.

Fixes: 4566654bb9 ("KVM: vmx: Inject #GP on invalid PAT CR")
Cc: stable@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:51 +02:00
Paolo Bonzini b1346ab2af KVM: nVMX: Rename prepare_vmcs02_*_full to prepare_vmcs02_*_rare
These function do not prepare the entire state of the vmcs02, only the
rarely needed parts.  Rename them to make this clearer.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:51 +02:00
Sean Christopherson 7952d769c2 KVM: nVMX: Sync rarely accessed guest fields only when needed
Many guest fields are rarely read (or written) by VMMs, i.e. likely
aren't accessed between runs of a nested VMCS.  Delay pulling rarely
accessed guest fields from vmcs02 until they are VMREAD or until vmcs12
is dirtied.  The latter case is necessary because nested VM-Entry will
consume all manner of fields when vmcs12 is dirty, e.g. for consistency
checks.

Note, an alternative to synchronizing all guest fields on VMREAD would
be to read *only* the field being accessed, but switching VMCS pointers
is expensive and odds are good if one guest field is being accessed then
others will soon follow, or that vmcs12 will be dirtied due to a VMWRITE
(see above).  And the full synchronization results in slightly cleaner
code.

Note, although GUEST_PDPTRs are relevant only for a 32-bit PAE guest,
they are accessed quite frequently for said guests, and a separate patch
is in flight to optimize away GUEST_PDTPR synchronziation for non-PAE
guests.

Skipping rarely accessed guest fields reduces the latency of a nested
VM-Exit by ~200 cycles.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:50 +02:00
Sean Christopherson e2174295b4 KVM: nVMX: Add helpers to identify shadowed VMCS fields
So that future optimizations related to shadowed fields don't need to
define their own switch statement.

Add a BUILD_BUG_ON() to ensure at least one of the types (RW vs RO) is
defined when including vmcs_shadow_fields.h (guess who keeps mistyping
SHADOW_FIELD_RO as SHADOW_FIELD_R0).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:47 +02:00
Sean Christopherson 3731905ef2 KVM: nVMX: Use descriptive names for VMCS sync functions and flags
Nested virtualization involves copying data between many different types
of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS.  Rename a variety
of functions and flags to document both the source and destination of
each sync.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:06 +02:00
Sean Christopherson f4f8316d2a KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12()
... to make it more obvious that sync_vmcs12() is invoked on all nested
VM-Exits, e.g. hiding sync_vmcs12() in prepare_vmcs12() makes it appear
that guest state is NOT propagated to vmcs12 for a normal VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:06 +02:00
Sean Christopherson 1c6f0b47fb KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields
The vmcs12 fields offsets are constant and known at compile time.  Store
the associated offset for each shadowed field to avoid the costly lookup
in vmcs_field_to_offset() when copying between vmcs12 and the shadow
VMCS.  Avoiding the costly lookup reduces the latency of copying by
~100 cycles in each direction.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:05 +02:00
Sean Christopherson b643780562 KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES
VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit
mode and CPL respectively, but effectively never write said fields once
the VM is initialized.  Intercepting VMWRITEs for the two fields saves
~55 cycles in copy_shadow_to_vmcs12().

Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the
guest access rights fields on VMWRITE, exposing the fields to L1 for
VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2.
On hardware that drops the bits, L1 will see the stripped down value due
to reading the value from hardware, while L2 will see the full original
value as stored by KVM.  To avoid such an inconsistency, emulate the
behavior on all CPUS, but only for intercepted VMWRITEs so as to avoid
introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the
emulation were added to vmcs12_write_any().

Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a
future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM
would end up with incosistent behavior on pre-Haswell hardware, e.g. KVM
would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE
to the shadow VMCS would not drop the bits.  Add a WARN in the shadow
field initialization to detect any attempt to expose an AR_BYTES field
without updating vmcs12_write_any().

Note, emulation of the AR_BYTES reserved bit behavior is based on a
patch[1] from Jim Mattson that applied the emulation to all writes to
vmcs12 so that live migration across different generations of hardware
would not introduce divergent behavior.  But given that live migration
of nested state has already been enabled, that ship has sailed (not to
mention that no sane VMM will be affected by this behavior).

[1] https://patchwork.kernel.org/patch/10483321/

Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:05 +02:00
Sean Christopherson fadcead00c KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields
Allowing L1 to VMWRITE read-only fields is only beneficial in a double
nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal
non-nested operation.  Intercepting RO fields means KVM doesn't need to
sync them from the shadow VMCS to vmcs12 when running L2.  The obvious
downside is that L1 will VM-Exit more often when running L3, but it's
likely safe to assume most folks would happily sacrifice a bit of L3
performance, which may not even be noticeable in the grande scheme, to
improve L2 performance across the board.

Not intercepting fields tagged read-only also allows for additional
optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
since those fields are rarely written by a VMMs, but read frequently.

When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields
that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to
the shadow VMCS during VMWRITE emulation, otherwise a subsequence VMREAD
from L1 will consume a stale value.

Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field"
is not exposed to L1, but only so that it can reject the VMWRITE, i.e.
propagating the VMWRITE to the shadow VMCS is a new requirement, not a
bug fix.

Eliminating the copying of RO fields reduces the latency of nested
VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles
if/when the AR_BYTES fields are exposed RO).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:04 +02:00
Sean Christopherson 95b5a48c4f KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn
Per commit 1b6269db3f ("KVM: VMX: Handle NMIs before enabling
interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
to "make sure we handle NMI on the current cpu, and that we don't
service maskable interrupts before non-maskable ones".  The other
exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
have similar requirements, and are located there to avoid extra VMREADs
since VMX bins hardware exceptions and NMIs into a single exit reason.

Clean up the code and eliminate the vaguely named complete_atomic_exit()
by moving the interrupts-disabled exception and NMI handling into the
existing handle_external_intrs() callback, and rename the callback to
a more appropriate name.  Rename VMexit handlers throughout so that the
atomic and non-atomic counterparts have similar names.

In addition to improving code readability, this also ensures the NMI
handler is run with the host's debug registers loaded in the unlikely
event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
field is also improved when handling NMIs (and #MCs) as the handler
will run after updating said field.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Naming cleanups. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:04 +02:00
Sean Christopherson 165072b089 KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code
VMX can conditionally call kvm_{before,after}_interrupt() since KVM
always uses "ack interrupt on exit" and therefore explicitly handles
interrupts as opposed to blindly enabling irqs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:03 +02:00
Sean Christopherson 2342080cd6 KVM: VMX: Store the host kernel's IDT base in a global variable
Although the kernel may use multiple IDTs, KVM should only ever see the
"real" IDT, e.g. the early init IDT is long gone by the time KVM runs
and the debug stack IDT is only used for small windows of time in very
specific flows.

Before commit a547c6db4d ("KVM: VMX: Enable acknowledge interupt on
vmexit"), the kernel's IDT base was consumed by KVM only when setting
constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
host state is done once per vCPU, there was ostensibly no need to cache
the kernel's IDT base.

When support for "ack interrupt on exit" was introduced, KVM added a
second consumer of the IDT base as handling already-acked interrupts
requires directly calling the interrupt handler, i.e. KVM uses the IDT
base to find the address of the handler.  Because interrupts are a fast
path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
Presumably, the IDT base was cached on a per-vCPU basis simply because
the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.

Note, all post-boot IDTs use the same handlers for external interrupts,
i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
even if the cached IDT somehow did not match the current IDT.  And as
for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
above analysis is wrong then KVM has had a bug since the beginning of
time since KVM has effectively been caching the IDT at vCPU creation
since commit a8b732ca01c ("[PATCH] kvm: userspace interface").

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:02 +02:00
Sean Christopherson 49def500e5 KVM: VMX: Read cached VM-Exit reason to detect external interrupt
Generic x86 code invokes the kvm_x86_ops external interrupt handler on
all VM-Exits regardless of the actual exit type.  Use the already-cached
EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
types of VM-Exit.

In addition to avoiding the extra VMREAD, checking the EXIT_REASON
instead of VM_EXIT_INTR_INFO makes it more obvious that
vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
unfamiliar with the flow might wonder under what condition(s)
VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
simply not possible since KVM always runs with "ack interrupt on exit".

WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
hardware bug.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:02 +02:00
Paolo Bonzini 2ea7203980 kvm: nVMX: small cleanup in handle_exception
The reason for skipping handling of NMI and #MC in handle_exception is
the same, namely they are handled earlier by vmx_complete_atomic_exit.
Calling the machine check handler (which just returns 1) is misleading,
don't do it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:46:01 +02:00
Sean Christopherson beb8d93b3e KVM: VMX: Fix handling of #MC that occurs during VM-Entry
A previous fix to prevent KVM from consuming stale VMCS state after a
failed VM-Entry inadvertantly blocked KVM's handling of machine checks
that occur during VM-Entry.

Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
depending on when the #MC is recognoized.  As it pertains to this bug
fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
indicate the VM-Entry failed.

If a machine-check event occurs during a VM entry, one of the following occurs:
 - The machine-check event is handled as if it occurred before the VM entry:
        ...
 - The machine-check event is handled after VM entry completes:
        ...
 - A VM-entry failure occurs as described in Section 26.7. The basic
   exit reason is 41, for "VM-entry failure due to machine-check event".

Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
in a sane fashion and also simplifies vmx_complete_atomic_exit() since
VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.

Fixes: b060ca3b2e ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:45:44 +02:00
Paolo Bonzini 73f624f47c KVM: x86: move MSR_IA32_POWER_CTL handling to common code
Make it available to AMD hosts as well, just in case someone is trying
to use an Intel processor's CPUID setup.

Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:43:48 +02:00
Eugene Korenevsky fdb28619a8 kvm: vmx: segment limit check: use access length
There is an imperfection in get_vmx_mem_address(): access length is ignored
when checking the limit. To fix this, pass access length as a function argument.
The access length is usually obvious since it is used by callers after
get_vmx_mem_address() call, but for vmread/vmwrite it depends on the
state of 64-bit mode.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:43:45 +02:00
Eugene Korenevsky c1a9acbc52 kvm: vmx: fix limit checking in get_vmx_mem_address()
Intel SDM vol. 3, 5.3:
The processor causes a
general-protection exception (or, if the segment is SS, a stack-fault
exception) any time an attempt is made to access the following addresses
in a segment:
- A byte at an offset greater than the effective limit
- A word at an offset greater than the (effective-limit – 1)
- A doubleword at an offset greater than the (effective-limit – 3)
- A quadword at an offset greater than the (effective-limit – 7)

Therefore, the generic limit checking error condition must be

exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit)

but not

exn = (off + access_len > limit)

as for now.

Also avoid integer overflow of `off` at 32-bit KVM by casting it to u64.

Note: access length is currently sizeof(u64) which is incorrect. This
will be fixed in the subsequent patch.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:43:45 +02:00
Liran Alon 1fc5d19472 KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15
Make all code consistent with kvm_deliver_exception_payload() by using
appropriate symbolic constant instead of hard-coded number.

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18 11:43:42 +02:00
Vitaly Kuznetsov f9bc522765 KVM: nVMX: use correct clean fields when copying from eVMCS
Unfortunately, a couple of mistakes were made while implementing
Enlightened VMCS support, in particular, wrong clean fields were
used in copy_enlightened_to_vmcs12():
- exception_bitmap is covered by CONTROL_EXCPN;
- vm_exit_controls/pin_based_vm_exec_control/secondary_vm_exec_control
  are covered by CONTROL_GRP1.

Fixes: 945679e301 ("KVM: nVMX: add enlightened VMCS state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-13 16:05:29 +02:00
Uros Bizjak 1ae4de23ed KVM: VMX: remove unneeded 'asm volatile ("")' from vmcs_write64
__vmcs_writel uses volatile asm, so there is no need to insert another
one between the first and the second call to __vmcs_writel in order
to prevent unwanted code moves for 32bit targets.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-05 14:14:49 +02:00
Jan Beulich 5a253552a5 x86/kvm/VMX: drop bad asm() clobber from nested_vmx_check_vmentry_hw()
While upstream gcc doesn't detect conflicts on cc (yet), it really
should, and hence "cc" should not be specified for asm()-s also having
"=@cc<cond>" outputs. (It is quite pointless anyway to specify a "cc"
clobber in x86 inline assembly, since the compiler assumes it to be
always clobbered, and has no means [yet] to suppress this behavior.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Fixes: bbc0b82392 ("KVM: nVMX: Capture VM-Fail via CC_{SET,OUT} in nested early checks")
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-05 14:14:48 +02:00
Wanpeng Li b51700632e KVM: X86: Provide a capability to disable cstate msr read intercepts
Allow guest reads CORE cstate when exposing host CPU power management capabilities
to the guest. PKG cstate is restricted to avoid a guest to get the whole package
information in multi-tenant scenario.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:35 +02:00
Sean Christopherson f257d6dcda KVM: Directly return result from kvm_arch_check_processor_compat()
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
boilerplate ugliness of checking virtualization support on all CPUs is
hidden from the arch specific code.  x86's implementation in particular
is quite heinous, as it unnecessarily propagates the out-param pattern
into kvm_x86_ops.

While the x86 specific issue could be resolved solely by changing
kvm_x86_ops, make the change for all architectures as returning a value
directly is prettier and technically more robust, e.g. s390 doesn't set
the out param, which could lead to subtle breakage in the (highly
unlikely) scenario where the out-param was not pre-initialized by the
caller.

Opportunistically annotate svm_check_processor_compat() with __init.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:32 +02:00
Wanpeng Li b6c4bc659c KVM: LAPIC: Optimize timer latency further
Advance lapic timer tries to hidden the hypervisor overhead between the
host emulated timer fires and the guest awares the timer is fired. However,
it just hidden the time between apic_timer_fn/handle_preemption_timer ->
wait_lapic_expire, instead of the real position of vmentry which is
mentioned in the orignial commit d0659d946b ("KVM: x86: add option to
advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
the end of wait_lapic_expire and before world switch on my haswell desktop.

This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
it takes the real overhead time between apic_timer_fn/handle_preemption_timer
and before world switch into consideration when adaptively tuning timer
advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04 19:27:30 +02:00
Paolo Bonzini 2924b52117 KVM: x86/pmu: do not mask the value that is written to fixed PMUs
According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
each MSR may be written with any value, and the high-order 8 bits are
sign-extended according to the value of bit 31", but the fixed counters
in real hardware are limited to the width of the fixed counters ("bits
beyond the width of the fixed-function counter are reserved and must be
written as zeros").  Fix KVM to do the same.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:14 +02:00
Paolo Bonzini 0e6f467ee2 KVM: x86/pmu: mask the result of rdpmc according to the width of the counters
This patch will simplify the changes in the next, by enforcing the
masking of the counters to RDPMC and RDMSR.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:13 +02:00
Paolo Bonzini 6f2f84532c KVM: x86: do not spam dmesg with VMCS/VMCB dumps
Userspace can easily set up invalid processor state in such a way that
dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
using a module parameter.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:12 +02:00
Yi Wang 4d25996565 kvm: vmx: Fix -Wmissing-prototypes warnings
We get a warning when build kernel W=1:
arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes]
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)

Add the missing declaration to fix this.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:08 +02:00
Wanpeng Li 541e886f79 KVM: nVMX: Fix using __this_cpu_read() in preemptible context
BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
  caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
  CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G           OE     5.1.0-rc4+ #1
  Call Trace:
   dump_stack+0x67/0x95
   __this_cpu_preempt_check+0xd2/0xe0
   nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
   nested_vmx_run+0xda/0x2b0 [kvm_intel]
   handle_vmlaunch+0x13/0x20 [kvm_intel]
   vmx_handle_exit+0xbd/0x660 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
   kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
   do_vfs_ioctl+0xa5/0x6e0
   ksys_ioctl+0x6d/0x80
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x6f/0x6c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Accessing per-cpu variable should disable preemption, this patch extends the
preemption disable region for __this_cpu_read().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:08 +02:00
Sean Christopherson 21be4ca1ea KVM: nVMX: Clear nested_run_pending if setting nested state fails
VMX's nested_run_pending flag is subtly consumed when stuffing state to
enter guest mode, i.e. needs to be set according before KVM knows if
setting guest state is successful.  If setting guest state fails, clear
the flag as a nested run is obviously not pending.

Reported-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:03 +02:00
Paolo Bonzini db80927ea1 KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATE
The offset for reading the shadow VMCS is sizeof(*kvm_state)+VMCS12_SIZE,
so the correct size must be that plus sizeof(*vmcs12).  This could lead
to KVM reading garbage data from userspace and not reporting an error,
but is otherwise not sensitive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:02 +02:00
Linus Torvalds 0ef0fd3515 * ARM: support for SVE and Pointer Authentication in guests, PMU improvements
* POWER: support for direct access to the POWER9 XIVE interrupt controller,
 memory and performance optimizations.
 
 * x86: support for accessing memory not backed by struct page, fixes and refactoring
 
 * Generic: dirty page tracking improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg
 PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE
 nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy
 Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7
 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n
 hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I=
 =D0+p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
2019-05-17 10:33:30 -07:00
Sean Christopherson f93f7ede08 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
The RDPMC-exiting control is dependent on the existence of the RDPMC
instruction itself, i.e. is not tied to the "Architectural Performance
Monitoring" feature.  For all intents and purposes, the control exists
on all CPUs with VMX support since RDPMC also exists on all VCPUs with
VMX supported.  Per Intel's SDM:

  The RDPMC instruction was introduced into the IA-32 Architecture in
  the Pentium Pro processor and the Pentium processor with MMX technology.
  The earlier Pentium processors have performance-monitoring counters, but
  they must be read with the RDMSR instruction.

Because RDPMC-exiting always exists, KVM requires the control and refuses
to load if it's not available.  As a result, hiding the PMU from a guest
breaks nested virtualization if the guest attemts to use KVM.

While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit
check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization
for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE
checks, but before the counter referenced in ECX is checked for validity.

In other words, the original KVM behavior of injecting a #GP was correct,
and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP
when the unit test guest (L3 in this case) executes RDPMC without
RDPMC-exiting set in the unit test host (L2).

This reverts commit e51bfdb687.

Fixes: e51bfdb687 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU")
Reported-by: David Hill <hilld@binarystorm.net>
Cc: Saar Amar <saaramar@microsoft.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15 23:19:19 +02:00
Sean Christopherson d69129b4e4 KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
If L1 is using an MSR bitmap, unconditionally merge the MSR bitmaps from
L0 and L1 for MSR_{KERNEL,}_{FS,GS}_BASE.  KVM unconditionally exposes
MSRs L1.  If KVM is also running in L1 then it's highly likely L1 is
also exposing the MSRs to L2, i.e. KVM doesn't need to intercept L2
accesses.

Based on code from Jintack Lim.

Cc: Jintack Lim <jintack@xxxxxxxxxxxxxxx>
Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15 22:53:44 +02:00
Linus Torvalds fa4bff1650 Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MDS mitigations from Thomas Gleixner:
 "Microarchitectural Data Sampling (MDS) is a hardware vulnerability
  which allows unprivileged speculative access to data which is
  available in various CPU internal buffers. This new set of misfeatures
  has the following CVEs assigned:

     CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
     CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
     CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
     CVE-2019-11091  MDSUM  Microarchitectural Data Sampling Uncacheable Memory

  MDS attacks target microarchitectural buffers which speculatively
  forward data under certain conditions. Disclosure gadgets can expose
  this data via cache side channels.

  Contrary to other speculation based vulnerabilities the MDS
  vulnerability does not allow the attacker to control the memory target
  address. As a consequence the attacks are purely sampling based, but
  as demonstrated with the TLBleed attack samples can be postprocessed
  successfully.

  The mitigation is to flush the microarchitectural buffers on return to
  user space and before entering a VM. It's bolted on the VERW
  instruction and requires a microcode update. As some of the attacks
  exploit data structures shared between hyperthreads, full protection
  requires to disable hyperthreading. The kernel does not do that by
  default to avoid breaking unattended updates.

  The mitigation set comes with documentation for administrators and a
  deeper technical view"

* 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/speculation/mds: Fix documentation typo
  Documentation: Correct the possible MDS sysfs values
  x86/mds: Add MDSUM variant to the MDS documentation
  x86/speculation/mds: Add 'mitigations=' support for MDS
  x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
  x86/speculation/mds: Fix comment
  x86/speculation/mds: Add SMT warning message
  x86/speculation: Move arch_smt_update() call to after mitigation decisions
  x86/speculation/mds: Add mds=full,nosmt cmdline option
  Documentation: Add MDS vulnerability documentation
  Documentation: Move L1TF to separate directory
  x86/speculation/mds: Add mitigation mode VMWERV
  x86/speculation/mds: Add sysfs reporting for MDS
  x86/speculation/mds: Add mitigation control for MDS
  x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  x86/kvm/vmx: Add MDS protection when L1D Flush is not active
  x86/speculation/mds: Clear CPU buffers on exit to user
  x86/speculation/mds: Add mds_clear_cpu_buffers()
  x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
  x86/speculation/mds: Add BUG_MSBDS_ONLY
  ...
2019-05-14 07:57:29 -07:00
Aaron Lewis 9b5db6c762 kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
nested_run_pending=1 implies we have successfully entered guest mode.
Move setting from external state in vmx_set_nested_state() until after
all other checks are complete.

Based on a patch by Aaron Lewis.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-08 14:14:08 +02:00
Aaron Lewis 332d079735 KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
Move call to nested_enable_evmcs until after free_nested() is complete.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-08 14:12:08 +02:00
Linus Torvalds 8ff468c29e Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU state handling updates from Borislav Petkov:
 "This contains work started by Rik van Riel and brought to fruition by
  Sebastian Andrzej Siewior with the main goal to optimize when to load
  FPU registers: only when returning to userspace and not on every
  context switch (while the task remains in the kernel).

  In addition, this optimization makes kernel_fpu_begin() cheaper by
  requiring registers saving only on the first invocation and skipping
  that in following ones.

  What is more, this series cleans up and streamlines many aspects of
  the already complex FPU code, hopefully making it more palatable for
  future improvements and simplifications.

  Finally, there's a __user annotations fix from Jann Horn"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails
  x86/pkeys: Add PKRU value to init_fpstate
  x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath
  x86/fpu: Add a fastpath to copy_fpstate_to_sigframe()
  x86/fpu: Add a fastpath to __fpu__restore_sig()
  x86/fpu: Defer FPU state load until return to userspace
  x86/fpu: Merge the two code paths in __fpu__restore_sig()
  x86/fpu: Restore from kernel memory on the 64-bit path too
  x86/fpu: Inline copy_user_to_fpregs_zeroing()
  x86/fpu: Update xstate's PKRU value on write_pkru()
  x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
  x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
  x86/entry: Add TIF_NEED_FPU_LOAD
  x86/fpu: Eager switch PKRU state
  x86/pkeys: Don't check if PKRU is zero before writing it
  x86/fpu: Only write PKRU if it is different from current
  x86/pkeys: Provide *pkru() helpers
  x86/fpu: Use a feature number instead of mask in two more helpers
  x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask
  x86/fpu: Add an __fpregs_load_activate() internal helper
  ...
2019-05-07 10:24:10 -07:00
Jim Mattson e8ab8d24b4 KVM: nVMX: Fix size checks in vmx_set_nested_state
The size checks in vmx_nested_state are wrong because the calculations
are made based on the size of a pointer to a struct kvm_nested_state
rather than the size of a struct kvm_nested_state.

Reported-by: Felix Wilhelm  <fwilhelm@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Drew Schmitt <dasch@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Fixes: 8fcc4b5923
Cc: stable@ver.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-01 00:43:44 +02:00
Paolo Bonzini e9c16c7850 KVM: x86: use direct accessors for RIP and RSP
Use specific inline functions for RIP and RSP instead of
going through kvm_register_read and kvm_register_write,
which are quite a mouthful.  kvm_rsp_read and kvm_rsp_write
did not exist, so add them.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 22:07:26 +02:00
Sean Christopherson 2b3eaf815c KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
... now that there is no overhead when using dedicated accessors.

Opportunistically remove a bogus "FIXME" in handle_rdmsr() regarding
the upper 32 bits of RAX and RDX.  Zeroing the upper 32 bits is
architecturally correct as 32-bit writes in 64-bit mode unconditionally
clear the upper 32 bits.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:56:27 +02:00
Sean Christopherson de3cd117ed KVM: x86: Omit caching logic for always-available GPRs
Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always
treated "available and dirtly" on both VMX and SVM, i.e. are
unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit.

Eliminating the unnecessary caching code reduces the size of KVM by a
non-trivial amount, much of which comes from the most common code paths.
E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
dropping by ~1000 bytes.  With CONFIG_RETPOLINE=y, the numbers are even
more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:56:12 +02:00
KarimAllah Ahmed e0bf2665ca KVM/nVMX: Use page_address_valid in a few more locations
Use page_address_valid in a few more locations that is already checking for
a page aligned address that does not cross the maximum physical address.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:44 +02:00
KarimAllah Ahmed dee9c04931 KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCS
Use kvm_vcpu_map for accessing the enlightened VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:40 +02:00
KarimAllah Ahmed 8892530598 KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
Use kvm_vcpu_map for accessing the shadow VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:37 +02:00
KarimAllah Ahmed 3278e04925 KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
that has a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime of the pinning of the
interrupt descriptor table page on the host side.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:45:21 +02:00
KarimAllah Ahmed 96c66e87de KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime of the pinning of the
virtual APIC page on the host side.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:36:43 +02:00
KarimAllah Ahmed 31f0b6c4ba KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:34:34 +02:00
KarimAllah Ahmed b146b83928 X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from guest memory
Use kvm_vcpu_map to the map the VMCS12 from guest memory because
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:59 +02:00
KarimAllah Ahmed 3d5f6beb74 X86/nVMX: Update the PML table without mapping and unmapping the page
Update the PML table without mapping and unmapping the page. This also
avoids using kvm_vcpu_gpa_to_page(..) which assumes that there is a "struct
page" for guest memory.

As a side-effect of using kvm_write_guest_page the page is also properly
marked as dirty.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:38 +02:00
KarimAllah Ahmed 2e408936b6 X86/nVMX: handle_vmon: Read 4 bytes from guest memory
Read the data directly from guest memory instead of the map->read->unmap
sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which
assumes that there is a "struct page" for guest memory.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:28 +02:00
Sean Christopherson 19e38336d7 KVM: VMX: Include architectural defs header in capabilities.h
The capabilities header depends on asm/vmx.h but doesn't explicitly
include said file.  This currently doesn't cause problems as all users
of capbilities.h first include asm/vmx.h, but the issue often results in
build errors if someone starts moving things around the VMX files.

Fixes: 3077c19108 ("KVM: VMX: Move capabilities structs and helpers to dedicated file")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:21 +02:00
Dan Carpenter d6a85c3223 KVM: vmx: clean up some debug output
Smatch complains about this:

    arch/x86/kvm/vmx/vmx.c:5730 dump_vmcs()
    warn: KERN_* level not at start of string

The code should be using pr_cont() instead of pr_err().

Fixes: 9d609649bb ("KVM: vmx: print more APICv fields in dump_vmcs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:20 +02:00
Sean Christopherson 0967fa1cd3 KVM: VMX: Skip delta_tsc shift-and-divide if the dividend is zero
Ten percent of nothin' is... let me do the math here.  Nothin' into
nothin', carry the nothin'...

Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:18 +02:00
Sean Christopherson f99279825e KVM: lapic: Refactor ->set_hv_timer to use an explicit expired param
Refactor kvm_x86_ops->set_hv_timer to use an explicit parameter for
stating that the timer has expired.  Overloading the return value is
unnecessarily clever, e.g. can lead to confusion over the proper return
value from start_hv_timer() when r==1.

Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:16 +02:00
Liran Alon 6c6a2ab962 KVM: VMX: Nop emulation of MSR_IA32_POWER_CTL
Since commits 668fffa3f8 ("kvm: better MWAIT emulation for guestsâ€)
and 4d5422cea3 ("KVM: X86: Provide a capability to disable MWAIT interceptsâ€),
KVM was modified to allow an admin to configure certain guests to execute
MONITOR/MWAIT inside guest without being intercepted by host.

This is useful in case admin wishes to allocate a dedicated logical
processor for each vCPU thread. Thus, making it safe for guest to
completely control the power-state of the logical processor.

The ability to use this new KVM capability was introduced to QEMU by
commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|offâ€) and
2266d4431132 ("i386/cpu: make -cpu host support monitor/mwaitâ€).

However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle
kernel module to execute c1e_promotion_disable() which will attempt to
RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable"
bit. This behaviour was introduced by commit
32e9518005 ("intel_idle: export both C1 and C1Eâ€).

Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0
will cause the above guest behaviour to raise a #GP which will cause
guest to kernel panic.

Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to
avoid #GP in guest in this scenario.

Future commits can optimise emulation further by reflecting guest
MSR changes to host MSR to provide guest with the ability to
fine-tune the dedicated logical processor power-state.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:14 +02:00
Luwei Kang c715eb9fe9 KVM: x86: Add support of clear Trace_ToPA_PMI status
Let guests clear the Intel PT ToPA PMI status (bit 55 of
MSR_CORE_PERF_GLOBAL_OVF_CTRL).

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:14 +02:00
Rick Edgecombe f2fde6a5bc KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit
The not-so-recent change to move VMX's VM-Exit handing to a dedicated
"function" unintentionally exposed KVM to a speculative attack from the
guest by executing a RET prior to stuffing the RSB.  Make RSB stuffing
happen immediately after VM-Exit, before any unpaired returns.

Alternatively, the VM-Exit path could postpone full RSB stuffing until
its current location by stuffing the RSB only as needed, or by avoiding
returns in the VM-Exit path entirely, but both alternatives are beyond
ugly since vmx_vmexit() has multiple indirect callers (by way of
vmx_vmenter()).  And putting the RSB stuffing immediately after VM-Exit
makes it much less likely to be re-broken in the future.

Note, the cost of PUSH/POP could be avoided in the normal flow by
pairing the PUSH RAX with the POP RAX in __vmx_vcpu_run() and adding an
a POP to nested_vmx_check_vmentry_hw(), but such a weird/subtle
dependency is likely to cause problems in the long run, and PUSH/POP
will take all of a few cycles, which is peanuts compared to the number
of cycles required to fill the RSB.

Fixes: 453eafbe65 ("KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines")
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-27 09:48:52 +02:00
Sean Christopherson 39497d7660 KVM: lapic: Track lapic timer advance per vCPU
Automatically adjusting the globally-shared timer advancement could
corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
the advancement value.  That could be partially fixed by using a local
variable for the arithmetic, but it would still be susceptible to a
race when setting timer_advance_adjust_done.

And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
correct calibration for a given vCPU may not apply to all vCPUs.

Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
effectively violated when finding a stable advancement takes an extended
amount of timer.

Opportunistically change the definition of lapic_timer_advance_ns to
a u32 so that it matches the style of struct kvm_timer.  Explicitly
pass the param to kvm_create_lapic() so that it doesn't have to be
exposed to lapic.c, thus reducing the probability of unintentionally
using the global value instead of the per-vCPU value.

Cc: Liran Alon <liran.alon@oracle.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 3b8a5df6c4 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-18 18:55:41 +02:00
Sean Christopherson c80add0f48 KVM: nVMX: Return -EINVAL when signaling failure in VM-Entry helpers
Most, but not all, helpers that are related to emulating consistency
checks for nested VM-Entry return -EINVAL when a check fails.  Convert
the holdouts to have consistency throughout and to make it clear that
the functions are signaling pass/fail as opposed to "resume guest" vs.
"exit to userspace".

Opportunistically fix bad indentation in nested_vmx_check_guest_state().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:06 +02:00
Paolo Bonzini 98d9e858fa KVM: nVMX: Return -EINVAL when signaling failure in pre-VM-Entry helpers
Convert all top-level nested VM-Enter consistency check functions to
return 0/-EINVAL instead of failure codes, since now they can only
ever return one failure code.

This also does not give the false impression that failure information is
always consumed and/or relevant, e.g. vmx_set_nested_state() only
cares whether or not the checks were successful.

nested_check_host_control_regs() can also now be inlined into its caller,
nested_vmx_check_host_state, since the two have effectively become the
same function.

Based on a patch by Sean Christopherson.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:05 +02:00
Sean Christopherson 5478ba349f KVM: nVMX: Rename and split top-level consistency checks to match SDM
Rename the top-level consistency check functions to (loosely) align with
the SDM.  Historically, KVM has used the terms "prereq" and "postreq" to
differentiate between consistency checks that lead to VM-Fail and those
that lead to VM-Exit.  The terms are vague and potentially misleading,
e.g. "postreq" might be interpreted as occurring after VM-Entry.

Note, while the SDM lumps controls and host state into a single section,
"Checks on VMX Controls and Host-State Area", split them into separate
top-level functions as the two categories of checks result in different
VM instruction errors.  This split will allow for additional cleanup.

Note #2, "vmentry" is intentionally dropped from the new function names
to avoid confusion with nested_check_vm_entry_controls(), and to keep
the length of the functions names somewhat manageable.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:05 +02:00
Sean Christopherson 9c3e922ba3 KVM: nVMX: Move guest non-reg state checks to VM-Exit path
Per Intel's SDM, volume 3, section Checking and Loading Guest State:

  Because the checking and the loading occur concurrently, a failure may
  be discovered only after some state has been loaded. For this reason,
  the logical processor responds to such failures by loading state from
  the host-state area, as it would for a VM exit.

In other words, a failed non-register state consistency check results in
a VM-Exit, not VM-Fail.  Moving the non-reg state checks also paves the
way for renaming nested_vmx_check_vmentry_postreqs() to align with the
SDM, i.e. nested_vmx_check_vmentry_guest_state().

Fixes: 26539bd0e4 ("KVM: nVMX: check vmcs12 for valid activity state")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:04 +02:00
Krish Sadhukhan de2bc2bfdf kvm: nVMX: Check "load IA32_PAT" VM-entry control on vmentry
According to section "Checking and Loading Guest State" in Intel SDM vol
3C, the following check is performed on vmentry:

    If the "load IA32_PAT" VM-entry control is 1, the value of the field
    for the IA32_PAT MSR must be one that could be written by WRMSR
    without fault at CPL 0. Specifically, each of the 8 bytes in the
    field must have one of the values 0 (UC), 1 (WC), 4 (WT), 5 (WP),
    6 (WB), or 7 (UC-).

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:03 +02:00
Krish Sadhukhan f6b0db1fda kvm: nVMX: Check "load IA32_PAT" VM-exit control on vmentry
According to section "Checks on Host Control Registers and MSRs" in Intel
SDM vol 3C, the following check is performed on vmentry:

    If the "load IA32_PAT" VM-exit control is 1, the value of the field
    for the IA32_PAT MSR must be one that could be written by WRMSR
    without fault at CPL 0. Specifically, each of the 8 bytes in the
    field must have one of the values 0 (UC), 1 (WC), 4 (WT), 5 (WP),
    6 (WB), or 7 (UC-).

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:02 +02:00
Paolo Bonzini 674ea351cd KVM: x86: optimize check for valid PAT value
This check will soon be done on every nested vmentry and vmexit,
"parallelize" it using bitwise operations.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:02 +02:00
Paolo Bonzini f16cb57be8 KVM: x86: clear VM_EXIT_SAVE_IA32_PAT
This is not needed, PAT writes always take an MSR vmexit.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:01 +02:00
Paolo Bonzini 9d609649bb KVM: vmx: print more APICv fields in dump_vmcs
The SVI, RVI, virtual-APIC page address and APIC-access page address fields
were left out of dump_vmcs.  Add them.

KERN_CONT technically isn't SMP safe, but it's okay to use it here since
the whole of dump_vmcs() is a single huge multi-line piece of output
that isn't SMP-safe.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:00 +02:00
Sean Christopherson 9ec19493fb KVM: x86: clear SMM flags before loading state while leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
loading SMSTATE into architectural state, e.g. by toggling it for
problematic flows, and simply clear HF_SMM_MASK prior to loading
architectural state (from SMRAM save state area).

Reported-by: Jon Doron <arilou@gmail.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 5bea5123cb ("KVM: VMX: check nested state and CR4.VMXE against SMM")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:36 +02:00
Sean Christopherson ed19321fb6 KVM: x86: Load SMRAM in a single shot when leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
when loading SMSTATE into architectural state, ideally RSM emulation
itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
architectural state.

Ostensibly, the only motivation for having HF_SMM_MASK set throughout
the loading of state from the SMRAM save state area is so that the
memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
all of the SMRAM save state area from guest memory at the beginning of
RSM emulation, and load state from the buffer instead of reading guest
memory one-by-one.

This paves the way for clearing HF_SMM_MASK prior to loading state,
and also aligns RSM with the enter_smm() behavior, which fills a
buffer and writes SMRAM save state in a single go.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:35 +02:00
Liran Alon e51bfdb687 KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU
Issue was discovered when running kvm-unit-tests on KVM running as L1 on
top of Hyper-V.

When vmx_instruction_intercept unit-test attempts to run RDPMC to test
RDPMC-exiting, it is intercepted by L1 KVM which it's EXIT_REASON_RDPMC
handler raise #GP because vCPU exposed by Hyper-V doesn't support PMU.
Instead of unit-test expectation to be reflected with EXIT_REASON_RDPMC.

The reason vmx_instruction_intercept unit-test attempts to run RDPMC
even though Hyper-V doesn't support PMU is because L1 expose to L2
support for RDPMC-exiting. Which is reasonable to assume that is
supported only in case CPU supports PMU to being with.

Above issue can easily be simulated by modifying
vmx_instruction_intercept config in x86/unittests.cfg to run QEMU with
"-cpu host,+vmx,-pmu" and run unit-test.

To handle issue, change KVM to expose RDPMC-exiting only when guest
supports PMU.

Reported-by: Saar Amar <saaramar@microsoft.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:34 +02:00
WANG Chao 1811d979c7 x86/kvm: move kvm_load/put_guest_xcr0 into atomic context
guest xcr0 could leak into host when MCE happens in guest mode. Because
do_machine_check() could schedule out at a few places.

For example:

kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu) {
  vmx_vcpu_run
    vmx_complete_atomic_exit
      kvm_machine_check
        do_machine_check
          do_memory_failure
            memory_failure
              lock_page

In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).

In __switch_to {
     switch_fpu_finish
       copy_kernel_to_fpregs
         XRSTORS

If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.

Cc: stable@vger.kernel.org
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:33 +02:00
Paolo Bonzini 2b27924bb1 KVM: nVMX: always use early vmcs check when EPT is disabled
The remaining failures of vmx.flat when EPT is disabled are caused by
incorrectly reflecting VMfails to the L1 hypervisor.  What happens is
that nested_vmx_restore_host_state corrupts the guest CR3, reloading it
with the host's shadow CR3 instead, because it blindly loads GUEST_CR3
from the vmcs01.

For simplicity let's just always use hardware VMCS checks when EPT is
disabled.  This way, nested_vmx_restore_host_state is not reached at
all (or at least shouldn't be reached).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:12 +02:00
Paolo Bonzini 690908104e KVM: nVMX: allow tests to use bad virtual-APIC page address
As mentioned in the comment, there are some special cases where we can simply
clear the TPR shadow bit from the CPU-based execution controls in the vmcs02.
Handle them so that we can remove some XFAILs from vmx.flat.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 10:59:07 +02:00
Sebastian Andrzej Siewior c806e88734 x86/pkeys: Provide *pkru() helpers
Dave Hansen asked for __read_pkru() and __write_pkru() to be
symmetrical.

As part of the series __write_pkru() will read back the value and only
write it if it is different.

In order to make both functions symmetrical, move the function
containing only the opcode asm into a function called like the
instruction itself.

__write_pkru() will just invoke wrpkru() but in a follow-up patch will
also read back the value.

 [ bp: Convert asm opcode wrapper names to rd/wrpkru(). ]

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: kvm ML <kvm@vger.kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190403164156.19645-13-bigeasy@linutronix.de
2019-04-11 15:40:58 +02:00
Marc Orr c73f4c998e KVM: x86: nVMX: fix x2APIC VTPR read intercept
Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the
SDM, when "virtualize x2APIC mode" is 1 and "APIC-register
virtualization" is 0, a RDMSR of 808H should return the VTPR from the
virtual APIC page.

However, for nested, KVM currently fails to disable the read intercept
for this MSR. This means that a RDMSR exit takes precedence over
"virtualize x2APIC mode", and KVM passes through L1's TPR to L2,
instead of sourcing the value from L2's virtual APIC page.

This patch fixes the issue by disabling the read intercept, in VMCS02,
for the VTPR when "APIC-register virtualization" is 0.

The issue described above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC
mode w/ nested".

Signed-off-by: Marc Orr <marcorr@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Fixes: c992384bde ("KVM: vmx: speed up MSR bitmap merge")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05 21:08:30 +02:00
Marc Orr acff78477b KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)
The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the
x2APIC MSR intercepts with the "virtualize x2APIC mode" MSR. As a
result, we discovered the potential for a buggy or malicious L1 to get
access to L0's x2APIC MSRs, via an L2, as follows.

1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl
variable, in nested_vmx_prepare_msr_bitmap() to become true.
2. L1 disables "virtualize x2APIC mode" in VMCS12.
3. L1 enables "APIC-register virtualization" in VMCS12.

Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then
set "virtualize x2APIC mode" to 0 in VMCS02. Oops.

This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR
intercepts with VMCS12's "virtualize x2APIC mode" control.

The scenario outlined above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Add leak scenario to
virt_x2apic_mode_test".

Note, it looks like this issue may have been introduced inadvertently
during a merge---see 15303ba5d1.

Signed-off-by: Marc Orr <marcorr@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-05 21:08:22 +02:00
Sean Christopherson 0cf9135b77 KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
regardless of hardware support under the pretense that KVM fully
emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).

Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
that it's emulated on AMD hosts.

Fixes: 1eaafe91a0 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
Cc: stable@vger.kernel.org
Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28 17:29:00 +01:00
Krish Sadhukhan 711eff3a8f kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check is performed on vmentry of L2 guests:

    On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
    field and the IA32_SYSENTER_EIP field must each contain a canonical
    address.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28 17:27:18 +01:00
Singh, Brijesh 05d5a48635 KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
Errata#1096:

On a nested data page fault when CR.SMAP=1 and the guest data read
generates a SMAP violation, GuestInstrBytes field of the VMCB on a
VMEXIT will incorrectly return 0h instead the correct guest
instruction bytes .

Recommend Workaround:

To determine what instruction the guest was executing the hypervisor
will have to decode the instruction at the instruction pointer.

The recommended workaround can not be implemented for the SEV
guest because guest memory is encrypted with the guest specific key,
and instruction decoder will not be able to decode the instruction
bytes. If we hit this errata in the SEV guest then log the message
and request a guest shutdown.

Reported-by: Venkatesh Srinivas <venkateshs@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-28 17:27:17 +01:00
Linus Torvalds 636deed6c0 ARM: some cleanups, direct physical timer assignment, cache sanitization
for 32-bit guests
 
 s390: interrupt cleanup, introduction of the Guest Information Block,
 preparation for processor subfunctions in cpu models
 
 PPC: bug fixes and improvements, especially related to machine checks
 and protection keys
 
 x86: many, many cleanups, including removing a bunch of MMU code for
 unnecessary optimizations; plus AVIC fixes.
 
 Generic: memcg accounting
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJci+7XAAoJEL/70l94x66DUMkIAKvEefhceySHYiTpfefjLjIC
 16RewgHa+9CO4Oo5iXiWd90fKxtXLXmxDQOS4VGzN0rxvLGRw/fyXIxL1MDOkaAO
 l8SLSNuewY4XBUgISL3PMz123r18DAGOuy9mEcYU/IMesYD2F+wy5lJ17HIGq6X2
 RpoF1p3qO1jfkPTKOob6Ixd4H5beJNPKpdth7LY3PJaVhDxgouj32fxnLnATVSnN
 gENQ10fnt8BCjshRYW6Z2/9bF15JCkUFR1xdBW2/xh1oj+kvPqqqk2bEN1eVQzUy
 2hT/XkwtpthqjSbX8NNavWRSFnOnbMLTRKQyIXmFVsM5VoSrwtiGsCFzBgcT++I=
 =XIzU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - some cleanups
   - direct physical timer assignment
   - cache sanitization for 32-bit guests

  s390:
   - interrupt cleanup
   - introduction of the Guest Information Block
   - preparation for processor subfunctions in cpu models

  PPC:
   - bug fixes and improvements, especially related to machine checks
     and protection keys

  x86:
   - many, many cleanups, including removing a bunch of MMU code for
     unnecessary optimizations
   - AVIC fixes

  Generic:
   - memcg accounting"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
  kvm: vmx: fix formatting of a comment
  KVM: doc: Document the life cycle of a VM and its resources
  MAINTAINERS: Add KVM selftests to existing KVM entry
  Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
  KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
  KVM: PPC: Fix compilation when KVM is not enabled
  KVM: Minor cleanups for kvm_main.c
  KVM: s390: add debug logging for cpu model subfunctions
  KVM: s390: implement subfunction processor calls
  arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
  KVM: arm/arm64: Remove unused timer variable
  KVM: PPC: Book3S: Improve KVM reference counting
  KVM: PPC: Book3S HV: Fix build failure without IOMMU support
  Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
  x86: kvmguest: use TSC clocksource if invariant TSC is exposed
  KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
  KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
  KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  ...
2019-03-15 15:00:28 -07:00
Paolo Bonzini 4a605bc08e kvm: vmx: fix formatting of a comment
Eliminate a gratuitous conflict with 5.0.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-03-15 19:24:34 +01:00
Thomas Gleixner 65fd4cb65b Documentation: Move L1TF to separate directory
Move L!TF to a separate directory so the MDS stuff can be added at the
side. Otherwise the all hardware vulnerabilites have their own top level
entry. Should have done that right away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jon Masters <jcm@redhat.com>
2019-03-06 21:52:15 +01:00
Thomas Gleixner 650b68a062 x86/kvm/vmx: Add MDS protection when L1D Flush is not active
CPUs which are affected by L1TF and MDS mitigate MDS with the L1D Flush on
VMENTER when updated microcode is installed.

If a CPU is not affected by L1TF or if the L1D Flush is not in use, then
MDS mitigation needs to be invoked explicitly.

For these cases, follow the host mitigation state and invoke the MDS
mitigation before VMENTER.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jon Masters <jcm@redhat.com>
Tested-by: Jon Masters <jcm@redhat.com>
2019-03-06 21:52:13 +01:00
Ben Gardon 4183683918 kvm: vmx: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Paolo Bonzini 359a6c3ddc KVM: nVMX: do not start the preemption timer hrtimer unnecessarily
The preemption timer can be started even if there is a vmentry
failure during or after loading guest state.  That is pointless,
move the call after all conditions have been checked.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:29 +01:00
Yu Zhang d92935979a kvm: vmx: Fix typos in vmentry/vmexit control setting
Previously, 'commit f99e3daf94 ("KVM: x86: Add Intel PT
virtualization work mode")' work mode' offered framework
to support Intel PT virtualization. However, the patch has
some typos in vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), e.g.
used wrong flags and wrong variable, which will cause the
VM entry failure later.

Fixes: 'commit f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")'
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:28 +01:00
Paolo Bonzini b4b65b5642 KVM: x86: cleanup freeing of nested state
Ensure that the VCPU free path goes through vmx_leave_nested and
thus nested_vmx_vmexit, so that the cancellation of the timer does
not have to be in free_nested.  In addition, because some paths through
nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
the timer is moved there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Luwei Kang 81b016676e KVM: x86: Sync the pending Posted-Interrupts
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.

The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state
but not running on physical CPU). If a posted interrupt coming at this
time, the irq remmaping facility will set the bit of PIR (Posted
Interrupt Requests) without ON (Outstanding Notification).
So this interrupt can't be sync to APIC virtualization register and
will not be handled by Guest because ON is zero.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Eliminate the pi_clear_sn fast path. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Paolo Bonzini e0dfacbfe9 KVM: nVMX: remove useless is_protmode check
VMX is only accessible in protected mode, remove a confusing check
that causes the conditional to lack a final "else" branch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:24 +01:00
Sean Christopherson 34333cc6c2 KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
Regarding segments with a limit==0xffffffff, the SDM officially states:

    When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
    or may not cause the indicated exceptions.  Behavior is
    implementation-specific and may vary from one execution to another.

In practice, all CPUs that support VMX ignore limit checks for "flat
segments", i.e. an expand-up data or code segment with base=0 and
limit=0xffffffff.  This is subtly different than wrapping the effective
address calculation based on the address size, as the flat segment
behavior also applies to accesses that would wrap the 4g boundary, e.g.
a 4-byte access starting at 0xffffffff will access linear addresses
0xffffffff, 0x0, 0x1 and 0x2.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:24 +01:00
Sean Christopherson 8570f9e881 KVM: nVMX: Apply addr size mask to effective address for VMX instructions
The address size of an instruction affects the effective address, not
the virtual/linear address.  The final address may still be truncated,
e.g. to 32-bits outside of long mode, but that happens irrespective of
the address size, e.g. a 32-bit address size can yield a 64-bit virtual
address when using FS/GS with a non-zero base.

Fixes: 064aea7747 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:23 +01:00
Sean Christopherson 946c522b60 KVM: nVMX: Sign extend displacements of VMX instr's mem operands
The VMCS.EXIT_QUALIFCATION field reports the displacements of memory
operands for various instructions, including VMX instructions, as a
naturally sized unsigned value, but masks the value by the addr size,
e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
wording regarding sign extension, the SDM explicitly states that bits
beyond the instructions address size are undefined:

    In all cases, bits of this field beyond the instruction’s address
    size are undefined.

Failure to sign extend the displacement results in KVM incorrectly
treating a negative displacement as a large positive displacement when
the address size of the VMX instruction is smaller than KVM's native
size, e.g. a 32-bit address size on a 64-bit KVM.

The very original decoding, added by commit 064aea7747 ("KVM: nVMX:
Decoding memory operands of VMX instructions"), sort of modeled sign
extension by truncating the final virtual/linear address for a 32-bit
address size.  I.e. it messed up the effective address but made it work
by adjusting the final address.

When segmentation checks were added, the truncation logic was kept
as-is and no sign extension logic was introduced.  In other words, it
kept calculating the wrong effective address while mostly generating
the correct virtual/linear address.  As the effective address is what's
used in the segment limit checks, this results in KVM incorreclty
injecting #GP/#SS faults due to non-existent segment violations when
a nested VMM uses negative displacements with an address size smaller
than KVM's native address size.

Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
KVM using 0x100000fd8 as the effective address when checking for a
segment limit violation.  This causes a 100% failure rate when running
a 32-bit KVM build as L1 on top of a 64-bit KVM L0.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:22 +01:00
Sean Christopherson 4f44c4eec5 KVM: VMX: Reorder clearing of registers in the vCPU-run assembly flow
Move the clearing of the common registers (not 64-bit-only) to the start
of the flow that clears registers holding guest state.  This is
purely a cosmetic change so that the label doesn't point at a blank line
and a #define.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:18 +01:00
Sean Christopherson fc2ba5a27a KVM: VMX: Call vCPU-run asm sub-routine from C and remove clobbering
...now that the sub-routine follows standard calling conventions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:18 +01:00
Sean Christopherson 3b895ef486 KVM: VMX: Preserve callee-save registers in vCPU-run asm sub-routine
...to make it callable from C code.

Note that because KVM chooses to be ultra paranoid about guest register
values, all callee-save registers are still cleared after VM-Exit even
though the host's values are now reloaded from the stack.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:17 +01:00
Sean Christopherson e75c3c3a04 KVM: VMX: Return VM-Fail from vCPU-run assembly via standard ABI reg
...to prepare for making the assembly sub-routine callable from C code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:17 +01:00
Sean Christopherson 77df549559 KVM: VMX: Pass @launched to the vCPU-run asm via standard ABI regs
...to prepare for making the sub-routine callable from C code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:16 +01:00
Sean Christopherson a62fd5a76c KVM: VMX: Use RAX as the scratch register during vCPU-run
...to prepare for making the sub-routine callable from C code.  That
means returning the result in RAX.  Since RAX will be used to return the
result, use it as the scratch register as well to make the code readable
and to document that the scratch register is more or less arbitrary.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:15 +01:00
Sean Christopherson ee2fc635ef KVM: VMX: Rename ____vmx_vcpu_run() to __vmx_vcpu_run()
...now that the name is no longer usurped by a defunct helper function.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:15 +01:00
Sean Christopherson c823dd5c0f KVM: VMX: Fold __vmx_vcpu_run() back into vmx_vcpu_run()
...now that the code is no longer tagged with STACK_FRAME_NON_STANDARD.
Arguably, providing __vmx_vcpu_run() to break up vmx_vcpu_run() is
valuable on its own, but the previous split was purposely made as small
as possible to limit the effects STACK_FRAME_NON_STANDARD.  In other
words, the current split is now completely arbitrary and likely not the
most logical.

This also allows renaming ____vmx_vcpu_run() to __vmx_vcpu_run() in a
future patch.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:14 +01:00
Sean Christopherson 5e0781df18 KVM: VMX: Move vCPU-run code to a proper assembly routine
As evidenced by the myriad patches leading up to this moment, using
an inline asm blob for vCPU-run is nothing short of horrific.  It's also
been called "unholy", "an abomination" and likely a whole host of other
names that would violate the Code of Conduct if recorded here and now.

The code is relocated nearly verbatim, e.g. quotes, newlines, tabs and
__stringify need to be dropped, but other than those cosmetic changes
the only functional changees are to add the "call" and replace the final
"jmp" with a "ret".

Note that STACK_FRAME_NON_STANDARD is also dropped from __vmx_vcpu_run().

Suggested-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:13 +01:00
Sean Christopherson 63c73aa07f KVM: VMX: Create a stack frame in vCPU-run
...in preparation for moving to a proper assembly sub-routnine.
vCPU-run isn't a leaf function since it calls vmx_update_host_rsp()
and vmx_vmenter().  And since we need to save/restore RBP anyways,
unconditionally creating the frame costs a single MOV, i.e. don't
bother keying off CONFIG_FRAME_POINTER or using FRAME_BEGIN, etc...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:13 +01:00
Sean Christopherson c14f9dd50b KVM: VMX: Use #defines in place of immediates in VM-Enter inline asm
...to prepare for moving the inline asm to a proper asm sub-routine.
Eliminating the immediates allows a nearly verbatim move, e.g. quotes,
newlines, tabs and __stringify need to be dropped, but other than those
cosmetic changes the only function change will be to replace the final
"jmp" with a "ret".

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:12 +01:00
Xiaoyao Li 98ae70cc47 kvm: vmx: Fix entry number check for add_atomic_switch_msr()
Commit ca83b4a7f2 ("x86/KVM/VMX: Add find_msr() helper function")
introduces the helper function find_msr(), which returns -ENOENT when
not find the msr in vmx->msr_autoload.guest/host. Correct checking contion
of no more available entry in vmx->msr_autoload.

Fixes: ca83b4a7f2 ("x86/KVM/VMX: Add find_msr() helper function")
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-14 16:22:20 +01:00
Luwei Kang c112b5f502 KVM: x86: Recompute PID.ON when clearing PID.SN
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.

The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but
not running on physical CPU). If a posted interrupt comes at this time,
the irq remapping facility will set the bit of PIR (Posted Interrupt
Requests) but not ON (Outstanding Notification).  Then, the interrupt
will not be seen by KVM, which always expects PID.ON=1 if PID.PIR=1
as documented in the Intel processor SDM but not in the VT-d specification.
To fix this, restore the invariant after PID.SN is cleared.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-14 16:20:31 +01:00
Sean Christopherson bc44121190 KVM: nVMX: Restore a preemption timer consistency check
A recently added preemption timer consistency check was unintentionally
dropped when the consistency checks were being reorganized to match the
SDM's ordering.

Fixes: 461b4ba4c7 ("KVM: nVMX: Move the checks for VM-Execution Control Fields to a separate helper function")
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-13 19:38:25 +01:00
Vitaly Kuznetsov 6b1971c694 x86/kvm/nVMX: read from MSR_IA32_VMX_PROCBASED_CTLS2 only when it is available
SDM says MSR_IA32_VMX_PROCBASED_CTLS2 is only available "If
(CPUID.01H:ECX.[5] && IA32_VMX_PROCBASED_CTLS[63])". It was found that
some old cpus (namely "Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (family: 0x6,
model: 0xf, stepping: 0x6") don't have it. Add the missing check.

Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 15:16:01 +01:00
Sean Christopherson d558920491 KVM: VMX: Use vcpu->arch.regs directly when saving/loading guest state
...now that all other references to struct vcpu_vmx have been removed.

Note that 'vmx' still needs to be passed into the asm blob in _ASM_ARG1
as it is consumed by vmx_update_host_rsp().  And similar to that code,
use _ASM_ARG2 in the assembly code to prepare for moving to proper asm,
while explicitly referencing the exact registers in the clobber list for
clarity in the short term and to avoid additional precompiler games.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:26 +01:00
Sean Christopherson f78d0971b7 KVM: VMX: Don't save guest registers after VM-Fail
A failed VM-Enter (obviously) didn't succeed, meaning the CPU never
executed an instrunction in guest mode and so can't have changed the
general purpose registers.

In addition to saving some instructions in the VM-Fail case, this also
provides a separate path entirely and thus an opportunity to propagate
the fail condition to vmx->fail via register without introducing undue
pain.  Using a register, as opposed to directly referencing vmx->fail,
eliminates the need to pass the offset of 'fail', which will simplify
moving the code to proper assembly in future patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:26 +01:00