Commit graph

677861 commits

Author SHA1 Message Date
QingFeng Hao 4d62fcc0b6 KVM: s390: Inject machine check into the guest
If the exit flag of SIE indicates that a machine check has happened
during guest's running and needs to be injected, inject it to the guest
accordingly.
But some machine checks, e.g. Channel Report Pending (CRW), refer to
host conditions only (the guest's channel devices are not managed by
the kernel directly) and are therefore not injected into the guest.
External Damage (ED) is also not reinjected into the guest because ETR
conditions are gone in Linux and STP conditions are not enabled in the
guest, and ED contains only these 8 ETR and STP conditions.
In general, instruction-processing damage, system recovery,
storage error, service-processor damage and channel subsystem damage
will be reinjected into the guest, and the remain (System damage,
timing-facility damage, warning, ED and CRW) will be handled on the host.

Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-28 12:42:32 +02:00
Christian Borntraeger aec3b2c5f9 s390,kvm: provide plumbing for machines checks when running guests
This provides the basic plumbing for handling machine checks when
 running guests
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJZU4QPAAoJEBF7vIC1phx8GZsP/2P4nxWXBj0NS/dNq54/u7HU
 Va/zHIG7nUX81WZi8OCkPRlvb1RlcgNpIdw3Ar+BueFE6/qwVWBSdstVJCg6JSn4
 L8T1srSeV6yQEPq1/I9S8ERYtbC8bOC3dDF6g+KyaKYnICjq5yC01+86MKSVfLTI
 vFMPWY/PPCgECtXHjGpWBW6HjofRH3/H+XQbxaoTUyHKwWKdtvWer9K2V7Mc/Cf8
 XsyLY2Xq0Y5MBsJs+71Qw8+0R041Et5I3H7Od9lIc3SFYNoenQpk5oTtsujMtDG1
 ccMPZKErYI4wHE3Hy1ozK+MdFNbepUk3RBI3oXU25tpFPG3OPuksnOqCVN/iZmm+
 le9RuUi9WOOsuygPj2dsnx5v+aheedEcYWqvQ/qrNlP3pXNcpl+8waM6eke8HyCK
 1JKcqqGKBNX5wKNE9b5sRTHINWK12EVCQyVrgLlZaXoXLa40NpJPjtV27vr3ttVl
 WmGYgwMUTo15Rdr0NSJlXl8iCgIFtWMHvuRhIgp8pBuWWb28zr6aX4w++JPwOOMZ
 e4rzn55giCBDnjjDFQK2Knv5XxwnMKafYMxZXfC8gLr5ELjnI6vZDN+1zhT5L2S9
 uXd8l6rLN2qik57RzPV6YEDS0iybZnx5HF/ZPrNoFigJpdD7/0jFS5K5N0i+AhV5
 UQmGhSGnI7Teguc45mHT
 =CTzL
 -----END PGP SIGNATURE-----

Merge tag 'nmiforkvm' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kernelorgnext

s390,kvm: provide plumbing for machines checks when running guests

This provides the basic plumbing for handling machine checks when
running guests
2017-06-28 12:42:02 +02:00
QingFeng Hao da72ca4d40 KVM: s390: Backup the guest's machine check info
When a machine check happens in the guest, related mcck info (mcic,
external damage code, ...) is stored in the vcpu's lowcore on the host.
Then the machine check handler's low-level part is executed, followed
by the high-level part.

If the high-level part's execution is interrupted by a new machine check
happening on the same vcpu on the host, the mcck info in the lowcore is
overwritten with the new machine check's data.

If the high-level part's execution is scheduled to a different cpu,
the mcck info in the lowcore is uncertain.

Therefore, for both cases, the further reinjection to the guest will use
the wrong data.
Let's backup the mcck info in the lowcore to the sie page
for further reinjection, so that the right data will be used.

Add new member into struct sie_page to store related machine check's
info of mcic, failing storage address and external damage code.

Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-27 16:05:38 +02:00
QingFeng Hao c929500d7a s390/nmi: s390: New low level handling for machine check happening in guest
Add the logic to check if the machine check happens when the guest is
running. If yes, set the exit reason -EINTR in the machine check's
interrupt handler. Refactor s390_do_machine_check to avoid panicing
the host for some kinds of machine checks which happen
when guest is running.
Reinject the instruction processing damage's machine checks including
Delayed Access Exception instead of damaging the host if it happens
in the guest because it could be caused by improper update on TLB entry
or other software case and impacts the guest only.

Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-27 16:05:27 +02:00
Martin Schwidefsky 1cae025577 KVM: s390: avoid packed attribute
For naturally aligned and sized data structures avoid superfluous
packed and aligned attributes.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-22 12:41:07 +02:00
Yi Min Zhao 2c1a48f2e5 KVM: S390: add new group for flic
In some cases, userspace needs to get or set all ais states for example
migration. So we introduce a new group KVM_DEV_FLIC_AISM_ALL to provide
interfaces to get or set the adapter-interruption-suppression mode for
all ISCs. The corresponding documentation is updated.

Signed-off-by: Yi Min Zhao <zyimin@linux.vnet.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-22 12:41:07 +02:00
Christian Borntraeger 6ae1574c2a KVM: s390: implement instruction execution protection for emulated
ifetch

While currently only used to fetch the original instruction on failure
for getting the instruction length code, we should make the page table
walking code future proof.

Suggested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-22 12:41:06 +02:00
Claudio Imbrenda 4036e3874a KVM: s390: ioctls to get and set guest storage attributes
* Add the struct used in the ioctls to get and set CMMA attributes.
* Add the two functions needed to get and set the CMMA attributes for
  guest pages.
* Add the two ioctls that use the aforementioned functions.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-22 12:41:06 +02:00
Claudio Imbrenda 190df4a212 KVM: s390: CMMA tracking, ESSA emulation, migration mode
* Add a migration state bitmap to keep track of which pages have dirty
  CMMA information.
* Disable CMMA by default, so we can track if it's used or not. Enable
  it on first use like we do for storage keys (unless we are doing a
  migration).
* Creates a VM attribute to enter and leave migration mode.
* In migration mode, CMMA is disabled in the SIE block, so ESSA is
  always interpreted and emulated in software.
* Free the migration state on VM destroy.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2017-06-22 12:41:05 +02:00
Stefan Raspl 865279c53c tools/kvm_stat: display guest list in pid/guest selection screens
Display a (possibly inaccurate) list of all running guests. Note that we
leave a bit of extra room above the list for potential error messages.
Furthermore, we deliberately do not reject pids or guest names that are
not in our list, as we cannot rule out that our fuzzy approach might be
in error somehow.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:24:49 +02:00
Stefan Raspl 6667ae8f39 tools/kvm_stat: add new interactive command 'o'
Add new interactive command 'o' to toggle sorting by 'CurAvg/s' (default)
and 'Total' columns.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:24:48 +02:00
Stefan Raspl 64eefad2cd tools/kvm_stat: add new interactive command 's'
Add new command 's' to modify the update interval. Limited to a maximum of
25.5 sec and a minimum of 0.1 sec, since curses cannot handle longer
and shorter delays respectively.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:20:21 +02:00
Stefan Raspl 1fdea7b289 tools/kvm_stat: add new interactive command 'h'
Display interactive commands reference on 'h'.
While at it, sort interactive commands alphabetically in various places.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:17:59 +02:00
Stefan Raspl 38e89c37a1 tools/kvm_stat: rename 'Current' column to 'CurAvg/s'
'Current' can be misleading as it doesn't tell whether this is the amount
of events in the last interval or the current average per second.
Note that this necessitates widening the respective column by one more
character.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:17:37 +02:00
Stefan Raspl f6d753102a tools/kvm_stat: make heading look a bit more like 'top'
Print header in standout font just like the 'top' command does.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:17:28 +02:00
Stefan Raspl 5725393764 tools/kvm_stat: display message indicating lack of events
Give users some indication on the reason why no data is displayed on the
screen yet.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:17:18 +02:00
Stefan Raspl 62d1b6cc24 tools/kvm_stat: show cursor in selection screens
Show the cursor in the interactive screens to specify pid, filter or guest
name as an orientation for the user.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:17:06 +02:00
Stefan Raspl 099a2dfc67 tools/kvm_stat: move functions to corresponding classes
Quite a few of the functions are used only in a single class. Moving
functions accordingly to improve the overall structure.
Furthermore, introduce a base class for the providers, which might also
come handy for future extensions.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:16:28 +02:00
Stefan Raspl c469117df0 tools/kvm_stat: simplify initializers
Simplify a couple of initialization routines:
* TracepointProvider and DebugfsProvider: Pass pid into __init__() instead
  of switching to the requested value in an extra call after initializing
  to the default first.
* Pass a single options object into Stats.__init__(), delaying options
  evaluation accordingly, instead of evaluating options first and passing
  several parts of the options object to Stats.__init__() individually.
* Eliminate Stats.update_provider_pid(), since this 2-line function is now
  used in a single place only.
* Remove extra call to update_drilldown() in Tui.__init__() by getting the
  value of options.fields right initially when parsing options.
* Simplify get_providers() logic.
* Avoid duplicate fields initialization by handling it once in the
  providers' __init__() methods.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:15:45 +02:00
Stefan Raspl 5e3823a49c tools/kvm_stat: remove extra statement
Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:14:51 +02:00
Stefan Raspl 42a947b77b tools/kvm_stat: removed unused function
Function available_fields() is not used in any place.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:14:44 +02:00
Stefan Raspl 5a7d11f8dc tools/kvm_stat: simplify line print logic
Simplify line print logic for header and data lines in interactive mode
as previously suggested by Radim.
While at it, add a space between the first two columns to avoid the
total bleeding into the event name.
Furthermore, for column 'Current', differentiate between no events being
reported (empty 'Current' column) vs the case where events were reported
but the average was rounded down to zero ('0' in 'Current column), for
the folks who appreciate the difference.
Finally: Only skip events which were not reported at all yet, instead of
events that don't have a value in the current interval.
Considered using constants for the field widths in the format strings.
However, that would make things a bit more complicated, and considering
that there are only two places where output happens, I figured it isn't
worth the trouble.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:14:36 +02:00
Stefan Raspl 2da9d4aaa7 tools/kvm_stat: remove unnecessary header redraws
Certain interactive commands will not modify any information displayed in
the header, hence we can skip them.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:13:37 +02:00
Stefan Raspl 81468d73b6 tools/kvm_stat: fix undue use of initial sleeptime
We should not use the initial sleeptime for any key press that does not
switch to a different screen, as that introduces an unaesthetic flicker due
to two updates in quick succession.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:13:30 +02:00
Stefan Raspl 124c2fc9fd tools/kvm_stat: fix event counts display for interrupted intervals
When an update interval is interrupted via key press (e.g. space), the
'Current' column value is calculated using the full interval length
instead of the elapsed time, which leads to lower than actual numbers.
Furthermore, the value should be rounded, not truncated.
This is fixed by using the actual elapsed time for the calculation.

Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:13:23 +02:00
Stefan Raspl 773bffeeb2 tools/kvm_stat: fix typo
Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-08 18:13:16 +02:00
Jim Mattson d281e13b0b KVM: nVMX: Update vmcs12->guest_linear_address on nested VM-exit
The guest-linear address field is set for VM exits due to attempts to
execute LMSW with a memory operand and VM exits due to attempts to
execute INS or OUTS for which the relevant segment is usable,
regardless of whether or not EPT is in use.

Fixes: 119a9c01a5 ("KVM: nVMX: pass valid guest linear-address to the L1")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-07 16:36:41 +02:00
Jim Mattson d923fcf636 KVM: nVMX: Don't update vmcs12->xss_exit_bitmap on nested VM-exit
The XSS-exiting bitmap is a VMCS control field that does not change
while the CPU is in non-root mode. Transferring the unchanged value
from vmcs02 to vmcs12 is unnecessary.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-07 16:34:08 +02:00
Jim Mattson 4531662d1a kvm: vmx: Check value written to IA32_BNDCFGS
Bits 11:2 must be zero and the linear addess in bits 63:12 must be
canonical. Otherwise, WRMSR(BNDCFGS) should raise #GP.

Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-07 16:28:55 +02:00
Jim Mattson 4439af9f91 kvm: x86: Guest BNDCFGS requires guest MPX support
The BNDCFGS MSR should only be exposed to the guest if the guest
supports MPX. (cf. the TSC_AUX MSR and RDTSCP.)

Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Change-Id: I3ad7c01bda616715137ceac878f3fa7e66b6b387
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-07 16:28:15 +02:00
Jim Mattson a8b6fda38f kvm: vmx: Do not disable intercepts for BNDCFGS
The MSR permission bitmaps are shared by all VMs. However, some VMs
may not be configured to support MPX, even when the host does. If the
host supports VMX and the guest does not, we should intercept accesses
to the BNDCFGS MSR, so that we can synthesize a #GP
fault. Furthermore, if the host does not support MPX and the
"ignore_msrs" kvm kernel parameter is set, then we should intercept
accesses to the BNDCFGS MSR, so that we can skip over the rdmsr/wrmsr
without raising a #GP fault.

Fixes: da8999d318 ("KVM: x86: Intel MPX vmx and msr handle")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-07 16:28:15 +02:00
Nick Desaulniers 9d643f6312 KVM: x86: avoid large stack allocations in em_fxrstor
em_fxstor previously called fxstor_fixup.  Both created instances of
struct fxregs_state on the stack, which triggered the warning:

arch/x86/kvm/emulate.c:4018:12: warning: stack frame size of 1080 bytes
in function
      'em_fxrstor' [-Wframe-larger-than=]
static int em_fxrstor(struct x86_emulate_ctxt *ctxt)
           ^
with CONFIG_FRAME_WARN set to 1024.

This patch does the fixup in em_fxstor now, avoiding one additional
struct fxregs_state, and now fxstor_fixup can be removed as it has no
other call sites.

Further, the calculation for offsets into xmm_space can be shared
between em_fxstor and em_fxsave.

Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
[Clean up calculation of offsets and fix it for 64-bit mode. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-01 11:23:12 +02:00
Dan Carpenter 7461fbc46e KVM: white space cleanup in nested_vmx_setup_ctls_msrs()
This should have been indented one more character over and it should use
tabs.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-01 11:23:11 +02:00
Dan Carpenter e9196cebbe KVM: Tidy the whitespace in nested_svm_check_permissions()
I moved the || to the line before.  Also I replaced some spaces with a
tab on the "return 0;" line.  It looks OK in the diff but originally
that line was only indented 7 spaces.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-01 11:23:11 +02:00
ZhuangYanying 47a66eed99 KVM: x86: Fix nmi injection failure when vcpu got blocked
When spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads,
other than the lock-holding one, would enter into S state because of
pvspinlock. Then inject NMI via libvirt API "inject-nmi", the NMI could
not be injected into vm.

The reason is:
1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets
cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
2 It sets nmi_queued to 0 in process_nmi(), before entering guest, because
cpu->kvm_vcpu_dirty is true.

It's not enough just to check nmi_queued to decide whether to stay in
vcpu_block() or not. NMI should be injected immediately at any situation.
Add checking nmi_pending, and testing KVM_REQ_NMI replaces nmi_queued
in vm_vcpu_has_events().

Do the same change for SMIs.

Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-01 11:23:10 +02:00
Roman Pen d9c1b5431d KVM: SVM: do not zero out segment attributes if segment is unusable or not present
This is a fix for the problem [1], where VMCB.CPL was set to 0 and interrupt
was taken on userspace stack.  The root cause lies in the specific AMD CPU
behaviour which manifests itself as unusable segment attributes on SYSRET.
The corresponding work around for the kernel is the following:

61f01dd941 ("x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue")

In other turn virtualization side treated unusable segment incorrectly and
restored CPL from SS attributes, which were zeroed out few lines above.

In current patch it is assured only that P bit is cleared in VMCB.save state
and segment attributes are not zeroed out if segment is not presented or is
unusable, therefore CPL can be safely restored from DPL field.

This is only one part of the fix, since QEMU side should be fixed accordingly
not to zero out attributes on its side.  Corresponding patch will follow.

[1] Message id: CAJrWOzD6Xq==b-zYCDdFLgSRMPM-NkNuTSDFEtX=7MreT45i7Q@mail.gmail.com

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Mikhail Sennikovskii <mikhail.sennikovskii@profitbricks.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-06-01 11:21:17 +02:00
Gioh Kim 8eae9570d1 KVM: SVM: ignore type when setting segment registers
Commit 19bca6ab75 ("KVM: SVM: Fix cross vendor migration issue with
unusable bit") added checking type when setting unusable.
So unusable can be set if present is 0 OR type is 0.
According to the AMD processor manual, long mode ignores the type value
in segment descriptor. And type can be 0 if it is read-only data segment.
Therefore type value is not related to unusable flag.

This patch is based on linux-next v4.12.0-rc3.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-30 17:17:22 +02:00
Radim Krčmář cbf712792b KVM: nVMX: fix nested_vmx_check_vmptr failure paths under debugging
kvm_skip_emulated_instruction() will return 0 if userspace is
single-stepping the guest.

kvm_skip_emulated_instruction() uses return status convention of exit
handler: 0 means "exit to userspace" and 1 means "continue vm entries".
The problem is that nested_vmx_check_vmptr() return status means
something else: 0 is ok, 1 is error.

This means we would continue executing after a failure.  Static checker
noticed it because vmptr was not initialized.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 6affcbedca ("KVM: x86: Add kvm_skip_emulated_instruction and use it.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-30 17:17:21 +02:00
Jan H. Schönherr 52b5419016 KVM: x86: Fix virtual wire mode
Intel SDM says, that at most one LAPIC should be configured with ExtINT
delivery. KVM configures all LAPICs this way. This causes pic_unlock()
to kick the first available vCPU from the internal KVM data structures.
If this vCPU is not the BSP, but some not-yet-booted AP, the BSP may
never realize that there is an interrupt.

Fix that by enabling ExtINT delivery only for the BSP.

This allows booting a Linux guest without a TSC in the above situation.
Otherwise the BSP gets stuck in calibrate_delay_converge().

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-26 18:01:21 +02:00
Jan H. Schönherr e1d39b17e0 KVM: nVMX: Fix handling of lmsw instruction
The decision whether or not to exit from L2 to L1 on an lmsw instruction is
based on bogus values: instead of using the information encoded within the
exit qualification, it uses the data also used for the mov-to-cr
instruction, which boils down to using whatever is in %eax at that point.

Use the correct values instead.

Without this fix, an L1 may not get notified when a 32-bit Linux L2
switches its secondary CPUs to protected mode; the L1 is only notified on
the next modification of CR0. This short time window poses a problem, when
there is some other reason to exit to L1 in between. Then, L2 will be
resumed in real mode and chaos ensues.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-26 17:59:27 +02:00
Wanpeng Li 5acc1ca4fb KVM: X86: Fix preempt the preemption timer cancel
Preemption can occur during cancel preemption timer, and there will be
inconsistent status in lapic, vmx and vmcs field.

          CPU0                    CPU1

  preemption timer vmexit
  handle_preemption_timer(vCPU0)
    kvm_lapic_expired_hv_timer
      vmx_cancel_hv_timer
        vmx->hv_deadline_tsc = -1
        vmcs_clear_bits
        /* hv_timer_in_use still true */
  sched_out
                           sched_in
                           kvm_arch_vcpu_load
                             vmx_set_hv_timer
                               write vmx->hv_deadline_tsc
                               vmcs_set_bits
                           /* back in kvm_lapic_expired_hv_timer */
                           hv_timer_in_use = false
                           ...
                           vmx_vcpu_run
                             vmx_arm_hv_run
                               write preemption timer deadline
                             spurious preemption timer vmexit
                               handle_preemption_timer(vCPU0)
                                 kvm_lapic_expired_hv_timer
                                   WARN_ON(!apic->lapic_timer.hv_timer_in_use);

This can be reproduced sporadically during boot of L2 on a
preemptible L1, causing a splat on L1.

 WARNING: CPU: 3 PID: 1952 at arch/x86/kvm/lapic.c:1529 kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm]
 CPU: 3 PID: 1952 Comm: qemu-system-x86 Not tainted 4.12.0-rc1+ #24 RIP: 0010:kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm]
  Call Trace:
  handle_preemption_timer+0xe/0x20 [kvm_intel]
  vmx_handle_exit+0xc9/0x15f0 [kvm_intel]
  ? lock_acquire+0xdb/0x250
  ? lock_acquire+0xdb/0x250
  ? kvm_arch_vcpu_ioctl_run+0xdf3/0x1ce0 [kvm]
  kvm_arch_vcpu_ioctl_run+0xe55/0x1ce0 [kvm]
  kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? __fget+0xf3/0x210
  do_vfs_ioctl+0xa4/0x700
  ? __fget+0x114/0x210
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x8f/0x750
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  entry_SYSCALL64_slow_path+0x25/0x25

This patch fixes it by disabling preemption while cancelling
preemption timer.  This way cancel_hv_timer is atomic with
respect to kvm_arch_vcpu_load.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-26 17:46:20 +02:00
Linus Torvalds 08332893e3 Linux 4.12-rc2 2017-05-21 19:30:23 -07:00
Linus Torvalds 33c9e97290 x86: fix 32-bit case of __get_user_asm_u64()
The code to fetch a 64-bit value from user space was entirely buggered,
and has been since the code was merged in early 2016 in commit
b2f680380d ("x86/mm/32: Add support for 64-bit __get_user() on 32-bit
kernels").

Happily the buggered routine is almost certainly entirely unused, since
the normal way to access user space memory is just with the non-inlined
"get_user()", and the inlined version didn't even historically exist.

The normal "get_user()" case is handled by external hand-written asm in
arch/x86/lib/getuser.S that doesn't have either of these issues.

There were two independent bugs in __get_user_asm_u64():

 - it still did the STAC/CLAC user space access marking, even though
   that is now done by the wrapper macros, see commit 11f1a4b975
   ("x86: reorganize SMAP handling in user space accesses").

   This didn't result in a semantic error, it just means that the
   inlined optimized version was hugely less efficient than the
   allegedly slower standard version, since the CLAC/STAC overhead is
   quite high on modern Intel CPU's.

 - the double register %eax/%edx was marked as an output, but the %eax
   part of it was touched early in the asm, and could thus clobber other
   inputs to the asm that gcc didn't expect it to touch.

   In particular, that meant that the generated code could look like
   this:

        mov    (%eax),%eax
        mov    0x4(%eax),%edx

   where the load of %edx obviously was _supposed_ to be from the 32-bit
   word that followed the source of %eax, but because %eax was
   overwritten by the first instruction, the source of %edx was
   basically random garbage.

The fixes are trivial: remove the extraneous STAC/CLAC entries, and mark
the 64-bit output as early-clobber to let gcc know that no inputs should
alias with the output register.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org   # v4.8+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-21 18:26:54 -07:00
Linus Torvalds 334a023ee5 Clean up x86 unsafe_get/put_user() type handling
Al noticed that unsafe_put_user() had type problems, and fixed them in
commit a7cc722fff ("fix unsafe_put_user()"), which made me look more
at those functions.

It turns out that unsafe_get_user() had a type issue too: it limited the
largest size of the type it could handle to "unsigned long".  Which is
fine with the current users, but doesn't match our existing normal
get_user() semantics, which can also handle "u64" even when that does
not fit in a long.

While at it, also clean up the type cast in unsafe_put_user().  We
actually want to just make it an assignment to the expected type of the
pointer, because we actually do want warnings from types that don't
convert silently.  And it makes the code more readable by not having
that one very long and complex line.

[ This patch might become stable material if we ever end up back-porting
  any new users of the unsafe uaccess code, but as things stand now this
  doesn't matter for any current existing uses. ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-21 15:25:46 -07:00
Linus Torvalds f3926e4c2a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc uaccess fixes from Al Viro:
 "Fix for unsafe_put_user() (no callers currently in mainline, but
  anyone starting to use it will step into that) + alpha osf_wait4()
  infoleak fix"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  osf_wait4(): fix infoleak
  fix unsafe_put_user()
2017-05-21 12:06:44 -07:00
Linus Torvalds 970c305aa8 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix:

  Prevent idle task from ever being preempted. That makes sure that
  synchronize_rcu_tasks() which is ignoring idle task does not pretend
  that no task is stuck in preempted state. If that happens and idle was
  preempted on a ftrace trampoline the machine crashes due to
  inconsistent state"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Call __schedule() from do_idle() without enabling preemption
2017-05-21 11:52:00 -07:00
Linus Torvalds e7a3d62749 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of small fixes for the irq subsystem:

   - Cure a data ordering problem with chained interrupts

   - Three small fixlets for the mbigen irq chip"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix chained interrupt data ordering
  irqchip/mbigen: Fix the clear register offset calculation
  irqchip/mbigen: Fix potential NULL dereferencing
  irqchip/mbigen: Fix memory mapping code
2017-05-21 11:45:26 -07:00
Al Viro a8c39544a6 osf_wait4(): fix infoleak
failing sys_wait4() won't fill struct rusage...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:10:07 -04:00
Al Viro a7cc722fff fix unsafe_put_user()
__put_user_size() relies upon its first argument having the same type as what
the second one points to; the only other user makes sure of that and
unsafe_put_user() should do the same.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-21 13:09:57 -04:00
Linus Torvalds 56f410cf45 This fixes a bug caused by not cleaning up the new instance unique triggers
when deleting an instance. It also creates a selftest that triggers that bug.
 
 Fix the delayed optimization happening after kprobes boot up self tests
 being removed by freeing of init memory.
 
 Comment kprobes on why the delay optimization is not a problem for removal
 of modules, to keep other developers from searching that riddle.
 
 Fix another rcu isn't watching in stack trace tracing.
 
 Naveen N. Rao (4):
       ftrace: Simplify glob handling in unregister_ftrace_function_probe_func()
       ftrace/instances: Clear function triggers when removing instances
       selftests/ftrace: Fix bashisms
       selftests/ftrace: Add test to remove instance with active event triggers
 
 Steven Rostedt (1):
       tracing: Move postpone selftests to core from early_initcall
 
 Steven Rostedt (VMware) (3):
       ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub
       kprobes: Document how optimized kprobes are removed from module unload
       tracing: Make sure RCU is watching before calling a stack trace
 
 Thomas Gleixner (1):
       tracing/kprobes: Enforce kprobes teardown after testing
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZIQapFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 A6MIAKFLb6mQ4flRBXpWd2tD2B4DQpQ0H7SovseZnlH6Q7grU6POY/qbNl9xXiBA
 3NavxqbIYokH8cxEqGAusL7ASUFPXJj6erMM1uc1WRuAzMpIjvgNacOtW5R+c5S9
 ofR1xtKlBo/854J/IP6M3J0WqrK+B7TsS1WYKohe/tFMBpolbnFloHVfMMZlaL58
 CQhCoAhkjJRsta6dJhbo+HoQy03VGyWsfFHtutBpIwsf81Naq4Stpxp7jdZLWhB8
 Di5QdOji9lDayK6Uk7DDZqHxbjC9z6cCS9nVWIGHkE4AMpR3peYtsyCaAOBjVMLV
 2OuhuREfZgKaYVMjUfdeYCayDAY=
 =1gek
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix a bug caused by not cleaning up the new instance unique triggers
   when deleting an instance. It also creates a selftest that triggers
   that bug.

 - Fix the delayed optimization happening after kprobes boot up self
   tests being removed by freeing of init memory.

 - Comment kprobes on why the delay optimization is not a problem for
   removal of modules, to keep other developers from searching that
   riddle.

 - Fix another case of rcu not watching in stack trace tracing.

* tag 'trace-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Make sure RCU is watching before calling a stack trace
  kprobes: Document how optimized kprobes are removed from module unload
  selftests/ftrace: Add test to remove instance with active event triggers
  selftests/ftrace: Fix bashisms
  ftrace: Remove #ifdef from code and add clear_ftrace_function_probes() stub
  ftrace/instances: Clear function triggers when removing instances
  ftrace: Simplify glob handling in unregister_ftrace_function_probe_func()
  tracing/kprobes: Enforce kprobes teardown after testing
  tracing: Move postpone selftests to core from early_initcall
2017-05-20 23:39:03 -07:00