1
0
Fork 0
Commit Graph

1963 Commits (761cab513d5e9079ec8ace48ab05d84b0083653b)

Author SHA1 Message Date
Thomas Gleixner d5bb994bcd treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 8 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091650.663497195@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:39:01 +02:00
Thomas Gleixner 8607a96520 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your optional any later version of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:54 +02:00
Linus Torvalds 0ef0fd3515 * ARM: support for SVE and Pointer Authentication in guests, PMU improvements
* POWER: support for direct access to the POWER9 XIVE interrupt controller,
 memory and performance optimizations.
 
 * x86: support for accessing memory not backed by struct page, fixes and refactoring
 
 * Generic: dirty page tracking improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg
 PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE
 nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy
 Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7
 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n
 hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I=
 =D0+p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
2019-05-17 10:33:30 -07:00
Paolo Bonzini 59c5c58c5b Second PPC KVM update for 5.2
- Fix a bug, fix a spelling mistake, remove some useless code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJc2kTEAAoJEJ2a6ncsY3GfS88IAImcIlKXMvzSKtHFxGpRap17
 9LTZs5MQAUZHVMFJXmrJLBgogtGxUw53aX53woeyerytZsoGU4+YzwgLhk4XBEzA
 5Kt5ahlxu82sa2ThH1zyLlNWFXiTECgD5ErNTdavLbNlaKE8YG160+65/mSyixGz
 vs5wLSYGv/37no1ay6PIZ3DtwqdrYq5nJbuG+ZsaamUHPJOGprqHqg0gaTJ877NZ
 yQDUS7OVuEJ1pdUUK/elP+cnlqR9smaP5OUNsXYMHWJgPJMjc27/thBJy93iS1kk
 /zKQ8AFmxqoaePnR7ymTbqurfFFHBiSavUmyWopSQppNHCf4DDE8XjLs9MXKez8=
 =Lco4
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM update for 5.2

* Support for guests to access the new POWER9 XIVE interrupt controller
  hardware directly, reducing interrupt latency and overhead for guests.

* In-kernel implementation of the H_PAGE_INIT hypercall.

* Reduce memory usage of sparsely-populated IOMMU tables.

* Several bug fixes.

Second PPC KVM update for 5.2

* Fix a bug, fix a spelling mistake, remove some useless code.
2019-05-15 23:39:38 +02:00
Ira Weiny 73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Colin Ian King 3f8cb76c80 KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
There is a spelling mistake in a pr_err message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-14 12:06:03 +10:00
Paul Mackerras 2eeeaf16aa KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
Commit 70ea13f6e6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix
threads", 2019-04-29) aimed to make radix guests that are using the
real-mode entry path load the LPID register and flush the TLB in the
same place where those things are done for HPT guests.  However, it
omitted to remove a branch which branches around that code for radix
guests.  The result is that with indep_thread_mode = N, radix guests
don't run correctly.  (With indep_threads_mode = Y, which is the
default, radix guests use a different entry path.)

This removes the offending branch, and also the load and compare that
the branch depends on, since the cr7 setting is now unused.

Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Fixes: 70ea13f6e6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-14 12:05:24 +10:00
Linus Torvalds b970afcfca powerpc updates for 5.2
Highlights:
 
  - Support for Kernel Userspace Access/Execution Prevention (like
    SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents the kernel
    from accidentally accessing userspace outside copy_to/from_user(), or
    ever executing userspace.
 
  - KASAN support on 32-bit.
 
  - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to use the
    same address ranges we use with the Radix MMU.
 
  - A rewrite into C of large parts of our idle handling code for 64-bit Book3S
    (ie. power8 & power9).
 
  - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup in the
    null_syscall benchmark.
 
  - On 64-bit bare metal we have support for recovering from errors with the time
    base (our clocksource), however if that fails currently we hang in __delay()
    and never crash. We now have support for detecting that case and short
    circuiting __delay() so we at least panic() and reboot.
 
  - Add support for optionally enabling the DAWR on Power9, which had to be
    disabled by default due to a hardware erratum. This has the effect of
    enabling hardware breakpoints for GDB, the downside is a badly behaved
    program could crash the machine by pointing the DAWR at cache inhibited
    memory. This is opt-in obviously.
 
  - xmon, our crash handler, gets support for a read only mode where operations
    that could change memory or otherwise disturb the system are disabled.
 
 Plus many clean-ups, reworks and minor fixes etc.
 
 Thanks to:
   Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Andrew
   Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Ben Hutchings,
   Bo YU, Breno Leitao, Cédric Le Goater, Christopher M. Riedl, Christoph
   Hellwig, Colin Ian King, David Gibson, Ganesh Goudar, Gautham R. Shenoy,
   George Spelvin, Greg Kroah-Hartman, Greg Kurz, Horia Geantă, Jagadeesh
   Pagadala, Joel Stanley, Joe Perches, Julia Lawall, Laurentiu Tudor, Laurent
   Vivier, Lukas Bulwahn, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu
   Malaterre, Michael Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch,
   Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
   Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith, Sukadev
   Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler, Valentin
   Schneider, Wei Yongjun, Wen Yang, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc1WbwAAoJEFHr6jzI4aWAv5cP/iDskai4Az/GCa6yLj4b+det
 7mc7tTOaEzhUtvfrYYfHgvvdNNzo1ETv7rqTdZqtWJ3xfwdeowLFXXZwSywZKUDB
 bi4pcl2v55Qlf9kxgx9RDr6+4fTwGG4nhO2qPDJDR1umEih9mG/2HJ7d+Wnq6Va2
 E9srd+R6Fa0ty88+9vzBtdyllnDK1XHu3ahsxCH62aRm79ucuVrxyydWmbbs5lJe
 a7g/OQIPgZmObHhfXvw9DFkOvkp5Pm6hfHOeyQH2nTB5X6k0judWv00uoHTJgOuP
 DKxZtDhaGnajUfuhQYboDPOuFjY7lkfgEXaagyZsjdudqridTMmv1iU1o7iy8BT4
 AId4DyJbvFFgqRJkCwKzhKRRHPfFMfM7KTJ38GPZuPmniuULk9uiIy6JyY0tXO+l
 UQEclPzOTPkAE12FBaOBuqZqTRuBQuokWQF8ZDPOxbNAixHgFoRd4Z9diNwCPpLu
 +KoyCwd2Gm5DyX+mC85sWG28IPKi9Hhhw2XBOA5F4A2kH6uFa1BnERSRGYomx+pc
 BvEXHglf/vgV0XUQZfDCsiOecIKYuWxgre0/liLhhU5qMss2pxHczzffH4KtdykS
 9y7o3mVRcS7Moitbmb6SAJoQxbR5QhzfN832DbSd6jEfKdg1ytZlfHTG0WZYHKDs
 PHs6V1N+cQANdukutrJz
 =cUkd
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Slightly delayed due to the issue with printk() calling
  probe_kernel_read() interacting with our new user access prevention
  stuff, but all fixed now.

  The only out-of-area changes are the addition of a cpuhp_state, small
  additions to Documentation and MAINTAINERS updates.

  Highlights:

   - Support for Kernel Userspace Access/Execution Prevention (like
     SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents
     the kernel from accidentally accessing userspace outside
     copy_to/from_user(), or ever executing userspace.

   - KASAN support on 32-bit.

   - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to
     use the same address ranges we use with the Radix MMU.

   - A rewrite into C of large parts of our idle handling code for
     64-bit Book3S (ie. power8 & power9).

   - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup
     in the null_syscall benchmark.

   - On 64-bit bare metal we have support for recovering from errors
     with the time base (our clocksource), however if that fails
     currently we hang in __delay() and never crash. We now have support
     for detecting that case and short circuiting __delay() so we at
     least panic() and reboot.

   - Add support for optionally enabling the DAWR on Power9, which had
     to be disabled by default due to a hardware erratum. This has the
     effect of enabling hardware breakpoints for GDB, the downside is a
     badly behaved program could crash the machine by pointing the DAWR
     at cache inhibited memory. This is opt-in obviously.

   - xmon, our crash handler, gets support for a read only mode where
     operations that could change memory or otherwise disturb the system
     are disabled.

  Plus many clean-ups, reworks and minor fixes etc.

  Thanks to: Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey
  Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
  Anton Blanchard, Ben Hutchings, Bo YU, Breno Leitao, Cédric Le Goater,
  Christopher M. Riedl, Christoph Hellwig, Colin Ian King, David Gibson,
  Ganesh Goudar, Gautham R. Shenoy, George Spelvin, Greg Kroah-Hartman,
  Greg Kurz, Horia Geantă, Jagadeesh Pagadala, Joel Stanley, Joe
  Perches, Julia Lawall, Laurentiu Tudor, Laurent Vivier, Lukas Bulwahn,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
  Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch, Nicholas Piggin,
  Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi
  Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith,
  Sukadev Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler,
  Valentin Schneider, Wei Yongjun, Wen Yang, YueHaibing"

* tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (205 commits)
  powerpc/64s: Use early_mmu_has_feature() in set_kuap()
  powerpc/book3s/64: check for NULL pointer in pgd_alloc()
  powerpc/mm: Fix hugetlb page initialization
  ocxl: Fix return value check in afu_ioctl()
  powerpc/mm: fix section mismatch for setup_kup()
  powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile
  powerpc/mm: Fix makefile for KASAN
  powerpc/kasan: add missing/lost Makefile
  selftests/powerpc: Add a signal fuzzer selftest
  powerpc/booke64: set RI in default MSR
  ocxl: Provide global MMIO accessors for external drivers
  ocxl: move event_fd handling to frontend
  ocxl: afu_irq only deals with IRQ IDs, not offsets
  ocxl: Allow external drivers to use OpenCAPI contexts
  ocxl: Create a clear delineation between ocxl backend & frontend
  ocxl: Don't pass pci_dev around
  ocxl: Split pci.c
  ocxl: Remove some unused exported symbols
  ocxl: Remove superfluous 'extern' from headers
  ocxl: read_pasid never returns an error, so make it void
  ...
2019-05-10 05:29:27 -07:00
Linus Torvalds eac7078a0f pidfd patches for v5.2-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlzReuoACgkQjrBW1T7s
 sS1uvBAA16pgnhRNxNTrp3LYft6lUWmF4n0baOTVtQNLhPjpwaOxHIrCBugkQCJB
 QcQ9IQSOvIkaEW0XAQoPBaeLviiKhHOFw1Fv89OtW6xUidSfSV15lcI9f1F2pCm2
 4yCL/8XvL6M0NhxiwftJAkWOXeDNLfjFnLwyLxBfgg3EeyqMgUB8raeosEID0ORR
 gm2/g8DYS2r+KNqM/F4xvMSgabfi2bGk+8BtAaVnftJfstpRNrqKwWnSK3Wspj1l
 5gkb8gSsiY6ns3V6RgNHrFlhevFg8V+VjcJt7FR+aUEjOkcoiXas/PhvamMzdsn/
 FM1F/A0pM8FSybIUClhnnnxNPc+p8ZN/71YQAPs+Mnh3xvbtKea2lkhC+Xv4OpK3
 edutSZWFaiIery82Rk00H3vqiSF1+kRIXSpZSS4mElk4FsVljkyH+nSP7rbmE2MR
 EQe+kKnZl8QzWrVbnODC+EVvvVpA2bXDvENJmvKqus+t2G0OdV7Iku3F5E3KjF8k
 S5RRV1zuBF3ugqnjmYrVmJtpEA8mxClmqvg6okru+qW6ngO5oOgVpPLjWn1CXcdj
 wcuQ6Pe1QwAHS54e9WSWgCHVssLvm9nCdCqypdNaoyGWmbTWntwlrY7Y0JUQnAbB
 6/G/DQQiCWY9y8bMZlTEydhIpgcsdROuPYv+oHF5+eQQthsWwHc=
 =LH11
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This patchset makes it possible to retrieve pidfds at process creation
  time by introducing the new flag CLONE_PIDFD to the clone() system
  call. Linus originally suggested to implement this as a new flag to
  clone() instead of making it a separate system call.

  After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
  parent_tidptr argument. This means we can give back the associated pid
  and the pidfd at the same time. Access to process metadata information
  thus becomes rather trivial.

  As has been agreed, CLONE_PIDFD creates file descriptors based on
  anonymous inodes similar to the new mount api. They are made
  unconditional by this patchset as they are now needed by core kernel
  code (vfs, pidfd) even more than they already were before (timerfd,
  signalfd, io_uring, epoll etc.). The core patchset is rather small.
  The bulky looking changelist is caused by David's very simple changes
  to Kconfig to make anon inodes unconditional.

  A pidfd comes with additional information in fdinfo if the kernel
  supports procfs. The fdinfo file contains the pid of the process in
  the callers pid namespace in the same format as the procfs status
  file, i.e. "Pid:\t%d".

  To remove worries about missing metadata access this patchset comes
  with a sample/test program that illustrates how a combination of
  CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
  access to process metadata through /proc/<pid>.

  Further work based on this patchset has been done by Joel. His work
  makes pidfds pollable. It finished too late for this merge window. I
  would prefer to have it sitting in linux-next for a while and send it
  for inclusion during the 5.3 merge window"

* tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  samples: show race-free pidfd metadata access
  signal: support CLONE_PIDFD with pidfd_send_signal
  clone: add CLONE_PIDFD
  Make anon_inodes unconditional
2019-05-07 12:30:24 -07:00
Michael Ellerman bdc7c970bc Merge branch 'topic/ppc-kvm' into next
Merge our topic branch shared with KVM. In particular this includes the
rewrite of the idle code into C.
2019-04-30 22:52:03 +10:00
Nicholas Piggin 10d91611f4 powerpc/64s: Reimplement book3s idle code in C
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code.

Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.

The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.

This is not a strict translation to C code, there are some
significant differences:

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them itself.

- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
  or change MSR is restored, because it's now simple to do. ESL=1
  sleeps that do not lose GPRs can use this optimization too.

- KVM secondary entry and cede is now more of a call/return style
  rather than branchy. nap_state_lost is not required because KVM
  always returns via NVGPR restoring path.

- KVM secondary wakeup from offline sequence is moved entirely into
  the offline wakeup, which avoids a hwsync in the normal idle wakeup
  path.

Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. Deep states
it's in the noise compared with other latencies.

KVM improvements:

- Idle sleepers now always return to caller rather than branch out
  to KVM first.

- This allows optimisations like very fast return to caller when no
  state has been lost.

- KVM no longer requires nap_state_lost because it controls NVGPR
  save/restore itself on the way in and out.

- The heavy idle wakeup KVM request check can be moved out of the
  normal host idle code and into the not-performance-critical offline
  code.

- KVM nap code now returns from where it is called, which makes the
  flow a bit easier to follow.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30 22:37:48 +10:00
Paul Mackerras 0caecf5b00 KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device close
This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device
is closed, KVM will not try to enable or disable any of the escalation
interrupts for the VCPUs.  We don't have to worry about races between
clearing the pointers and use of the pointers by the XIVE context
push/pull code, because the callers hold the vcpu->mutex, which is
also taken by the KVM_RUN code.  Therefore the vcpu cannot be entering
or exiting the guest concurrently.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:41:01 +10:00
Paul Mackerras 6f868405fa KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device
Now that we have the possibility of a XIVE or XICS-on-XIVE device being
released while the VM is still running, we need to be careful about
races and potential use-after-free bugs.  Although the kvmppc_xive
struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu
structs are freed, and they are used extensively in both the XIVE native
and XICS-on-XIVE code.

There are various ways in which XIVE code gets invoked:

- VCPU entry and exit, which do push and pull operations on the XIVE hardware
- one_reg get and set functions (vcpu->mutex is held)
- XICS hypercalls (but only inside guest execution, not from
  kvmppc_pseries_do_hcall)
- device creation calls (kvm->lock is held)
- device callbacks - get/set attribute, mmap, pagefault, release/destroy
- set_mapped/clr_mapped calls (kvm->lock is held)
- connect_vcpu calls
- debugfs file read callbacks

Inside a device release function, we know that userspace cannot have an
open file descriptor referring to the device, nor can it have any mmapped
regions from the device.  Therefore the device callbacks are excluded,
as are the connect_vcpu calls (since they need a fd for the device).
Further, since the caller holds the kvm->lock mutex, no other device
creation calls or set/clr_mapped calls can be executing concurrently.

To exclude VCPU execution and XICS hypercalls, we temporarily set
kvm->arch.mmu_ready to 0.  This forces any VCPU task that is trying to
enter the guest to take the kvm->lock mutex, which is held by the caller
of the release function.  Then, sending an IPI to all other CPUs forces
any VCPU currently executing in the guest to exit.

Finally, we take the vcpu->mutex for each VCPU around the process of
cleaning up and freeing its XIVE data structures, in order to exclude
any one_reg get/set calls.

To exclude the debugfs read callbacks, we just need to ensure that
debugfs_remove is called before freeing any data structures.  Once it
returns we know that no CPU can be executing the callbacks (for our
kvmppc_xive instance).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:41:01 +10:00
Cédric Le Goater 5422e95103 KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method
When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.

We introduce 'release' methods for the XICS-on-XIVE and the XIVE
native KVM devices which are called when the file descriptor of the
device is closed after the TIMA and ESB pages have been unmapped.
They perform the necessary cleanups : clear the vCPU interrupt
presenters that could be attached and then destroy the device. The
'release' methods replace the 'destroy' methods as 'destroy' is not
called anymore once 'release' is. Compatibility with older QEMU is
nevertheless maintained.

This is not considered as a safe operation as the vCPUs are still
running and could be referencing the KVM device through their
presenters. To protect the system from any breakage, the kvmppc_xive
objects representing both KVM devices are now stored in an array under
the VM. Allocation is performed on first usage and memory is freed
only when the VM exits.

[paulus@ozlabs.org - Moved freeing of xive structures to book3s.c,
 put it under #ifdef CONFIG_KVM_XICS.]

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:40:39 +10:00
Cédric Le Goater 3fab2d1058 KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode
Full support for the XIVE native exploitation mode is now available,
advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
Hypervisor) is not yet available. XIVE should also have been activated
which is default setting on POWER9 systems running a recent Linux
kernel.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 232b984b7d KVM: PPC: Book3S HV: XIVE: Add passthrough support
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
implement an IRQ space for the guest using the generic IPI interrupts
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information
on the interrupts, among which the xive_irq_data struct containing the
addresses of the IPI ESB pages and an extra one in case of pass-through.
The later contains the addresses of the ESB pages of the underlying HW
controller interrupts, PHB4 in all cases for now.

A guest, when running in the XICS legacy interrupt mode, lets the KVM
XICS-over-XIVE device "handle" interrupt management, that is to
perform the loads and stores on the addresses of the ESB pages of the
guest interrupts. However, when running in XIVE native exploitation
mode, the KVM XIVE native device exposes the interrupt ESB pages to
the guest and lets the guest perform directly the loads and stores.

The VMA exposing the ESB pages make use of a custom VM fault handler
which role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB
pages of associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native
exploitation mode adds some extra refinements because the ESB pages
of a different HW controller (PHB4) need to be exposed to the guest
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanic is the same.

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the
guest IRQ number being mapped and let the VM fault handler repopulate.
The handler will insert the ESB page corresponding to the HW interrupt
of the device being passed-through or the initial IPI ESB page if the
device is being removed.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 6520ca64cd KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages
Each source is associated with an Event State Buffer (ESB) with a
even/odd pair of pages which provides commands to manage the source:
to trigger, to EOI, to turn off the source for instance.

The custom VM fault handler will deduce the guest IRQ number from the
offset of the fault, and the ESB page of the associated XIVE interrupt
will be inserted into the VMA using the internal structure caching
information on the interrupts.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 39e9af3de5 KVM: PPC: Book3S HV: XIVE: Add a TIMA mapping
Each thread has an associated Thread Interrupt Management context
composed of a set of registers. These registers let the thread handle
priority management and interrupt acknowledgment. The most important
are :

    - Interrupt Pending Buffer     (IPB)
    - Current Processor Priority   (CPPR)
    - Notification Source Register (NSR)

They are exposed to software in four different pages each proposing a
view with a different privilege. The first page is for the physical
thread context and the second for the hypervisor. Only the third
(operating system) and the fourth (user level) are exposed the guest.

A custom VM fault handler will populate the VMA with the appropriate
pages, which should only be the OS page for now.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater e4945b9da5 KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state
The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VPCU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater e6714bd167 KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages
When migration of a VM is initiated, a first copy of the RAM is
transferred to the destination before the VM is stopped, but there is
no guarantee that the EQ pages in which the event notifications are
queued have not been modified.

To make sure migration will capture a consistent memory state, the
XIVE device should perform a XIVE quiesce sequence to stop the flow of
event notifications and stabilize the EQs. This is the purpose of the
KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty
to force their transfer.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 7b46b6169a KVM: PPC: Book3S HV: XIVE: Add a control to sync the sources
This control will be used by the H_INT_SYNC hcall from QEMU to flush
event notifications on the XIVE IC owning the source.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 5ca8064748 KVM: PPC: Book3S HV: XIVE: Add a global reset control
This control is to be used by the H_INT_RESET hcall from QEMU. Its
purpose is to clear all configuration of the sources and EQs. This is
necessary in case of a kexec (for a kdump kernel for instance) to make
sure that no remaining configuration is left from the previous boot
setup so that the new kernel can start safely from a clean state.

The queue 7 is ignored when the XIVE device is configured to run in
single escalation mode. Prio 7 is used by escalations.

The XIVE VP is kept enabled as the vCPU is still active and connected
to the XIVE device.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 13ce3297c5 KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration
These controls will be used by the H_INT_SET_QUEUE_CONFIG and
H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
Event Queue in the XIVE IC. They will also be used to restore the
configuration of the XIVE EQs and to capture the internal run-time
state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
the EQ toggle bit and EQ index which are updated by the XIVE IC when
event notifications are enqueued in the EQ.

The value of the guest physical address of the event queue is saved in
the XIVE internal xive_q structure for later use. That is when
migration needs to mark the EQ pages dirty to capture a consistent
memory state of the VM.

To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
but restoring the EQ state will.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater e8676ce50e KVM: PPC: Book3S HV: XIVE: Add a control to configure a source
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
QEMU to configure the target of a source and also to restore the
configuration of a source when migrating the VM.

The XIVE source interrupt structure is extended with the value of the
Effective Interrupt Source Number. The EISN is the interrupt number
pushed in the event queue that the guest OS will use to dispatch
events internally. Caching the EISN value in KVM eases the test when
checking if a reconfiguration is indeed needed.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 4131f83c3d KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
The XIVE KVM device maintains a list of interrupt sources for the VM
which are allocated in the pool of generic interrupts (IPIs) of the
main XIVE IC controller. These are used for the CPU IPIs as well as
for virtual device interrupts. The IRQ number space is defined by
QEMU.

The XIVE device reuses the source structures of the XICS-on-XIVE
device for the source blocks (2-level tree) and for the source
interrupts. Under XIVE native, the source interrupt caches mostly
configuration information and is less used than under the XICS-on-XIVE
device in which hcalls are still necessary at run-time.

When a source is initialized in KVM, an IPI interrupt source is simply
allocated at the OPAL level and then MASKED. KVM only needs to know
about its type: LSI or MSI.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater eacc56bb9d KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
let QEMU connect the vCPU presenters to the XIVE KVM device if
required. The capability is not advertised for now as the full support
for the XIVE native exploitation mode is not yet available. When this
is case, the capability will be advertised on PowerNV Hypervisors
only. Nested guests (pseries KVM Hypervisor) are not supported.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Cédric Le Goater 90c73795af KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode
This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new KVM device
to be created by QEMU, only available when running on a L0 hypervisor.
Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arise.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:35:16 +10:00
Paul Mackerras a878957a81 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch from the powerpc tree to get
patches which touch both general powerpc code and KVM code, one of
which is a prerequisite for following patches.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:47 +10:00
Suraj Jitindar Singh 44b198aee1 KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry()
On POWER9 and later processors where the host can schedule vcpus on a
per thread basis, there is a streamlined entry path used when the guest
is radix. This entry path saves/restores the fp and vr state in
kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
load_[fp/vr]_state(). This is the same as the old entry path however the
old entry path also saved/restored the VRSAVE register, which isn't done
in the new entry path.

This means that the vrsave register is now volatile across guest exit,
which is an incorrect change in behaviour.

Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
This restores the old, correct, behaviour.

Fixes: 95a6432ce9 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:19 +10:00
Paul Mackerras 70ea13f6e6 KVM: PPC: Book3S HV: Flush TLB on secondary radix threads
When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
If those guests are in radix mode, we fail to load the LPID and flush
the TLB if necessary, leading to the guest crashing with an
unsupported MMU fault.  This arises from commit 9a4506e11b ("KVM:
PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
with relocation on", 2018-05-17), which didn't consider the case
where indep_threads_mode = N.

For simplicity, this makes the real-mode guest entry path flush the
TLB in the same place for both radix and hash guests, as we did before
9a4506e11b, though the code is now C code rather than assembly code.
We also have the radix TLB flush open-coded rather than calling
radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
called in real mode, and in real mode we don't want to invoke the
tracepoint code.

Fixes: 9a4506e11b ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:12 +10:00
Paul Mackerras 2940ba0c48 KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code
This replaces assembler code in book3s_hv_rmhandlers.S that checks
the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
with C code in book3s_hv_builtin.c.  Note that unlike the radix
version, the hash version doesn't do an explicit ERAT invalidation
because we will invalidate and load up the SLB before entering the
guest, and that will invalidate the ERAT.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:32:01 +10:00
Suraj Jitindar Singh 7ae9bda7ed KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push code
The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
context to the hardware currently assumes it is being called in real
mode, which is usually true.  There is however a path by which it can
be executed in virtual mode, in the case where indep_threads_mode = N.
A virtual CPU executing on an offline secondary thread can take a
hypervisor interrupt in virtual mode and return from the
kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
It is possible for it to be given another vcpu to execute before it
gets to execute the stop instruction.  In that case it will call
kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
vCPU push code will be executed in virtual mode.  The result in that
case will be a host crash due to an unexpected data storage interrupt
caused by executing the stdcix instruction in virtual mode.

This fixes it by adding a code path for virtual mode, which uses the
virtual TIMA pointer and normal load/store instructions.

[paulus@ozlabs.org - wrote patch description]

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:31:52 +10:00
Paul Mackerras 1f80ba3d20 KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0
This fixes a bug in the XICS emulation on POWER9 machines which is
triggered by the guest doing a H_IPI with priority = 0 (the highest
priority).  What happens is that the notification interrupt arrives
at the destination at priority zero.  The loop in scan_interrupts()
sees that a priority 0 interrupt is pending, but because xc->mfrr is
zero, we break out of the loop before taking the notification
interrupt out of the queue and EOI-ing it.  (This doesn't happen
when xc->mfrr != 0; in that case we process the priority-0 notification
interrupt on the first iteration of the loop, and then break out of
a subsequent iteration of the loop with hirq == XICS_IPI.)

To fix this, we move the prio >= xc->mfrr check down to near the end
of the loop.  However, there are then some other things that need to
be adjusted.  Since we are potentially handling the notification
interrupt and also delivering an IPI to the guest in the same loop
iteration, we need to update pending and handle any q->pending_count
value before the xc->mfrr check, rather than at the end of the loop.
Also, we need to update the queue pointers when we have processed and
EOI-ed the notification interrupt, since we may not do it later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 19:29:23 +10:00
Palmer Dabbelt 6fabc9f20c KVM: PPC: Book3S HV: smb->smp comment fixup
I made the same typo when trying to grep for uses of smp_wmb and figured
I might as well fix it.

Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy e1a1ef84cd KVM: PPC: Book3S: Allocate guest TCEs on demand too
We already allocate hardware TCE tables in multiple levels and skip
intermediate levels when we can, now it is a turn of the KVM TCE tables.
Thankfully these are allocated already in 2 levels.

This moves the table's last level allocation from the creating helper to
kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
be done in real mode, this creates a virtual mode version of
kvmppc_tce_put() which handles allocations.

This adds kvmppc_rm_ioba_validate() to do an additional test if
the consequent kvmppc_tce_put() needs a page which has not been allocated;
if this is the case, we bail out to virtual mode handlers.

The allocations are protected by a new mutex as kvm->lock is not suitable
for the task because the fault handler is called with the mmap_sem held
but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy 2001825efc KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers
The kvmppc_tce_to_ua() helper is called from real and virtual modes
and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
However if the lockdep debugging is on, the lockdep will most likely break
in kvm_memslots() because of srcu_dereference_check() so we need to use
PPC-own kvm_memslots_raw() which uses realmode safe
rcu_dereference_raw_notrace().

This creates a realmode copy of kvmppc_tce_to_ua() which replaces
kvm_memslots() with kvm_memslots_raw().

Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.

This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
drops the prmap parameter which was never used in the virtual mode.

Fixes: d3695aa4f4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:13 +10:00
Alexey Kardashevskiy 3309bec85e KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest
The trace_hardirqs_on() sets current->hardirqs_enabled and from here
the lockdep assumes interrupts are enabled although they are remain
disabled until the context switches to the guest. Consequent
srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
disabled interrupts and prints a warning (see below).

This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
prevent lockdep from being confused.

DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
[...]
NIP [c000000000185b84] check_flags.part.25+0x224/0x280
LR [c000000000185b80] check_flags.part.25+0x220/0x280
Call Trace:
[c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
[c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
[c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
[c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
[c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
Instruction dump:
419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
irq event stamp: 1025
hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
softirqs last disabled at (0): [<0000000000000000>]           (null)
---[ end trace 31180adcc848993e ]---
possible reason: unannotated irqs-off.
irq event stamp: 1025
hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
softirqs last disabled at (0): [<0000000000000000>]           (null)

Fixes: 8b24e69fc4 ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:12 +10:00
Suraj Jitindar Singh eadfb1c5f8 KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler
Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.

The in-kernel real mode handler halves the time to handle this H_CALL
compared to handling it in userspace for a hash guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:43:12 +10:00
Suraj Jitindar Singh 2d34d1c3bb KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler
Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
used to zero or copy a guest page. The page is defined to be 4k and must
be 4k aligned.

The in-kernel handler halves the time to handle this H_CALL compared to
handling it in userspace for a radix guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30 14:41:20 +10:00
Aneesh Kumar K.V 0034d395f8 powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
This patch maps vmalloc, IO and vmemap regions in the 0xc address range
instead of the current 0xd and 0xf range. This brings the mapping closer
to radix translation mode.

With hash 64K page size each of this region is 512TB whereas with 4K config
we are limited by the max page table range of 64TB and hence there regions
are of 16TB size.

The kernel mapping is now:

 On 4K hash

     kernel_region_map_size = 16TB
     kernel vmalloc start   = 0xc000100000000000
     kernel IO start        = 0xc000200000000000
     kernel vmemmap start   = 0xc000300000000000

64K hash, 64K radix and 4k radix:

     kernel_region_map_size = 512TB
     kernel vmalloc start   = 0xc008000000000000
     kernel IO start        = 0xc00a000000000000
     kernel vmemmap start   = 0xc00c000000000000

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-21 23:12:39 +10:00
Michael Neuling c1fe190c06 powerpc: Add force enable of DAWR on P9 option
This adds a flag so that the DAWR can be enabled on P9 via:
  echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous

The DAWR was previously force disabled on POWER9 in:
  9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt

This is a dangerous setting, USE AT YOUR OWN RISK.

Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR.  This
allows them to force enable DAWR.

This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine once again safe from crashing.

Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.

For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.

To double check the DAWR is working, run this kernel selftest:
  tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-20 22:20:45 +10:00
David Howells 5dd50aaeb1
Make anon_inodes unconditional
Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-19 14:03:11 +02:00
Paolo Bonzini c110ae578c kvm: move KVM_CAP_NR_MEMSLOTS to common code
All architectures except MIPS were defining it in the same way,
and memory slots are handled entirely by common code so there
is no point in keeping the definition per-architecture.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:39:08 +02:00
Alexey Kardashevskiy 345077c8e1 KVM: PPC: Book3S: Protect memslots while validating user address
Guest physical to user address translation uses KVM memslots and reading
these requires holding the kvm->srcu lock. However recently introduced
kvmppc_tce_validate() broke the rule (see the lockdep warning below).

This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
kvmppc_tce_validate() as well.

=============================
WARNING: suspicious RCU usage
5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
-----------------------------
include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/8020:
 #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]

stack backtrace:
CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
Call Trace:
[c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
[c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
[c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
[c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
[c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
[c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
[c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
[c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70

Fixes: 42de7b9e21 ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-05 14:37:24 +11:00
Suraj Jitindar Singh 7cb9eb106d KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit
There is a hardware bug in some POWER9 processors where a treclaim in
fake suspend mode can cause an inconsistency in the XER[SO] bit across
the threads of a core, the workaround being to force the core into SMT4
when doing the treclaim.

The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
thread is in fake suspend or real suspend. The important difference here
being that thread reconfiguration is blocked in real suspend but not
fake suspend mode.

When we exit a guest which was in fake suspend mode, we force the core
into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
However on the new exit path introduced with the function
kvmhv_run_single_vcpu() we restore the host PSSCR before calling
kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
put the thread into real suspend mode when we clear the
PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
and the thread which is trying to get the core into SMT4 before it can
do the treclaim spins forever since it itself is blocking thread
reconfiguration. The result is that that core is essentially lost.

This results in a trace such as:
[   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
[   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
[   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
[   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
[   93.512914] CFAR: c000000000098a5c IRQMASK: 3
[   93.512915] PACATMSCRATCH: 0000000000000001
[   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
[   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
[   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
[   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
[   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
[   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
[   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
[   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
[   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
[   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
[   93.512960] Call Trace:
[   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
[   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
[   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
[   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
[   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
[   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
[   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
[   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
[   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
[   93.512983] Instruction dump:
[   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
[   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214

To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
perform the treclaim. Note kvmppc_save_tm_hv() clears the
PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.

Fixes: 95a6432ce9 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-05 14:37:24 +11:00
Linus Torvalds 636deed6c0 ARM: some cleanups, direct physical timer assignment, cache sanitization
for 32-bit guests
 
 s390: interrupt cleanup, introduction of the Guest Information Block,
 preparation for processor subfunctions in cpu models
 
 PPC: bug fixes and improvements, especially related to machine checks
 and protection keys
 
 x86: many, many cleanups, including removing a bunch of MMU code for
 unnecessary optimizations; plus AVIC fixes.
 
 Generic: memcg accounting
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJci+7XAAoJEL/70l94x66DUMkIAKvEefhceySHYiTpfefjLjIC
 16RewgHa+9CO4Oo5iXiWd90fKxtXLXmxDQOS4VGzN0rxvLGRw/fyXIxL1MDOkaAO
 l8SLSNuewY4XBUgISL3PMz123r18DAGOuy9mEcYU/IMesYD2F+wy5lJ17HIGq6X2
 RpoF1p3qO1jfkPTKOob6Ixd4H5beJNPKpdth7LY3PJaVhDxgouj32fxnLnATVSnN
 gENQ10fnt8BCjshRYW6Z2/9bF15JCkUFR1xdBW2/xh1oj+kvPqqqk2bEN1eVQzUy
 2hT/XkwtpthqjSbX8NNavWRSFnOnbMLTRKQyIXmFVsM5VoSrwtiGsCFzBgcT++I=
 =XIzU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - some cleanups
   - direct physical timer assignment
   - cache sanitization for 32-bit guests

  s390:
   - interrupt cleanup
   - introduction of the Guest Information Block
   - preparation for processor subfunctions in cpu models

  PPC:
   - bug fixes and improvements, especially related to machine checks
     and protection keys

  x86:
   - many, many cleanups, including removing a bunch of MMU code for
     unnecessary optimizations
   - AVIC fixes

  Generic:
   - memcg accounting"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
  kvm: vmx: fix formatting of a comment
  KVM: doc: Document the life cycle of a VM and its resources
  MAINTAINERS: Add KVM selftests to existing KVM entry
  Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
  KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
  KVM: PPC: Fix compilation when KVM is not enabled
  KVM: Minor cleanups for kvm_main.c
  KVM: s390: add debug logging for cpu model subfunctions
  KVM: s390: implement subfunction processor calls
  arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
  KVM: arm/arm64: Remove unused timer variable
  KVM: PPC: Book3S: Improve KVM reference counting
  KVM: PPC: Book3S HV: Fix build failure without IOMMU support
  Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
  x86: kvmguest: use TSC clocksource if invariant TSC is exposed
  KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
  KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
  KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
  KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
  KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
  ...
2019-03-15 15:00:28 -07:00
Michael Ellerman 2de04718ec Merge branch 'topic/ppc-kvm' into next
Merge another commit in the topic/ppc-kvm branch we're sharing with
kvm-ppc.
2019-03-02 14:42:28 +11:00
Suraj Jitindar Singh 2b57ecd020 KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
Add KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST &
KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE to the characteristics returned
from the H_GET_CPU_CHARACTERISTICS H-CALL, as queried from either the
hypervisor or the device tree.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-03-01 15:11:14 +11:00
Christophe Leroy 92ab45c5f2 powerpc: Avoid circular header inclusion in mmu-hash.h
When activating CONFIG_THREAD_INFO_IN_TASK, linux/sched.h includes
asm/current.h. This generates a circular dependency. To avoid that,
asm/processor.h shall not be included in mmu-hash.h.

In order to do that, this patch moves into a new header called
asm/task_size_64/32.h all the TASK_SIZE related constants, which can
then be included in mmu-hash.h directly.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out all the TASK_SIZE constants not just 64-bit ones]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-23 22:31:39 +11:00
Paolo Bonzini 54a1f393ce PPC KVM update for 5.1
There are no major new features this time, just a collection of bug
 fixes and improvements in various areas, including machine check
 handling and context switching of protection-key-related registers.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJcb3lEAAoJEJ2a6ncsY3GflNwH/2ezxhHv7CRy18d2D3F+Kna+
 YQs3V/pJfBRvVdV7ZLxnR03H/NmzAK3UOzRfqGodYUtbF+gUDqSuM27lAxMKrjBv
 S87X5g/1ZdiQNnqYK7PIBn75Tx27vnw2kJAif8rXTfqbj8qLUsXcNhsziA16sJOA
 azbD5PBp9mOVzTojawyriJ3H8LYqw+vinad0idvFrApFCuNmMxv56FR6H+IBadt7
 1UJyx6AegQACdhxvy0CzmZjzzXw02z9zeFUa4lakm2sORc4fbbyyZ68CtkGURg7A
 8rt2j9SGt649ExpjfG2Cz/UihMGIMXSAOrpqTZMfyd9UPzPgHeKx2FidnxASUBc=
 =PIT8
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next

PPC KVM update for 5.1

There are no major new features this time, just a collection of bug
fixes and improvements in various areas, including machine check
handling and context switching of protection-key-related registers.
2019-02-22 17:43:05 +01:00
Paul Mackerras 0a0c50f771 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
series of commits that touch both general arch/powerpc code and KVM
code.  These commits will be merged both via the KVM tree and the
powerpc tree.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 13:52:30 +11:00
Michael Ellerman c3c7470c75 powerpc/kvm: Save and restore host AMR/IAMR/UAMOR
When the hash MMU is active the AMR, IAMR and UAMOR are used for
pkeys. The AMR is directly writable by user space, and the UAMOR masks
those writes, meaning both registers are effectively user register
state. The IAMR is used to create an execute only key.

Also we must maintain the value of at least the AMR when running in
process context, so that any memory accesses done by the kernel on
behalf of the process are correctly controlled by the AMR.

Although we are correctly switching all registers when going into a
guest, on returning to the host we just write 0 into all regs, except
on Power9 where we restore the IAMR correctly.

This could be observed by a user process if it writes the AMR, then
runs a guest and we then return immediately to it without
rescheduling. Because we have written 0 to the AMR that would have the
effect of granting read/write permission to pages that the process was
trying to protect.

In addition, when using the Radix MMU, the AMR can prevent inadvertent
kernel access to userspace data, writing 0 to the AMR disables that
protection.

So save and restore AMR, IAMR and UAMOR.

Fixes: cf43d3b264 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 13:41:13 +11:00
Alexey Kardashevskiy 716cb11608 KVM: PPC: Book3S: Improve KVM reference counting
The anon fd's ops releases the KVM reference in the release hook.
However we reference the KVM object after we create the fd so there is
small window when the release function can be called and
dereferenced the KVM object which potentially may free it.

It is not a problem at the moment as the file is created and KVM is
referenced under the KVM lock and the release function obtains the same
lock before dereferencing the KVM (although the lock is not held when
calling kvm_put_kvm()) but it is potentially fragile against future changes.

This references the KVM object before creating a file.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 12:51:02 +11:00
Jordan Niethe e40542aff9 KVM: PPC: Book3S HV: Fix build failure without IOMMU support
Currently trying to build without IOMMU support will fail:

  (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
  (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
  (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
  (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'

This happens because turning off IOMMU support will prevent
book3s_64_vio_hv.c from being built because it is only built when
SPAPR_TCE_IOMMU is set, which depends on IOMMU support.

Fix it using ifdefs for the undefined references.

Fixes: 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-22 12:51:02 +11:00
Michael Ellerman e121ee6bc3 Merge branch 'topic/ppc-kvm' into next
Merge commits we're sharing with kvm-ppc tree.
2019-02-22 00:09:56 +11:00
Paul Mackerras c057720184 powerpc/64s: Better printing of machine check info for guest MCEs
This adds an "in_guest" parameter to machine_check_print_event_info()
so that we can avoid trying to translate guest NIP values into
symbolic form using the host kernel's symbol table.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21 23:16:45 +11:00
Paul Mackerras 884dfb722d KVM: PPC: Book3S HV: Simplify machine check handling
This makes the handling of machine check interrupts that occur inside
a guest simpler and more robust, with less done in assembler code and
in real mode.

Now, when a machine check occurs inside a guest, we always get the
machine check event struct and put a copy in the vcpu struct for the
vcpu where the machine check occurred.  We no longer call
machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
on POWER8, when a vcpu is running on an offline secondary thread and
we call machine_check_queue_event(), that calls irq_work_queue(),
which doesn't work because the CPU is offline, but instead triggers
the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
fires again and again because nothing clears the condition).

All that machine_check_queue_event() actually does is to cause the
event to be printed to the console.  For a machine check occurring in
the guest, we now print the event in kvmppc_handle_exit_hv()
instead.

The assembly code at label machine_check_realmode now just calls C
code and then continues exiting the guest.  We no longer either
synthesize a machine check for the guest in assembly code or return
to the guest without a machine check.

The code in kvmppc_handle_exit_hv() is extended to handle the case
where the guest is not FWNMI-capable.  In that case we now always
synthesize a machine check interrupt for the guest.  Previously, if
the host thinks it has recovered the machine check fully, it would
return to the guest without any notification that the machine check
had occurred.  If the machine check was caused by some action of the
guest (such as creating duplicate SLB entries), it is much better to
tell the guest that it has caused a problem.  Therefore we now always
generate a machine check interrupt for guests that are not
FWNMI-capable.

Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-21 23:16:44 +11:00
Michael Ellerman d976f6807e KVM: PPC: Book3S HV: Context switch AMR on Power9
kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
when guest and host are both running with the Radix MMU.

Currently in that path we don't save the host AMR (Authority Mask
Register) value, and we always restore 0 on return to the host. That
is OK at the moment because the AMR is not used for storage keys with
the Radix MMU.

However we plan to start using the AMR on Radix to prevent the kernel
from reading/writing to userspace outside of copy_to/from_user(). In
order to make that work we need to save/restore the AMR value.

We only restore the value if it is different from the guest value,
which is already in the register when we exit to the host. This should
mean we rarely need to actually restore the value when running a
modern Linux as a guest, because it will be using the same value as
us.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Russell Currey <ruscur@russell.cc>
2019-02-21 13:19:52 +11:00
Nir Weiner dee339b5c1 KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
grow_halt_poll_ns() have a strange behaviour in case
(vcpu->halt_poll_ns != 0) &&
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).

In this case, vcpu->halt_poll_ns will be multiplied by grow factor
(halt_poll_ns_grow) which will require several grow iteration in order
to reach a value bigger than halt_poll_ns_grow_start.
This means that growing vcpu->halt_poll_ns from value of 0 is slower
than growing it from a positive value less than halt_poll_ns_grow_start.
Which is misleading and inaccurate.

Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
to halt_poll_ns_grow_start in any case that
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).
Regardless if vcpu->halt_poll_ns is 0.

use READ_ONCE to get a consistent number for all cases.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:51 +01:00
Nir Weiner 49113d360b KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
start value when raising up vcpu->halt_poll_ns.
It actually sets the first timeout to the first polling session.
This value has significant effect on how tolerant we are to outliers.
On the standard case, higher value is better - we will spend more time
in the polling busyloop, handle events/interrupts faster and result
in better performance.
But on outliers it puts us in a busy loop that does nothing.
Even if the shrink factor is zero, we will still waste time on the first
iteration.
The optimal value changes between different workloads. It depends on
outliers rate and polling sessions length.
As this value has significant effect on the dynamic halt-polling
algorithm, it should be configurable and exposed.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:50 +01:00
Nir Weiner 7fa08e71b4 KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
grow_halt_poll_ns() have a strange behavior in case
(halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

In this case, vcpu->halt_pol_ns will be set to zero.
That results in shrinking instead of growing.

Fix issue by changing grow_halt_poll_ns() to not modify
vcpu->halt_poll_ns in case halt_poll_ns_grow is zero

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:50 +01:00
Suraj Jitindar Singh 8f1f7b9bed KVM: PPC: Book3S HV: Add KVM stat largepages_[2M/1G]
This adds an entry to the kvm_stats_debugfs directory which provides the
number of large (2M or 1G) pages which have been used to setup the guest
mappings, for radix guests.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
Alexey Kardashevskiy a67614cc05 KVM: PPC: Release all hardware TCE tables attached to a group
The SPAPR TCE KVM device references all hardware IOMMU tables assigned to
some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE
can work. The tables are references when an IOMMU group gets registered
with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl;
KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code
in kvm_spapr_tce_release_iommu_group() which walks through the list of
LIOBNs, finds a matching IOMMU table and calls kref_put() when found.

However that code stops after the very first successful derefencing
leaving other tables referenced till the SPAPR TCE KVM device is destroyed
which normally happens on guest reboot or termination so if we do hotplug
and unplug in a loop, we are leaking IOMMU tables here.

This removes a premature return to let kvm_spapr_tce_release_iommu_group()
find and dereference all attached tables.

Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
Suraj Jitindar Singh 1b6422574e KVM: PPC: Book3S HV: Optimise mmio emulation for devices on FAST_MMIO_BUS
Devices on the KVM_FAST_MMIO_BUS by definition have length zero and are
thus used for notification purposes rather than data transfer. For
example eventfd for virtio devices.

This means that when emulating mmio instructions which target devices on
this bus we can immediately handle them and return without needing to load
the instruction from guest memory.

For now we restrict this to stores as this is the only use case at
present.

For a normal guest the effect is negligible, however for a nested guest
we save on the order of 5us per access.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
Paul Mackerras 03f953329b KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE
Currently, the KVM code assumes that if the host kernel is using the
XIVE interrupt controller (the new interrupt controller that first
appeared in POWER9 systems), then the in-kernel XICS emulation will
use the XIVE hardware to deliver interrupts to the guest.  However,
this only works when the host is running in hypervisor mode and has
full access to all of the XIVE functionality.  It doesn't work in any
nested virtualization scenario, either with PR KVM or nested-HV KVM,
because the XICS-on-XIVE code calls directly into the native-XIVE
routines, which are not initialized and cannot function correctly
because they use OPAL calls, and OPAL is not available in a guest.

This means that using the in-kernel XICS emulation in a nested
hypervisor that is using XIVE as its interrupt controller will cause a
(nested) host kernel crash.  To fix this, we change most of the places
where the current code calls xive_enabled() to select between the
XICS-on-XIVE emulation and the plain XICS emulation to call a new
function, xics_on_xive(), which returns false in a guest.

However, there is a further twist.  The plain XICS emulation has some
functions which are used in real mode and access the underlying XICS
controller (the interrupt controller of the host) directly.  In the
case of a nested hypervisor, this means doing XICS hypercalls
directly.  When the nested host is using XIVE as its interrupt
controller, these hypercalls will fail.  Therefore this also adds
checks in the places where the XICS emulation wants to access the
underlying interrupt controller directly, and if that is XIVE, makes
the code use the virtual mode fallback paths, which call generic
kernel infrastructure rather than doing direct XICS access.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
Masahiro Yamada f1adb9c48a KVM: PPC: Remove -I. header search paths
The header search path -I. in kernel Makefiles is very suspicious;
it allows the compiler to search for headers in the top of $(srctree),
where obviously no header file exists.

Commit 46f43c6ee0 ("KVM: powerpc: convert marker probes to event
trace") first added these options, but they are completely useless.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
wangbo 08434ab469 KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_node
Replace kmalloc_node and memset with kzalloc_node

Signed-off-by: wangbo <wang.bo116@zte.com.cn>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 16:00:15 +11:00
Paul Mackerras 41a8645ab1 KVM: PPC: Book3S PR: Add emulation for slbfee. instruction
Recent kernels, since commit e15a4fea4d ("powerpc/64s/hash: Add
some SLB debugging tests", 2018-10-03) use the slbfee. instruction,
which PR KVM currently does not have code to emulate.  Consequently
recent kernels fail to boot under PR KVM.  This adds emulation of
slbfee., enabling these kernels to boot successfully.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-02-19 14:25:52 +11:00
Masahiro Yamada c142e9741e KVM: powerpc: remove -I. header search paths
The header search path -I. in kernel Makefiles is very suspicious;
it allows the compiler to search for headers in the top of $(srctree),
where obviously no header file exists.

Commit 46f43c6ee0 ("KVM: powerpc: convert marker probes to event
trace") first added these options, but they are completely useless.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-01-14 20:39:27 +11:00
Michael Ellerman d538d94f0c Merge branch 'master' into fixes
We have a fix to apply on top of commit 96d4f267e4 ("Remove 'type'
argument from access_ok() function"), so merge master to get it.
2019-01-04 22:07:47 +11:00
Linus Torvalds 96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Michael Ellerman f460772291 KVM: PPC: Book3S HV: radix: Fix uninitialized var build error
Old GCCs (4.6.3 at least), aren't able to follow the logic in
__kvmhv_copy_tofrom_guest_radix() and warn that old_pid is used
uninitialized:

  arch/powerpc/kvm/book3s_64_mmu_radix.c:75:3: error: 'old_pid' may be
  used uninitialized in this function

The logic is OK, we only use old_pid if quadrant == 1, and in that
case it has definitely be initialised, eg:

	if (quadrant == 1) {
		old_pid = mfspr(SPRN_PID);
	...
	if (quadrant == 1 && pid != old_pid)
		mtspr(SPRN_PID, old_pid);

Annotate it to fix the error.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-30 14:05:13 +11:00
Linus Torvalds 769e47094d Kconfig updates for v4.21
- support -y option for merge_config.sh to avoid downgrading =y to =m
 
  - remove S_OTHER symbol type, and touch include/config/*.h files correctly
 
  - fix file name and line number in lexer warnings
 
  - fix memory leak when EOF is encountered in quotation
 
  - resolve all shift/reduce conflicts of the parser
 
  - warn no new line at end of file
 
  - make 'source' statement more strict to take only string literal
 
  - rewrite the lexer and remove the keyword lookup table
 
  - convert to SPDX License Identifier
 
  - compile C files independently instead of including them from zconf.y
 
  - fix various warnings of gconfig
 
  - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJcJieuAAoJED2LAQed4NsGHlIP/1s0fQ86XD9dIMyHzAO0gh2f
 7rylfe2kEXJgIzJ0DyZdLu4iZtwbkEUqTQrRS1abriNGVemPkfBAnZdM5d92lOQX
 3iREa700AJ2xo7V7gYZ6AbhZoG3p0S9U9Q2qE5S+tFTe8c2Gy4xtjnODF+Vel85r
 S0P8tF5sE1/d00lm+yfMI/CJVfDjyNaMm+aVEnL0kZTPiRkaktjWgo6Fc2p4z1L5
 HFmMMP6/iaXmRZ+tHJGPQ2AT70GFVZw5ePxPcl50EotUP25KHbuUdzs8wDpYm3U/
 rcESVsIFpgqHWmTsdBk6dZk0q8yFZNkMlkaP/aYukVZpUn/N6oAXgTFckYl8dmQL
 fQBkQi6DTfr9EBPVbj18BKm7xI3Y4DdQ2fzTfYkJ2XwNRGFA5r9N3sjd7ZTVGjxC
 aeeMHCwvGdSx1x8PeZAhZfsUHW8xVDMSQiT713+ljBY+6cwzA+2NF0kP7B6OAqwr
 ETFzd4Xu2/lZcL7gQRH8WU3L2S5iedmDG6RnZgJMXI0/9V4qAA+nlsWaCgnl1TgA
 mpxYlLUMrd6AUJevE34FlnyFdk8IMn9iKRFsvF0f3doO5C7QzTVGqFdJu5a0CuWO
 4NBJvZjFT8/4amoWLfnDlfApWXzTfwLbKG+r6V2F30fLuXpYg5LxWhBoGRPYLZSq
 oi4xN1Mpx3TvXz6WcKVZ
 =r3Fl
 -----END PGP SIGNATURE-----

Merge tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kconfig updates from Masahiro Yamada:

 - support -y option for merge_config.sh to avoid downgrading =y to =m

 - remove S_OTHER symbol type, and touch include/config/*.h files correctly

 - fix file name and line number in lexer warnings

 - fix memory leak when EOF is encountered in quotation

 - resolve all shift/reduce conflicts of the parser

 - warn no new line at end of file

 - make 'source' statement more strict to take only string literal

 - rewrite the lexer and remove the keyword lookup table

 - convert to SPDX License Identifier

 - compile C files independently instead of including them from zconf.y

 - fix various warnings of gconfig

 - misc cleanups

* tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
  kconfig: surround dbg_sym_flags with #ifdef DEBUG to fix gconf warning
  kconfig: split images.c out of qconf.cc/gconf.c to fix gconf warnings
  kconfig: add static qualifiers to fix gconf warnings
  kconfig: split the lexer out of zconf.y
  kconfig: split some C files out of zconf.y
  kconfig: convert to SPDX License Identifier
  kconfig: remove keyword lookup table entirely
  kconfig: update current_pos in the second lexer
  kconfig: switch to ASSIGN_VAL state in the second lexer
  kconfig: stop associating kconf_id with yylval
  kconfig: refactor end token rules
  kconfig: stop supporting '.' and '/' in unquoted words
  treewide: surround Kconfig file paths with double quotes
  microblaze: surround string default in Kconfig with double quotes
  kconfig: use T_WORD instead of T_VARIABLE for variables
  kconfig: use specific tokens instead of T_ASSIGN for assignments
  kconfig: refactor scanning and parsing "option" properties
  kconfig: use distinct tokens for type and default properties
  kconfig: remove redundant token defines
  kconfig: rename depends_list to comment_option_list
  ...
2018-12-29 13:03:29 -08:00
Linus Torvalds 8d6973327e powerpc updates for 4.21
Notable changes:
 
  - Mitigations for Spectre v2 on some Freescale (NXP) CPUs.
 
  - A large series adding support for pass-through of Nvidia V100 GPUs to guests
    on Power9.
 
  - Another large series to enable hardware assistance for TLB table walk on
    MPC8xx CPUs.
 
  - Some preparatory changes to our DMA code, to make way for further cleanups
    from Christoph.
 
  - Several fixes for our Transactional Memory handling discovered by fuzzing the
    signal return path.
 
  - Support for generating our system call table(s) from a text file like other
    architectures.
 
  - A fix to our page fault handler so that instead of generating a WARN_ON_ONCE,
    user accesses of kernel addresses instead print a ratelimited and
    appropriately scary warning.
 
  - A cosmetic change to make our unhandled page fault messages more similar to
    other arches and also more compact and informative.
 
  - Freescale updates from Scott:
    "Highlights include elimination of legacy clock bindings use from dts
     files, an 83xx watchdog handler, fixes to old dts interrupt errors, and
     some minor cleanup."
 
 And many clean-ups, reworks and minor fixes etc.
 
 Thanks to:
  Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
  Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao, Christian Lamparter,
  Christophe Leroy, Christoph Hellwig, Daniel Axtens, Darren Stevens, David
  Gibson, Diana Craciun, Dmitry V. Levin, Firoz Khan, Geert Uytterhoeven, Greg
  Kurz, Gustavo Romero, Hari Bathini, Joel Stanley, Kees Cook, Madhavan
  Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal
  Suchánek, Naveen N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras,
  Ram Pai, Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
  Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen Rothwell,
  Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian Tang, Yue Haibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcJLwZAAoJEFHr6jzI4aWAAv4P/jMvP52lA90i2E8G72LOVSF1
 33DbE/Okib3VfmmMcXZpgpEfwIcEmJcIj86WWcLWzBfXLunehkgwh+AOfBLwqWch
 D08+RR9EZb7ppvGe91hvSgn4/28CWVKAxuDviSuoE1OK8lOTncu889r2+AxVFZiY
 f6Al9UPlB3FTJonNx8iO4r/GwrPigukjbzp1vkmJJg59LvNUrMQ1Fgf9D3cdlslH
 z4Ff9zS26RJy7cwZYQZI4sZXJZmeQ1DxOZ+6z6FL/nZp/O4WLgpw6C6o1+vxo1kE
 9ZnO/3+zIRhoWiXd6OcOQXBv3NNCjJZlXh9HHAiL8m5ZqbmxrStQWGyKW/jjEZuK
 wVHxfUT19x9Qy1p+BH3XcUNMlxchYgcCbEi5yPX2p9ZDXD6ogNG7sT1+NO+FBTww
 ueCT5PCCB/xWOccQlBErFTMkFXFLtyPDNFK7BkV7uxbH0PQ+9guCvjWfBZti6wjD
 /6NK4mk7FpmCiK13Y1xjwC5OqabxLUYwtVuHYOMr5TOPh8URUPS4+0pIOdoYDM6z
 Ensrq1CC843h59MWADgFHSlZ78FRtZlG37JAXunjLbqGupLOvL7phC9lnwkylHga
 2hWUWFeOV8HFQBP4gidZkLk64pkT9LzqHgdgIB4wUwrhc8r2mMZGdQTq5H7kOn3Q
 n9I48PWANvEC0PBCJ/KL
 =cr6s
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Mitigations for Spectre v2 on some Freescale (NXP) CPUs.

   - A large series adding support for pass-through of Nvidia V100 GPUs
     to guests on Power9.

   - Another large series to enable hardware assistance for TLB table
     walk on MPC8xx CPUs.

   - Some preparatory changes to our DMA code, to make way for further
     cleanups from Christoph.

   - Several fixes for our Transactional Memory handling discovered by
     fuzzing the signal return path.

   - Support for generating our system call table(s) from a text file
     like other architectures.

   - A fix to our page fault handler so that instead of generating a
     WARN_ON_ONCE, user accesses of kernel addresses instead print a
     ratelimited and appropriately scary warning.

   - A cosmetic change to make our unhandled page fault messages more
     similar to other arches and also more compact and informative.

   - Freescale updates from Scott:
       "Highlights include elimination of legacy clock bindings use from
        dts files, an 83xx watchdog handler, fixes to old dts interrupt
        errors, and some minor cleanup."

  And many clean-ups, reworks and minor fixes etc.

  Thanks to: Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan,
  Aneesh Kumar K.V, Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao,
  Christian Lamparter, Christophe Leroy, Christoph Hellwig, Daniel
  Axtens, Darren Stevens, David Gibson, Diana Craciun, Dmitry V. Levin,
  Firoz Khan, Geert Uytterhoeven, Greg Kurz, Gustavo Romero, Hari
  Bathini, Joel Stanley, Kees Cook, Madhavan Srinivasan, Mahesh
  Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal Suchánek, Naveen
  N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Ram Pai,
  Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
  Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen
  Rothwell, Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian
  Tang, Yue Haibing"

* tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (201 commits)
  Revert "powerpc/fsl_pci: simplify fsl_pci_dma_set_mask"
  powerpc/zImage: Also check for stdout-path
  powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y
  macintosh: Use of_node_name_{eq, prefix} for node name comparisons
  ide: Use of_node_name_eq for node name comparisons
  powerpc: Use of_node_name_eq for node name comparisons
  powerpc/pseries/pmem: Convert to %pOFn instead of device_node.name
  powerpc/mm: Remove very old comment in hash-4k.h
  powerpc/pseries: Fix node leak in update_lmb_associativity_index()
  powerpc/configs/85xx: Enable CONFIG_DEBUG_KERNEL
  powerpc/dts/fsl: Fix dtc-flagged interrupt errors
  clk: qoriq: add more compatibles strings
  powerpc/fsl: Use new clockgen binding
  powerpc/83xx: handle machine check caused by watchdog timer
  powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved"
  powerpc/fsl_pci: simplify fsl_pci_dma_set_mask
  arch/powerpc/fsl_rmu: Use dma_zalloc_coherent
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver
  vfio_pci: Allow regions to add own capabilities
  vfio_pci: Allow mapping extra regions
  ...
2018-12-27 10:43:24 -08:00
Masahiro Yamada 8636a1f967 treewide: surround Kconfig file paths with double quotes
The Kconfig lexer supports special characters such as '.' and '/' in
the parameter context. In my understanding, the reason is just to
support bare file paths in the source statement.

I do not see a good reason to complicate Kconfig for the room of
ambiguity.

The majority of code already surrounds file paths with double quotes,
and it makes sense since file paths are constant string literals.

Make it treewide consistent now.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-12-22 00:25:54 +09:00
Paolo Bonzini c6ad459733 Second PPC KVM update for 4.21
This has 5 commits that fix page dirty tracking when running nested
 HV KVM guests, from Suraj Jitindar Singh.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJcHGnCAAoJEJ2a6ncsY3GfaMQIAKP1KQ/SzE38gORRjdf4aUTf
 8X6+B6bAjVPwu0OgR2xoU4hPT8vpViG8gIjlspDm/v1igRlokz5GeCEXXlttQwK6
 5JFfqS+psCQ/Z/bPFo0uW7cOojeDeE3s1Vd3XXjMH79T6Mvpg54fYutvxd+qbjW6
 gAl/jGK6xxo1XsaQfySeSSLA3b8ibI77mjnPBgvbSHbrxBAIjoAfqKTVSjrPwR2b
 ZvHyCbaoX2uOGyWrw9O73CqCgsiXEOQvr8dLEjsT7a+brrJBO4mv20s6DzlpC1lS
 YVHd8GAfk44h1fxsNXsj9eEUIcRMEB4fYYOd/u0TECdeyFFWz+8JjBhm6u0TRck=
 =Zrya
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-4.21-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next

Second PPC KVM update for 4.21

This has 5 commits that fix page dirty tracking when running nested
HV KVM guests, from Suraj Jitindar Singh.
2018-12-21 11:48:41 +01:00
Lan Tianyu 748c0e312f KVM: Make kvm_set_spte_hva() return int
The patch is to make kvm_set_spte_hva() return int and caller can
check return value to determine flush tlb or not.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-21 11:28:41 +01:00
Alexey Kardashevskiy c10c21efa4 powerpc/vfio/iommu/kvm: Do not pin device memory
This new memory does not have page structs as it is not plugged to
the host so gup() will fail anyway.

This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of thise
new regions which we must avoid unpinning of.

This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().

This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
does delayed pages dirtying.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-21 16:20:46 +11:00
Suraj Jitindar Singh ae59a7e194 KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host
The rc bits contained in ptes are used to track whether a page has been
accessed and whether it is dirty. The accessed bit is used to age a page
and the dirty bit to track whether a page is dirty or not.

Now that we support nested guests there are three ptes which track the
state of the same page:
- The partition-scoped page table in the L1 guest, mapping L2->L1 address
- The partition-scoped page table in the host for the L1 guest, mapping
  L1->L0 address
- The shadow partition-scoped page table for the nested guest in the host,
  mapping L2->L0 address

The idea is to attempt to keep the rc state of these three ptes in sync,
both when setting and when clearing rc bits.

When setting the bits we achieve consistency by:
- Initially setting the bits in the shadow page table as the 'and' of the
  other two.
- When updating in software the rc bits in the shadow page table we
  ensure the state is consistent with the other two locations first, and
  update these before reflecting the change into the shadow page table.
  i.e. only set the bits in the L2->L0 pte if also set in both the
       L2->L1 and the L1->L0 pte.

When clearing the bits we achieve consistency by:
- The rc bits in the shadow page table are only cleared when discarding
  a pte, and we don't need to record this as if either bit is set then
  it must also be set in the pte mapping L1->L0.
- When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
  tlbie instruction
  - This means we will discard the pte from the shadow page table
    meaning the mapping will have to be setup again.
  - When setup the pte again in the shadow page table we will ensure
    consistency with the L2->L1 pte.
- When the host clears an rc bit in the L1->L0 mapping we need to also
  clear the bit in any ptes in the shadow page table which map the same
  gfn so we will be notified if a nested guest accesses the page.
  This case is what this patch specifically concerns.
  - We can search the nest_rmap list for that given gfn and clear the
    same bit from all corresponding ptes in shadow page tables.
  - If a nested guest causes either of the rc bits to be set by software
    in future then we will update the L1->L0 pte and maintain consistency.

With the process outlined above we aim to maintain consistency of the 3
pte locations where we track rc for a given guest page.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21 14:42:07 +11:00
Suraj Jitindar Singh 90165d3da0 KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list()
Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
nest_rmap list will traverse it, find the corresponding pte in the shadow
page tables, and if it still maps the same host page update the rc bits
accordingly.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21 14:39:35 +11:00
Suraj Jitindar Singh 8b23eee4e5 KVM: PPC: Book3S HV: Apply combination of host and l1 pte rc for nested guest
The shadow page table contains ptes for translations from nested guest
address to host address. Currently when creating these ptes we take the
rc bits from the pte for the L1 guest address to host address
translation. This is incorrect as we must also factor in the rc bits
from the pte for the nested guest address to L1 guest address
translation (as contained in the L1 guest partition table for the nested
guest).

By not calculating these bits correctly L1 may not have been correctly
notified when it needed to update its rc bits in the partition table it
maintains for its nested guest.

Modify the code so that the rc bits in the resultant pte for the L2->L0
translation are the 'and' of the rc bits in the L2->L1 pte and the L1->L0
pte, also accounting for whether this was a write access when setting
the dirty bit.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21 14:37:43 +11:00
Suraj Jitindar Singh 8400f87406 KVM: PPC: Book3S HV: Align gfn to L1 page size when inserting nest-rmap entry
Nested rmap entries are used to store the translation from L1 gpa to L2
gpa when entries are inserted into the shadow (nested) page tables. This
rmap list is located by indexing the rmap array in the memslot by L1
gfn. When we come to search for these entries we only know the L1 page size
(which could be PAGE_SIZE, 2M or a 1G page) and so can only select a gfn
aligned to that size. This means that when we insert the entry, so we can
find it later, we need to align the gfn we use to select the rmap list
in which to insert the entry to L1 page size as well.

By not doing this we were missing nested rmap entries when modifying L1
ptes which were for a page also passed through to an L2 guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21 14:37:43 +11:00
Suraj Jitindar Singh bec6e03b5e KVM: PPC: Book3S HV: Hold kvm->mmu_lock across updating nested pte rc bits
We already hold the kvm->mmu_lock spin lock across updating the rc bits
in the pte for the L1 guest. Continue to hold the lock across updating
the rc bits in the pte for the nested guest as well to prevent
invalidations from occurring.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-21 14:37:43 +11:00
Radim Krčmář cfdfaf4a86 PPC KVM update for 4.21
The main new feature this time is support in HV nested KVM for passing
 a device that is emulated by a level 0 hypervisor and presented to
 level 1 as a PCI device through to a level 2 guest using VFIO.
 
 Apart from that there are improvements for migration of radix guests
 under HV KVM and some other fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJcGFzEAAoJEJ2a6ncsY3GfKjoH/Azcf8QIO5ftyHrjazFZOSUh
 5Lr24HZTYHheowp6obzuZWRAIyckHmflRmOkv8RVGuA8+Sp+m5pBxN3WTVPOwDUh
 WanOWVGJsuhl6qATmkm7xIxmYhQEyLxVNbnWva7WXuZ92rgGCNfHtByHWAx/7vTe
 q5Shr4fLIQ8HRzor8Xqqph1I0hQNTE9VsaK1hW/PxI0gsO8qjDwOR8SDpT/aaJrS
 Sir+lM0TwCbJREuObDxYAXn1OWy8rMYjlb9fEBv5tmPCQKiB9vJz4tV+ahR9eJ14
 PEF57MoBOGwzQXo4geFLuo/Bu8fDygKsKQX1eYGcn6tRGA4pnTxzYl0+dHLBkOM=
 =3WkD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-next-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

PPC KVM update for 4.21 from Paul Mackerras

The main new feature this time is support in HV nested KVM for passing
a device that is emulated by a level 0 hypervisor and presented to
level 1 as a PCI device through to a level 2 guest using VFIO.

Apart from that there are improvements for migration of radix guests
under HV KVM and some other fixes and cleanups.
2018-12-20 14:54:09 +01:00
Diana Craciun e7aa61f47b powerpc/fsl: Flush branch predictor when entering KVM
Switching from the guest to host is another place
where the speculative accesses can be exploited.
Flush the branch predictor when entering KVM.

Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-20 22:59:03 +11:00
Diana Craciun 98518c4d87 powerpc/fsl: Emulate SPRN_BUCSR register
In order to flush the branch predictor the guest kernel performs
writes to the BUCSR register which is hypervisor privilleged. However,
the branch predictor is flushed at each KVM entry, so the branch
predictor has been already flushed, so just return as soon as possible
to guest.

Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
[mpe: Tweak comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-20 22:59:03 +11:00
Suraj Jitindar Singh 95d386c2d2 KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest
Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh 6ff887b8bd KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2
A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh 873db2cd9a KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest
Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.

The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.

To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.

Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh cc6929cc84 KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.

This method of access however only works for L1 guests and will give the
incorrect results for a nested guest.

We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guesti (and a L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.

At this stage there is no fall back method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh dceadcf91b KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrant 1 and 2.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh d7b4561522 KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant		| Hypervisor		| Guest
--------------------------------------------------------------------------
			| EA[0:1] = 0b00	| EA[0:1] = 0b00
0			| effLPID = 0		| effLPID = LPIDR
			| effPID  = PIDR	| effPID  = PIDR
--------------------------------------------------------------------------
			| EA[0:1] = 0b01	|
1			| effLPID = LPIDR	| Invalid Access
			| effPID  = PIDR	|
--------------------------------------------------------------------------
			| EA[0:1] = 0b10	|
2			| effLPID = LPIDR	| Invalid Access
			| effPID  = 0		|
--------------------------------------------------------------------------
			| EA[0:1] = 0b11	| EA[0:1] = 0b11
3			| effLPID = 0		| effLPID = LPIDR
			| effPID  = 0		| effPID  = 0
--------------------------------------------------------------------------

In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:50 +11:00
Suraj Jitindar Singh d232afebf9 KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for if the vcpu is running a nested guest.

Currently all nested guests must be radix but this may change in the
future.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:49 +11:00
Suraj Jitindar Singh 693ac10a88 KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in kernel tce acceleration for vfio. However it is
currently the case that this is only available on a powernv machine,
not for a pseries machine.

Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.

[paulus@ozlabs.org - fixed compilation for Book E.]

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 11:33:49 +11:00
Paul Mackerras 5af3e9d06d KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off
This adds code to flush the partition-scoped page tables for a radix
guest when dirty tracking is turned on or off for a memslot.  Only the
guest real addresses covered by the memslot are flushed.  The reason
for this is to get rid of any 2M PTEs in the partition-scoped page
tables that correspond to host transparent huge pages, so that page
dirtiness is tracked at a system page (4k or 64k) granularity rather
than a 2M granularity.  The page tables are also flushed when turning
dirty tracking off so that the memslot's address space can be
repopulated with THPs if possible.

To do this, we add a new function kvmppc_radix_flush_memslot().  Since
this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
guest, we now make kvmppc_core_flush_memslot_hv() call the new
kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
for each page in the memslot.  This has the effect of fixing a bug in
that kvmppc_core_flush_memslot_hv() was previously calling
kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
is required to be held.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 10:58:51 +11:00
Paul Mackerras c43c3a8683 KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments
This adds 'const' to the declarations for the struct kvm_memory_slot
pointer parameters of some functions, which will make it possible to
call those functions from kvmppc_core_commit_memory_region_hv()
in the next patch.

This also fixes some comments about locking.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 10:58:43 +11:00
Paul Mackerras f460f6791a KVM: PPC: Book3S HV: Map single pages when doing dirty page logging
For radix guests, this makes KVM map guest memory as individual pages
when dirty page logging is enabled for the memslot corresponding to the
guest real address.  Having a separate partition-scoped PTE for each
system page mapped to the guest means that we have a separate dirty
bit for each page, thus making the reported dirty bitmap more accurate.
Without this, if part of guest memory is backed by transparent huge
pages, the dirty status is reported at a 2MB granularity rather than
a 64kB (or 4kB) granularity for that part, causing userspace to have
to transmit more data when migrating the guest.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 10:58:33 +11:00
Bharata B Rao f032b73459 KVM: PPC: Pass change type down to memslot commit function
Currently, kvm_arch_commit_memory_region() gets called with a
parameter indicating what type of change is being made to the memslot,
but it doesn't pass it down to the platform-specific memslot commit
functions.  This adds the `change' parameter to the lower-level
functions so that they can use it in future.

[paulus@ozlabs.org - fix book E also.]

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-17 10:57:27 +11:00
Paolo Bonzini e5d83c74a5 kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic
The first such capability to be handled in virt/kvm/ will be manual
dirty page reprotection.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14 12:34:18 +01:00
Suraj Jitindar Singh 6142236cd9 KVM: PPC: Book3S PR: Set hflag to indicate that POWER9 supports 1T segments
When booting a kvm-pr guest on a POWER9 machine the following message is
observed:
"qemu-system-ppc64: KVM does not support 1TiB segments which guest expects"

This is because the guest is expecting to be able to use 1T segments
however we don't indicate support for it. This is because we don't set
the BOOK3S_HFLAG_MULTI_PGSIZE flag in the hflags in kvmppc_set_pvr_pr()
on POWER9.

POWER9 does indeed have support for 1T segments, so add a case for
POWER9 to the switch statement to ensure it is set.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-14 15:39:47 +11:00
Yangtao Li 0f6ddf34be KVM: PPC: Book3S HV: Change to use DEFINE_SHOW_ATTRIBUTE macro
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-14 15:39:47 +11:00
Paul Mackerras 234ff0b729 KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch
Testing has revealed an occasional crash which appears to be caused
by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
The symptom is a NULL pointer dereference in __find_linux_pte() called
from kvm_unmap_radix() with kvm->arch.pgtable == NULL.

Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
kvm->arch.pgtable (via kvmppc_free_radix()) before setting
kvm->arch.radix to NULL, and there is nothing to prevent
kvm_unmap_hva_range_hv() or the other MMU callback functions from
being called concurrently with kvmppc_switch_mmu_to_hpt() or
kvmppc_switch_mmu_to_radix().

This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
around the assignments to kvm->arch.radix, and makes sure that the
partition-scoped radix tree or HPT is only freed after changing
kvm->arch.radix.

This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
that the clearing of each rmap array (one per memslot) doesn't happen
concurrently with use of the array in the kvm_unmap_hva_range_hv()
or the other MMU callbacks.

Fixes: 18c3640cef ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-12-14 15:33:15 +11:00
Christophe Leroy 994da93d19 powerpc/mm: move platform specific mmu-xxx.h in platform directories
The purpose of this patch is to move platform specific
mmu-xxx.h files in platform directories like pte-xxx.h files.

In the meantime this patch creates common nohash and
nohash/32 + nohash/64 mmu.h files for future common parts.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-12-04 19:45:01 +11:00
Paolo Bonzini caf54f59e5 PPC KVM fixes for 4.20
This has a single 1-line patch which fixes a bug in the recently-merged
 nested HV KVM support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJb7pbhAAoJEJ2a6ncsY3Gf8UIIAKgiocLz4jTrWYaR/OVbg6EY
 tSJQBbsi6bEAog/FZMWDG0zL0YB4s+wXu34RiTt/P7g0VzHFTmR6ZHIJPiSd78aH
 oxe8H7TOVq8/EmD0TwREVgUe1qIHgLBkVkk05b0P0nlpeO5bzWQBco2No2mfKWOq
 yZcK03QWBsVaq0xhZFM/c0SkxBYOIDcm1kG+XNpOcsmWGXin96TlK+2WohOIH5nY
 +16vI61n7/jBjdoxQS0Lw8OAfsA8CjY9GaKf3MuFYe93anZUv2s8FrAv35qUwzBg
 5/Y/f+EB5AKMf3XR2A8nJ6HmoeXUFu4NUxT1YAQPAUcrxkENcsaRHDe2Uwt1QIk=
 =iPcL
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.20-1' of https://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM fixes for 4.20

This has a single 1-line patch which fixes a bug in the recently-merged
nested HV KVM support.
2018-11-25 18:56:32 +01:00
Michael Roth 6c08ec1216 KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
pending signal in the L0 QEMU process can generate the following
sequence:

  ret0 = kvmppc_pseries_do_hcall()
    ret1 = kvmhv_enter_nested_guest()
      ret2 = kvmhv_run_single_vcpu()
      if (ret2 == -EINTR)
        return H_INTERRUPT
    if (ret1 == H_INTERRUPT)
      kvmppc_set_gpr(vcpu, 3, 0)
      return -EINTR
    /* skipped: */
    kvmppc_set_gpr(vcpu, 3, ret)
    vcpu->arch.hcall_needed = 0
    return RESUME_GUEST

which causes an exit to L0 userspace with ret0 == -EINTR.

The intention seems to be to set the hcall return value to 0 (via
VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
once we resume executing the VCPU. However, because we don't set
vcpu->arch.hcall_needed = 0, we do the following once userspace
resumes execution via kvm_arch_vcpu_ioctl_run():

  ...
  } else if (vcpu->arch.hcall_needed) {
    int i

    kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
    for (i = 0; i < 9; ++i)
           kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
    vcpu->arch.hcall_needed = 0;

since vcpu->arch.hcall_needed == 1 indicates that userspace should
have handled the hcall and stored the return value in
run->papr_hcall.ret. Since that's not the case here, we can get an
unexpected value in VCPU r3, which can result in
kvmhv_p9_guest_entry() reporting an unexpected trap value when it
returns from H_ENTER_NESTED, causing the following register dump to
console via subsequent call to kvmppc_handle_exit_hv() in L1:

  [  350.612854] vcpu 00000000f9564cf8 (0):
  [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
  [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
  [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
  [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
  [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
  [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
  [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
  [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
  [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
  [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
  [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
  [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
  [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
  [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
  [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
  [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
  [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
  [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
  [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
  [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
  [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
  [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
  [  350.613962] dar = 0000772f95390000
  [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
  [  350.614073] SLB (0 entries):
  [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
  [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033

followed by L1's QEMU reporting the following before stopping execution
of the nested guest:

  KVM: unknown exit, hardware reason 1
  NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
  MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
  TB 00000000 00000000 DECR 00000000
  GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
  GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
  GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
  GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
  GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
  GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
  GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
  GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
  CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
   SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
  SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
  SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
  HSRR0 0000000000000000 HSRR1 0000000000000000
   CFAR 0000000000000000
   LPCR 0000000003d40413
   PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000

Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
of H_ENTER_NESTED before we exit to L0 userspace.

Fixes: 360cae3137 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
Cc: linuxppc-dev@ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-11-15 13:59:21 +11:00
Scott Wood 28c5bcf74f KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE
TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
<trace/define_trace.h>, so like that #include, they should
be outside #ifdef protection.

They also need to be #undefed before defining, in case multiple trace
headers are included by the same C file.  This became the case on
book3e after commit cf4a608515 ("powerpc/mm: Add missing tracepoint for
tlbie"), leading to the following build error:

   CC      arch/powerpc/kvm/powerpc.o
In file included from arch/powerpc/kvm/powerpc.c:51:0:
arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
[-Werror]
  #define TRACE_INCLUDE_PATH .
  ^
In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
                  from arch/powerpc/kvm/powerpc.c:48:
./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
the previous definition
  #define TRACE_INCLUDE_PATH asm
  ^
cc1: all warnings being treated as errors

Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-11-07 23:04:38 +11:00
Linus Torvalds b69f9e17a5 powerpc fixes for 4.20 #2
Some things that I missed due to travel, or that came in late.
 
 Two fixes also going to stable:
 
  - A revert of a buggy change to the 8xx TLB miss handlers.
 
  - Our flushing of SPE (Signal Processing Engine) registers on fork was broken.
 
 Other changes:
 
  - A change to the KVM decrementer emulation to use proper APIs.
 
  - Some cleanups to the way we do code patching in the 8xx code.
 
  - Expose the maximum possible memory for the system in /proc/powerpc/lparcfg.
 
  - Merge some updates from Scott: "a couple device tree updates, and a fix for a
    missing prototype warning."
 
 A few other minor fixes and a handful of fixes for our selftests.
 
 Thanks to:
   Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe Leroy, Felipe Rechia,
   Joel Stanley, Naveen N. Rao, Paul Mackerras, Scott Wood, Tyrel Datwyler.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb3C+uAAoJEFHr6jzI4aWAPJgQAIX0aD/PiYfEUI/rm/Q0vnJI
 HO3FCKroi+LVF/URU24+NLA/1KGCBfO9by9m6D/nnmHl+vi2P69fFgokywO0Ajru
 nf9a+9Gx53IbO7EEUf1fZVwxCMobBqU8eWq1hIBndd5HTz9QEftc/RXgpgqZcQ/x
 x8xN1FNMSUT9NwMk750QDO7CFBrSfSjFC+/WrkViBaMiRWx2rwle+2tDipQ/fegY
 Tsu4wg0qEzWMT//MFP4yOlkcLV8M6d2Sw65Km59rWHA1I2wqsTek1yQ68Epo9Co0
 RdJh9Nt1kjLC5XXteneFhe18UUPKRmrXbYDFByw5CUhs5VI99Dq4w5kamh197XLr
 +jA3XHAeAyaXf21I9zmmZXbhHanowCPZGyzZqZXWJ86bVJp5v328wXmnxtKrb0Nz
 pH7fjQ6zjzsZgIcN9i2CFpIvuDQ/z1A/QyHdBnRvJ8HoXlTerZCn22JTgY7d2VJu
 XJn1n+VABG2BrzJexW/7quY3Z7V6tvdkloWwOA3PdAwkcoImd4BfYq9K2DU//diN
 BnXPnDs2K7JwDG9s+cgUEHnrP6DOKsxT+mmYpqXf0Ta0wtyoZJ1zgSdfZAUOmjnb
 MhcK46l+F4E891qjnsuuVNnspqI4yPMLmAGmife5OUrfcoFdm/vFbM75FaXameBx
 cMOmidrJJhO6z5eWSKvO
 =VCkW
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some things that I missed due to travel, or that came in late.

  Two fixes also going to stable:

   - A revert of a buggy change to the 8xx TLB miss handlers.

   - Our flushing of SPE (Signal Processing Engine) registers on fork
     was broken.

  Other changes:

   - A change to the KVM decrementer emulation to use proper APIs.

   - Some cleanups to the way we do code patching in the 8xx code.

   - Expose the maximum possible memory for the system in
     /proc/powerpc/lparcfg.

   - Merge some updates from Scott: "a couple device tree updates, and a
     fix for a missing prototype warning"

  A few other minor fixes and a handful of fixes for our selftests.

  Thanks to: Aravinda Prasad, Breno Leitao, Camelia Groza, Christophe
  Leroy, Felipe Rechia, Joel Stanley, Naveen N. Rao, Paul Mackerras,
  Scott Wood, Tyrel Datwyler"

* tag 'powerpc-4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (21 commits)
  selftests/powerpc: Fix compilation issue due to asm label
  selftests/powerpc/cache_shape: Fix out-of-tree build
  selftests/powerpc/switch_endian: Fix out-of-tree build
  selftests/powerpc/pmu: Link ebb tests with -no-pie
  selftests/powerpc/signal: Fix out-of-tree build
  selftests/powerpc/ptrace: Fix out-of-tree build
  powerpc/xmon: Relax frame size for clang
  selftests: powerpc: Fix warning for security subdir
  selftests/powerpc: Relax L1d miss targets for rfi_flush test
  powerpc/process: Fix flush_all_to_thread for SPE
  powerpc/pseries: add missing cpumask.h include file
  selftests/powerpc: Fix ptrace tm failure
  KVM: PPC: Use exported tb_to_ns() function in decrementer emulation
  powerpc/pseries: Export maximum memory value
  powerpc/8xx: Use patch_site for perf counters setup
  powerpc/8xx: Use patch_site for memory setup patching
  powerpc/code-patching: Add a helper to get the address of a patch_site
  Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"
  powerpc/8xx: add missing header in 8xx_mmu.c
  powerpc/8xx: Add DT node for using the SEC engine of the MPC885
  ...
2018-11-02 09:19:35 -07:00
Linus Torvalds 685f7e4f16 powerpc updates for 4.20
Notable changes:
 
  - A large series to rewrite our SLB miss handling, replacing a lot of fairly
    complicated asm with much fewer lines of C.
 
  - Following on from that, we now maintain a cache of SLB entries for each
    process and preload them on context switch. Leading to a 27% speedup for our
    context switch benchmark on Power9.
 
  - Improvements to our handling of SLB multi-hit errors. We now print more debug
    information when they occur, and try to continue running by flushing the SLB
    and reloading, rather than treating them as fatal.
 
  - Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).
 
  - Add support for physical memory up to 2PB in the linear mapping on 64-bit
    Book3S. We only support up to 512TB as regular system memory, otherwise the
    percpu allocator runs out of vmalloc space.
 
  - Add stack protector support for 32 and 64-bit, with a per-task canary.
 
  - Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.
 
  - Support recognising "big cores" on Power9, where two SMT4 cores are presented
    to us as a single SMT8 core.
 
  - A large series to cleanup some of our ioremap handling and PTE flags.
 
  - Add a driver for the PAPR SCM (storage class memory) interface, allowing
    guests to operate on SCM devices (acked by Dan).
 
  - Changes to our ftrace code to handle very large kernels, where we need to use
    a trampoline to get to ftrace_caller().
 
 Many other smaller enhancements and cleanups.
 
 Thanks to:
   Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton Blanchard, Aravinda
   Prasad, Bartlomiej Zolnierkiewicz, Benjamin Herrenschmidt, Breno Leitao,
   Cédric Le Goater, Christophe Leroy, Christophe Lombard, Dan Carpenter, Daniel
   Axtens, Finn Thain, Gautham R. Shenoy, Gustavo Romero, Haren Myneni, Hari
   Bathini, Jia Hongtao, Joel Stanley, John Allen, Laurent Dufour, Madhavan
   Srinivasan, Mahesh Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael
   Bringmann, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran,
   Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab, Rob Herring, Sam
   Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan Johnson, Stephen Rothwell,
   Stewart Smith, Suraj Jitindar Singh, Tyrel Datwyler, Vaibhav Jain, Vasant
   Hegde, YueHaibing, zhong jiang,
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb01vTAAoJEFHr6jzI4aWADsEP/jqL3+2qxs098ra80tpXCpXJ
 tgXCosEs4b35sGtyHeUWZZZfWXeisaPAIlP8zTx1n50HACZduDYRAl0Ew9XB7Xdw
 enDHRVccD21FsmHBOx/Ii1rVJlovWlj6EQCWHKeZmNjeRoFuClVZ7CYmf+mBifKR
 sw2Db2fKA/59wMTq2zIMy5pqYgqlAs4jTWS6uN5hKPoBmO/82ARnNG+qgLuloD3Z
 O8zSDM9QQ7PpuyDgTjO9SAo2YjmEfXlEG6cOCCejsU3DMctaEAK5PUZ+blsHYHBH
 BYZYKs/x4pcw0SO41GtTh0M2YqDYBVuBIpRw8lLZap97Xo9ucSkAm5WD3rGxk4CY
 YeZKEPUql6MHN3+DKl8mx2F0V+Et/tio2HNqc9KReR1tfoolZAbe+SFZHfgmc/Rq
 RD9nnG8KRd4K2K1BTqpkTmI1EtE7jPtPJPSV8gMGhgL/N5vPmH3mql/qyOtYx48E
 6/hPzWESgs16VRZ/opLh8VvjlY1HBDODQhehhhl+o23/Vb8qEgRf8Uqhq50rQW1H
 EeOqyyYQ90txSU31Sgy1kQkvOgIFAsBObWT1ZCJ3RbfGbB4/tdEAvZqTZRlXo2OY
 7P0Sqcw/9Le5eJkHIlLtBv0TF7y1OYemCbLgRQzFlcRP+UKtYyg8eFnFjqbPEEmP
 ulwhn/BfFVSgaYKQ503u
 =I0pj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - A large series to rewrite our SLB miss handling, replacing a lot of
     fairly complicated asm with much fewer lines of C.

   - Following on from that, we now maintain a cache of SLB entries for
     each process and preload them on context switch. Leading to a 27%
     speedup for our context switch benchmark on Power9.

   - Improvements to our handling of SLB multi-hit errors. We now print
     more debug information when they occur, and try to continue running
     by flushing the SLB and reloading, rather than treating them as
     fatal.

   - Enable THP migration on 64-bit Book3S machines (eg. Power7/8/9).

   - Add support for physical memory up to 2PB in the linear mapping on
     64-bit Book3S. We only support up to 512TB as regular system
     memory, otherwise the percpu allocator runs out of vmalloc space.

   - Add stack protector support for 32 and 64-bit, with a per-task
     canary.

   - Add support for PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP.

   - Support recognising "big cores" on Power9, where two SMT4 cores are
     presented to us as a single SMT8 core.

   - A large series to cleanup some of our ioremap handling and PTE
     flags.

   - Add a driver for the PAPR SCM (storage class memory) interface,
     allowing guests to operate on SCM devices (acked by Dan).

   - Changes to our ftrace code to handle very large kernels, where we
     need to use a trampoline to get to ftrace_caller().

  And many other smaller enhancements and cleanups.

  Thanks to: Alan Modra, Alistair Popple, Aneesh Kumar K.V, Anton
  Blanchard, Aravinda Prasad, Bartlomiej Zolnierkiewicz, Benjamin
  Herrenschmidt, Breno Leitao, Cédric Le Goater, Christophe Leroy,
  Christophe Lombard, Dan Carpenter, Daniel Axtens, Finn Thain, Gautham
  R. Shenoy, Gustavo Romero, Haren Myneni, Hari Bathini, Jia Hongtao,
  Joel Stanley, John Allen, Laurent Dufour, Madhavan Srinivasan, Mahesh
  Salgaonkar, Mark Hairgrove, Masahiro Yamada, Michael Bringmann,
  Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Nathan
  Fontenot, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Oliver
  O'Halloran, Paul Mackerras, Petr Vorel, Rashmica Gupta, Reza Arbab,
  Rob Herring, Sam Bobroff, Samuel Mendoza-Jonas, Scott Wood, Stan
  Johnson, Stephen Rothwell, Stewart Smith, Suraj Jitindar Singh, Tyrel
  Datwyler, Vaibhav Jain, Vasant Hegde, YueHaibing, zhong jiang"

* tag 'powerpc-4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (221 commits)
  Revert "selftests/powerpc: Fix out-of-tree build errors"
  powerpc/msi: Fix compile error on mpc83xx
  powerpc: Fix stack protector crashes on CPU hotplug
  powerpc/traps: restore recoverability of machine_check interrupts
  powerpc/64/module: REL32 relocation range check
  powerpc/64s/radix: Fix radix__flush_tlb_collapsed_pmd double flushing pmd
  selftests/powerpc: Add a test of wild bctr
  powerpc/mm: Fix page table dump to work on Radix
  powerpc/mm/radix: Display if mappings are exec or not
  powerpc/mm/radix: Simplify split mapping logic
  powerpc/mm/radix: Remove the retry in the split mapping logic
  powerpc/mm/radix: Fix small page at boundary when splitting
  powerpc/mm/radix: Fix overuse of small pages in splitting logic
  powerpc/mm/radix: Fix off-by-one in split mapping logic
  powerpc/ftrace: Handle large kernel configs
  powerpc/mm: Fix WARN_ON with THP NUMA migration
  selftests/powerpc: Fix out-of-tree build errors
  powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected
  powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
  powerpc/time: isolate scaled cputime accounting in dedicated functions.
  ...
2018-10-26 14:36:21 -07:00
Paul Mackerras c43befca86 KVM: PPC: Use exported tb_to_ns() function in decrementer emulation
This changes the KVM code that emulates the decrementer function to do
the conversion of decrementer values to time intervals in nanoseconds
by calling the tb_to_ns() function exported by the powerpc timer code,
in preference to open-coded arithmetic using values from the
decrementer_clockevent struct.  Similarly, the HV-KVM code that did
the same conversion using arithmetic on tb_ticks_per_sec also now
uses tb_to_ns().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-26 21:58:58 +11:00
Linus Torvalds 0d1e8b8d2b KVM updates for v4.20
ARM:
  - Improved guest IPA space support (32 to 52 bits)
 
  - RAS event delivery for 32bit
 
  - PMU fixes
 
  - Guest entry hardening
 
  - Various cleanups
 
  - Port of dirty_log_test selftest
 
 PPC:
  - Nested HV KVM support for radix guests on POWER9.  The performance is
    much better than with PR KVM.  Migration and arbitrary level of
    nesting is supported.
 
  - Disable nested HV-KVM on early POWER9 chips that need a particular hardware
    bug workaround
 
  - One VM per core mode to prevent potential data leaks
 
  - PCI pass-through optimization
 
  - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
 
 s390:
  - Initial version of AP crypto virtualization via vfio-mdev
 
  - Improvement for vfio-ap
 
  - Set the host program identifier
 
  - Optimize page table locking
 
 x86:
  - Enable nested virtualization by default
 
  - Implement Hyper-V IPI hypercalls
 
  - Improve #PF and #DB handling
 
  - Allow guests to use Enlightened VMCS
 
  - Add migration selftests for VMCS and Enlightened VMCS
 
  - Allow coalesced PIO accesses
 
  - Add an option to perform nested VMCS host state consistency check
    through hardware
 
  - Automatic tuning of lapic_timer_advance_ns
 
  - Many fixes, minor improvements, and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJb0FINAAoJEED/6hsPKofoI60IAJRS3vOAQ9Fav8cJsO1oBHcX
 3+NexfnBke1bzrjIR3SUcHKGZbdnVPNZc+Q4JjIbPpPmmOMU5jc9BC1dmd5f4Vzh
 BMnQ0yCvgFv3A3fy/Icx1Z8NJppxosdmqdQLrQrNo8aD3cjnqY2yQixdXrAfzLzw
 XEgKdIFCCz8oVN/C9TT4wwJn6l9OE7BM5bMKGFy5VNXzMu7t64UDOLbbjZxNgi1g
 teYvfVGdt5mH0N7b2GPPWRbJmgnz5ygVVpVNQUEFrdKZoCm6r5u9d19N+RRXAwan
 ZYFj10W2T8pJOUf3tryev4V33X7MRQitfJBo4tP5hZfi9uRX89np5zP1CFE7AtY=
 =yEPW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:
   - Improved guest IPA space support (32 to 52 bits)

   - RAS event delivery for 32bit

   - PMU fixes

   - Guest entry hardening

   - Various cleanups

   - Port of dirty_log_test selftest

  PPC:
   - Nested HV KVM support for radix guests on POWER9. The performance
     is much better than with PR KVM. Migration and arbitrary level of
     nesting is supported.

   - Disable nested HV-KVM on early POWER9 chips that need a particular
     hardware bug workaround

   - One VM per core mode to prevent potential data leaks

   - PCI pass-through optimization

   - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

  s390:
   - Initial version of AP crypto virtualization via vfio-mdev

   - Improvement for vfio-ap

   - Set the host program identifier

   - Optimize page table locking

  x86:
   - Enable nested virtualization by default

   - Implement Hyper-V IPI hypercalls

   - Improve #PF and #DB handling

   - Allow guests to use Enlightened VMCS

   - Add migration selftests for VMCS and Enlightened VMCS

   - Allow coalesced PIO accesses

   - Add an option to perform nested VMCS host state consistency check
     through hardware

   - Automatic tuning of lapic_timer_advance_ns

   - Many fixes, minor improvements, and cleanups"

* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
  Revert "kvm: x86: optimize dr6 restore"
  KVM: PPC: Optimize clearing TCEs for sparse tables
  x86/kvm/nVMX: tweak shadow fields
  selftests/kvm: add missing executables to .gitignore
  KVM: arm64: Safety check PSTATE when entering guest and handle IL
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
  arm/arm64: KVM: Enable 32 bits kvm vcpu events support
  arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
  KVM: arm64: Fix caching of host MDCR_EL2 value
  KVM: VMX: enable nested virtualization by default
  KVM/x86: Use 32bit xor to clear registers in svm.c
  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
  kvm: vmx: Defer setting of DR6 until #DB delivery
  kvm: x86: Defer setting of CR2 until #PF delivery
  kvm: x86: Add payload operands to kvm_multiple_exception
  kvm: x86: Add exception payload fields to kvm_vcpu_events
  kvm: x86: Add has_payload and payload to kvm_queued_exception
  KVM: Documentation: Fix omission in struct kvm_vcpu_events
  KVM: selftests: add Enlightened VMCS test
  ...
2018-10-25 17:57:35 -07:00
Alexey Kardashevskiy 6e301a8e56 KVM: PPC: Optimize clearing TCEs for sparse tables
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
table and a table with userspace addresses. These tables are radix trees,
we allocate indirect levels when they are written to. Since
the memory allocation is problematic in real mode, we have 2 accessors
to the entries:
- for virtual mode: it allocates the memory and it is always expected
to return non-NULL;
- fr real mode: it does not allocate and can return NULL.

Also, DMA windows can span to up to 55 bits of the address space and since
we never have this much RAM, such windows are sparse. However currently
the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.

Since we maintain a userspace addresses table for VFIO which is a mirror
of the hardware table, we can use it to know which parts of the DMA
window have not been mapped and skip these so does this patch.

The bare metal systems do not have this problem as they use a bypass mode
of a PHB which maps RAM directly.

This helps a lot with sparse DMA windows, reducing the shutdown time from
about 3 minutes per 1 billion TCEs to a few seconds for 32GB sparse guest.
Just skipping the last level seems to be good enough.

As non-allocating accessor is used now in virtual mode as well, rename it
from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-20 20:47:02 +11:00
Paul Mackerras 8d9fcacff9 KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
This disables the use of the streamlined entry path for radix guests
on early POWER9 chips that need the workaround added in commit
a25bd72bad ("powerpc/mm/radix: Workaround prefetch issue with KVM",
2017-07-24), because the streamlined entry path does not include
that workaround.  This also means that we can't do nested HV-KVM
on those chips.

Since the chips that need that workaround are the same ones that can't
run both radix and HPT guests at the same time on different threads of
a core, we use the existing 'no_mixing_hpt_and_radix' variable that
identifies those chips to identify when we can't use the new guest
entry path, and when we can't do nested virtualization.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-19 20:44:04 +11:00
Michael Ellerman 23ad1a2700 powerpc: Add -Werror at arch/powerpc level
Back when I added -Werror in commit ba55bd7436 ("powerpc: Add
configurable -Werror for arch/powerpc") I did it by adding it to most
of the arch Makefiles.

At the time we excluded math-emu, because apparently it didn't build
cleanly. But that seems to have been fixed somewhere in the interim.

So move the -Werror addition to the top-level of the arch, this saves
us from repeating it in every Makefile and means we won't forget to
add it to any new sub-dirs.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-19 00:56:17 +11:00
Paul Mackerras 901f8c3f6f KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result
This adds a KVM_PPC_NO_HASH flag to the flags field of the
kvm_ppc_smmu_info struct, and arranges for it to be set when
running as a nested hypervisor, as an unambiguous indication
to userspace that HPT guests are not supported.  Reporting the
KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
indicating only that the new HPT features in ISA V3.0 are not
supported, leaving it ambiguous whether pre-V3.0 HPT features
are supported.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09 16:14:54 +11:00
Paul Mackerras aa069a9969 KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization
With this, userspace can enable a KVM-HV guest to run nested guests
under it.

The administrator can control whether any nested guests can be run;
setting the "nested" module parameter to false prevents any guests
becoming nested hypervisors (that is, any attempt to enable the nested
capability on a guest will fail).  Guests which are already nested
hypervisors will continue to be so.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09 16:14:47 +11:00
Paul Mackerras 9d67121a4f Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
series of commits that touch both general arch/powerpc code and KVM
code.  These commits will be merged both via the KVM tree and the
powerpc tree.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-09 16:13:20 +11:00
Paul Mackerras 83a055104e KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs
This adds a list of valid shadow PTEs for each nested guest to
the 'radix' file for the guest in debugfs.  This can be useful for
debugging.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras de760db4d9 KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode
With this, the KVM-HV module can be loaded in a guest running under
KVM-HV, and if the hypervisor supports nested virtualization, this
guest can now act as a nested hypervisor and run nested guests.

This also adds some checks to inform userspace that HPT guests are not
supported by nested hypervisors (by returning false for the
KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
configuring a guest to use HPT mode.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 10b5022db7 KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED
The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
memory of a hv_regs struct and the address of a pt_regs struct.  The
hcall requests the L0 hypervisor to use the register values in these
structs to run a L2 guest and to return the exit state of the L2 guest
in these structs.  These are in the endianness of the L1 guest, rather
than being always big-endian as is usually the case for PAPR
hypercalls.

This is convenient because it means that the L1 guest can pass the
address of the regs field in its kvm_vcpu_arch struct.  This also
improves performance slightly by avoiding the need for two copies of
the pt_regs struct.

When reading/writing these structures, this patch handles the case
where the endianness of the L1 guest differs from that of the L0
hypervisor, by byteswapping the structures after reading and before
writing them back.

Since all the fields of the pt_regs are of the same type, i.e.,
unsigned long, we treat it as an array of unsigned longs.  The fields
of struct hv_guest_state are not all the same, so its fields are
byteswapped individually.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 73937deb4b KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry
restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
nested (L2) guest into the vcpu structure. We need to sanitise these
values to ensure we don't let the L1 guest hypervisor do things we don't
want it to.

We don't let data address watchpoints or completed instruction address
breakpoints be set to match in hypervisor state.

We also don't let L1 enable features in the hypervisor facility status
and control register (HFSCR) for L2 which we have disabled for L1. That
is L2 will get the subset of features which the L0 hypervisor has
enabled for L1 and the features L1 wants to enable for L2. This could
mean we give L1 a hypervisor facility unavailable interrupt for a
facility it thinks it has enabled, however it shouldn't have enabled a
facility it itself doesn't have for the L2 guest.

We sanitise the registers when copying in the L2 hv_regs. We don't need
to sanitise when copying back the L1 hv_regs since these shouldn't be
able to contain invalid values as they're just what was copied out.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 3032341853 KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register
This adds a one-reg register identifier which can be used to read and
set the virtual PTCR for the guest.  This register identifies the
address and size of the virtual partition table for the guest, which
contains information about the nested guests under this guest.

Migrating this value is the only extra requirement for migrating a
guest which has nested guests (assuming of course that the destination
host supports nested virtualization in the kvm-hv module).

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras f3c99f97a3 KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested
When running as a nested hypervisor, this avoids reading hypervisor
privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
instead reasonable default values are used.  This also avoids writing
LPIDR in the single-vcpu entry/exit path.

Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
since its only caller already checks this.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 9d0b048da7 KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu
This is only done at level 0, since only level 0 knows which physical
CPU a vcpu is running on.  This does for nested guests what L0 already
did for its own guests, which is to flush the TLB on a pCPU when it
goes to run a vCPU there, and there is another vCPU in the same VM
which previously ran on this pCPU and has now started to run on another
pCPU.  This is to handle the situation where the other vCPU touched
a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
on that new pCPU and thus left behind a stale TLB entry on this pCPU.

This introduces a limit on the the vcpu_token values used in the
H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.

[paulus@ozlabs.org - made prev_cpu array be short[] to reduce
 memory consumption.]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 690ed4cad8 KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested
This adds code to call the H_TLB_INVALIDATE hypercall when running as
a guest, in the cases where we need to invalidate TLBs (or other MMU
caches) as part of managing the mappings for a nested guest.  Calling
H_TLB_INVALIDATE lets the nested hypervisor inform the parent
hypervisor about changes to partition-scoped page tables or the
partition table without needing to do hypervisor-privileged tlbie
instructions.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh e3b6b46615 KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall
When running a nested (L2) guest the guest (L1) hypervisor will use
the H_TLB_INVALIDATE hcall when it needs to change the partition
scoped page tables or the partition table which it manages.  It will
use this hcall in the situations where it would use a partition-scoped
tlbie instruction if it were running in hypervisor mode.

The H_TLB_INVALIDATE hcall can invalidate different scopes:

Invalidate TLB for a given target address:
- This invalidates a single L2 -> L1 pte
- We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
  address space which is being invalidated. This is because a single
  L2 -> L1 pte may have been mapped with more than one pte in the
  L2 -> L0 page tables.

Invalidate the entire TLB for a given LPID or for all LPIDs:
- Invalidate the entire shadow_pgtable for a given nested guest, or
  for all nested guests.

Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
- We don't cache the PWC, so nothing to do.

Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
- Here we re-read the partition table entry and remove the nested state
  for any nested guest for which the first doubleword of the partition
  table entry is now zero.

The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
word (of which only the RIC, PRS and R fields are used), the rS value
(giving the lpid, where required) and the rB value (giving the IS, AP
and EPN values).

[paulus@ozlabs.org - adapted to having the partition table in guest
memory, added the H_TLB_INVALIDATE implementation, removed tlbie
instruction emulation, reworded the commit message.]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 8cf531ed48 KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings
When a host (L0) page which is mapped into a (L1) guest is in turn
mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
so that these mappings can be retrieved later.

Whenever we create an entry in a shadow_pgtable for a nested guest we
create a corresponding rmap entry and add it to the list for the
L1 guest memslot at the index of the L1 guest page it maps. This means
at the L1 guest memslot we end up with lists of rmaps.

When we are notified of a host page being invalidated which has been
mapped through to a (L1) guest, we can then walk the rmap list for that
guest page, and find and invalidate all of the corresponding
shadow_pgtable entries.

In order to reduce memory consumption, we compress the information for
each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
for the guest real page frame number -- which will fit in a single
unsigned long.  To avoid a scenario where a guest can trigger
unbounded memory allocations, we scan the list when adding an entry to
see if there is already an entry with the contents we need.  This can
occur, because we don't ever remove entries from the middle of a list.

A struct nested guest rmap is a list pointer and an rmap entry;
----------------
| next pointer |
----------------
| rmap entry   |
----------------

Thus the rmap pointer for each guest frame number in the memslot can be
either NULL, a single entry, or a pointer to a list of nested rmap entries.

gfn	 memslot rmap array
 	-------------------------
 0	| NULL			|	(no rmap entry)
 	-------------------------
 1	| single rmap entry	|	(rmap entry with low bit set)
 	-------------------------
 2	| list head pointer	|	(list of rmap entries)
 	-------------------------

The final entry always has the lowest bit set and is stored in the next
pointer of the last list entry, or as a single rmap entry.
With a list of rmap entries looking like;

-----------------	-----------------	-------------------------
| list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
-----------------	-----------------	-------------------------
			| rmap entry	|	| rmap entry		|
			-----------------	-------------------------

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh fd10be2573 KVM: PPC: Book3S HV: Handle page fault for a nested guest
Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;

	L1
	|
	| (L1 -> L0)
	|
	----> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 read address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;

	L2				L2
	|				|
	| (L2 -> L1)			|
	|				|
	----> L1			| (L2 -> L0)
	      |				|
	      | (L1 -> L0)		|
	      |				|
	      ----> L0			--------> L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
   provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
   or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
   the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
   host page.
3. Set the rc bits in the L2 -> L0 pte.

As we reuse a large number of functions in book3s_64_mmu_radix.c for
this we also needed to refactor a number of these functions to take
an lpid parameter so that the correct lpid is used for tlb invalidations.
The functionality however has remained the same.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 4bad77799f KVM: PPC: Book3S HV: Handle hypercalls correctly when nested
When we are running as a nested hypervisor, we use a hypercall to
enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
that the hypercall handlers listed in hcall_real_table never get called.
There are some hypercalls that are handled there and not in
kvmppc_pseries_do_hcall(), which therefore won't get processed for
a nested guest.

To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
hypercalls, with the following exceptions:

- The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
  we only support radix mode for nested guests.

- H_CEDE has to be handled specially because the cede logic in
  kvmhv_run_single_vcpu assumes that it has been processed by the time
  that kvmhv_p9_guest_entry() returns.  Therefore we put a special
  case for H_CEDE in kvmhv_p9_guest_entry().

For the XICS hypercalls, if real-mode processing is enabled, then the
virtual-mode handlers assume that they are being called only to finish
up the operation.  Therefore we turn off the real-mode flag in the XICS
code when running as a nested hypervisor.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras f3c18e9342 KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor
This adds code to call the H_IPI and H_EOI hypercalls when we are
running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
feature) and we would otherwise access the XICS interrupt controller
directly or via an OPAL call.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 360cae3137 KVM: PPC: Book3S HV: Nested guest entry via hypercall
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests.  The hypercall supplies
register values in two structs.  Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.

Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness.  If they are of different endianness, the version number
check will fail and the hcall will be rejected.

Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and treat it as Y.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 8e3f5fc104 KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
This starts the process of adding the code to support nested HV-style
virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
a nested hypervisor can use to set the base address and size of a
partition table in its memory (analogous to the PTCR register).
On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
hypercall from the guest is handled by code that saves the virtual
PTCR value for the guest.

This also adds code for creating and destroying nested guests and for
reading the partition table entry for a nested guest from L1 memory.
Each nested guest has its own shadow LPID value, different in general
from the LPID value used by the nested hypervisor to refer to it.  The
shadow LPID value is allocated at nested guest creation time.

Nested hypervisor functionality is only available for a radix guest,
which therefore means a radix host on a POWER9 (or later) processor.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras f0f825f0e2 KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()
kvmppc_unmap_pte() does a sequence of operations that are open-coded in
kvm_unmap_radix().  This extends kvmppc_unmap_pte() a little so that it
can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 04bae9d5b4 KVM: PPC: Book3S HV: Refactor radix page fault handler
The radix page fault handler accounts for all cases, including just
needing to insert a pte.  This breaks it up into separate functions for
the two main cases; setting rc and inserting a pte.

This allows us to make the setting of rc and inserting of a pte
generic for any pgtable, not specific to the one for this guest.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 9811c78e96 KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table agnostic
kvmppc_mmu_radix_xlate() is used to translate an effective address
through the process tables. The process table and partition tables have
identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
function able to translate either an effective address through the
process tables or a guest real address through the partition tables.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Suraj Jitindar Singh 89329c0be8 KVM: PPC: Book3S HV: Clear partition table entry on vm teardown
When destroying a VM we return the LPID to the pool, however we never
zero the partition table entry. This is instead done when we reallocate
the LPID.

Zero the partition table entry on VM teardown before returning the LPID
to the pool. This means if we were running as a nested hypervisor the
real hypervisor could use this to determine when it can free resources.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras fd0944baad KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits.  This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 9a94d3ee2d KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format.  It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 32eb150aee KVM: PPC: Book3S HV: Handle hypervisor instruction faults better
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables.  However, there are other causes for hypervisor
instruction page faults, such as not being to update a reference
(R) or change (C) bit.  The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.

In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 95a6432ce9 KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.

The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered.  On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE.  At that point we check a flag in the
stack frame to see whether we came in via the old path and the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.

The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 53655ddd77 KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.

Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 7854f7545b KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions.  This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved.  The
condition register and DSCR are now always preserved.

With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero.  So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.

Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9.  This enables the code to be simplified
quite a bit.

_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras df709a296e KVM: PPC: Book3S HV: Simplify real-mode interrupt handling
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest.  With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.

Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 41f4e631da KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable functions
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path.  The calling
convention is made compatible with C.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras f7035ce9f1 KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code
This is based on a patch by Suraj Jitindar Singh.

This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c.  This is to
make future maintenance and modification easier.  The algorithm
expressed in the C code is almost identical to the previous
algorithm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras 966eba9316 KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap.  This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede.  Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Paul Mackerras d24ea8a733 KVM: PPC: Book3S: Simplify external interrupt handling
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller).  The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.

In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller.  That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit.  This
patch does that.

With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt.  In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy a3ac077b75 KVM: PPC: Remove redundand permission bits removal
The kvmppc_gpa_to_ua() helper itself takes care of the permission
bits in the TCE and yet every single caller removes them.

This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
(which are GPAs + TCE permission bits) to make the callers simpler.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy 2691f0ff3d KVM: PPC: Propagate errors to the guest when failed instead of ignoring
At the moment if the PUT_TCE{_INDIRECT} handlers fail to update
the hardware tables, we print a warning once, clear the entry and
continue. This is so as at the time the assumption was that if
a VFIO device is hotplugged into the guest, and the userspace replays
virtual DMA mappings (i.e. TCEs) to the hardware tables and if this fails,
then there is nothing useful we can do about it.

However the assumption is not valid as these handlers are not called for
TCE replay (VFIO ioctl interface is used for that) and these handlers
are for new TCEs.

This returns an error to the guest if there is a request which cannot be
processed. By now the only possible failure must be H_TOO_HARD.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 16:04:27 +11:00
Alexey Kardashevskiy 42de7b9e21 KVM: PPC: Validate TCEs against preregistered memory page sizes
The userspace can request an arbitrary supported page size for a DMA
window and this works fine as long as the mapped memory is backed with
the pages of the same or bigger size; if this is not the case,
mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
dangerously incorrect TCEs.

However since it is quite easy to misconfigure the KVM and we do not do
reverts to all changes made to TCE tables if an error happens in a middle,
we better do the acceptable page size validation before we even touch
the tables.

This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
against the preregistered memory page sizes.

Since the new check uses real/virtual mode helpers, this renames
kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
case and mirrors it for the virtual mode under the old name. The real
mode handler is not used for the virtual mode as:
1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
virtual mode does not have to use and since on POWER9+radix only virtual
mode handlers actually work, we do not want to slow down that path even
a bit.

This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
are static now.

From now on the attempts on mapping IOMMU pages bigger than allowed
will result in KVM exit.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-09 15:45:15 +11:00
Paul Mackerras aa2278644a KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VM
This adds a mode where the vcore scheduling logic in HV KVM limits itself
to scheduling only virtual cores from the same VM on any given physical
core.  This is enabled via a new module parameter on the kvm-hv module
called "one_vm_per_core".  For this to work on POWER9, it is necessary to
set indep_threads_mode=N.  (On POWER8, hardware limitations mean that KVM
is never in independent threads mode, regardless of the indep_threads_mode
setting.)

Thus the settings needed for this to work are:

1. The host is in SMT1 mode.
2. On POWER8, the host is not in 2-way or 4-way static split-core mode.
3. On POWER9, the indep_threads_mode parameter is N.
4. The one_vm_per_core parameter is Y.

With these settings, KVM can run up to 4 vcpus on a core at the same
time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest
threading mode), and will ensure that all of the vcpus belong to the
same VM.

This is intended for use in security-conscious settings where users are
concerned about possible side-channel attacks between threads which could
perhaps enable one VM to attack another VM on the same core, or the host.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-10-05 15:55:06 +10:00