Commit graph

4160 commits

Author SHA1 Message Date
Linus Torvalds 0081a0ce80 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnad:
 "The main RCU related changes in this cycle were:

   - Removal of spin_unlock_wait()
   - SRCU updates
   - RCU torture-test updates
   - RCU Documentation updates
   - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
   - Miscellaneous RCU fixes
   - CPU-hotplug fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
  arch: Remove spin_unlock_wait() arch-specific definitions
  locking: Remove spin_unlock_wait() generic definitions
  drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
  ipc: Replace spin_unlock_wait() with lock/unlock pair
  exit: Replace spin_unlock_wait() with lock/unlock pair
  completion: Replace spin_unlock_wait() with lock/unlock pair
  doc: Set down RCU's scheduling-clock-interrupt needs
  doc: No longer allowed to use rcu_dereference on non-pointers
  doc: Add RCU files to docbook-generation files
  doc: Update memory-barriers.txt for read-to-write dependencies
  doc: Update RCU documentation
  membarrier: Provide expedited private command
  rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
  rcu: Add warning to rcu_idle_enter() for irqs enabled
  rcu: Make rcu_idle_enter() rely on callers disabling irqs
  rcu: Add assertions verifying blocked-tasks list
  rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
  rcu: Add TPS() protection for _rcu_barrier_trace strings
  rcu: Use idle versions of swait to make idle-hack clear
  swait: Add idle variants which don't contribute to load average
  ...
2017-09-04 08:13:52 -07:00
Ingo Molnar edc2988c54 Merge branch 'linus' into locking/core, to fix up conflicts
Conflicts:
	mm/page_alloc.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-04 11:01:18 +02:00
Cédric Le Goater ac5e5a5402 powerpc/xive: add XIVE Exploitation Mode to CAS
On POWER9, the Client Architecture Support (CAS) negotiation process
determines whether the guest operates in XIVE Legacy compatibility or
in XIVE exploitation mode. Now that we have initial guest support for
the XIVE interrupt controller, let's inform the hypervisor what we can
do.

The platform advertises the XIVE Exploitation Mode support using the
property "ibm,arch-vec-5-platform-support-vec-5", byte 23 bits 0-1 :

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only
 - 0b10 XIVE legacy or exploitation mode

The OS asks for XIVE Exploitation Mode support using the property
"ibm,architecture-vec-5", byte 23 bits 0-1:

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-02 21:02:38 +10:00
Cédric Le Goater bed81ee181 powerpc/xive: introduce H_INT_ESB hcall
The H_INT_ESB hcall() is used to issue a load or store to the ESB page
instead of using the MMIO pages. This can be used as a workaround on
some HW issues. The OS knows that this hcall should be used on an
interrupt source when the ESB hcall flag is set to 1 in the hcall
H_INT_GET_SOURCE_INFO.

To maintain the frontier between the xive frontend and backend, we
introduce a new xive operation 'esb_rw' to be used in the routines
doing memory accesses on the ESBs.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-02 21:02:37 +10:00
Cédric Le Goater c58a14a9cc powerpc/xive: add the HW IRQ number under xive_irq_data
It will be required later by the H_INT_ESB hcall.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-02 21:02:37 +10:00
Cédric Le Goater eac1e731b5 powerpc/xive: guest exploitation of the XIVE interrupt controller
This is the framework for using XIVE in a PowerVM guest. The support
is very similar to the native one in a much simpler form.

Each source is associated with an Event State Buffer (ESB). This is a
two bit state machine which is used to trigger events. The bits are
named "P" (pending) and "Q" (queued) and can be controlled by MMIO.
The Guest OS registers event (or notifications) queues on which the HW
will post event data for a target to notify.

Instead of OPAL calls, a set of Hypervisors call are used to configure
the interrupt sources and the event/notification queues of the guest:

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines to which "target" and "priority" is assigned to a source

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification config associated with the
   queue, only unconditional notification for the moment.  Reset is
   performed with a queue size of 0 and queueing is disabled in that
   case.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the partition's interrupt exploitation structures to
   their initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issue a synchronisation on a source to make sure sure all
   notifications have reached their queue.

As for XICS, the XIVE interface for the guest is described in the
device tree under the "interrupt-controller" node. A couple of new
properties are specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   managnement areas (TIMA), also called rings, for the User level and
   for the Guest OS level. Only the Guest OS level is taken into
   account today.

 - "ibm,xive-eq-sizes"

   the size of the event queues. One cell per size supported, contains
   log2 of size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the interrupt numbers ranges assigned to the guest. These are
   allocated using a simple bitmap.

and also :

 - "/ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use.

Tested with a QEMU XIVE model for pseries and with the Power hypervisor.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-02 21:02:35 +10:00
Haren Myneni 146e9f1b65 crypto/nx: Add P9 NX specific error codes for 842 engine
This patch adds changes for checking P9 specific 842 engine
error codes. These errros are reported in coprocessor status
block (CSB) for failures.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:42:50 +10:00
Christophe Leroy da74f65920 powerpc/32: add memset16()
Commit 694fc88ce2 ("powerpc/string: Implement optimized
memset variants") added memset16(), memset32() and memset64()
for the 64 bits PPC.

On 32 bits, memset64() is not relevant, and as shown below,
the generic version of memset32() gives a good code, so only
memset16() is candidate for an optimised version.

000009c0 <memset32>:
 9c0:   2c 05 00 00     cmpwi   r5,0
 9c4:   39 23 ff fc     addi    r9,r3,-4
 9c8:   4d 82 00 20     beqlr
 9cc:   7c a9 03 a6     mtctr   r5
 9d0:   94 89 00 04     stwu    r4,4(r9)
 9d4:   42 00 ff fc     bdnz    9d0 <memset32+0x10>
 9d8:   4e 80 00 20     blr

The last part of memset() handling the not 4-bytes multiples
operates on bytes, making it unsuitable for handling word without
modification. As it would increase memset() complexity, it is
better to implement memset16() from scratch. In addition it
has the advantage of allowing a more optimised memset16() than what
we would have by using the memset() function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:42:45 +10:00
Paul Mackerras d2b65ac652 powerpc: Emulate load/store floating point as integer word instructions
This adds emulation for the lfiwax, lfiwzx and stfiwx instructions.
This necessitated adding a new flag to indicate whether a floating
point or an integer conversion was needed for LOAD_FP and STORE_FP,
so this moves the size field in op->type up 4 bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:42:44 +10:00
Paul Mackerras a53d5182e2 powerpc: Separate out load/store emulation into its own function
This moves the parts of emulate_step() that deal with emulating
load and store instructions into a new function called
emulate_loadstore().  This is to make it possible to reuse this
code in the alignment handler.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:42:41 +10:00
Paul Mackerras d955189ae4 powerpc: Handle opposite-endian processes in emulation code
This adds code to the load and store emulation code to byte-swap
the data appropriately when the process being emulated is set to
the opposite endianness to that of the kernel.

This also enables the emulation for the multiple-register loads
and stores (lmw, stmw, lswi, stswi, lswx, stswx) to work for
little-endian.  In little-endian mode, the partial word at the
end of a transfer for lsw*/stsw* (when the byte count is not a
multiple of 4) is loaded/stored at the least-significant end of
the register.  Additionally, this fixes a bug in the previous
code in that it could call read_mem/write_mem with a byte count
that was not 1, 2, 4 or 8.

Note that this only works correctly on processors with "true"
little-endian mode, such as IBM POWER processors from POWER6 on, not
the so-called "PowerPC" little-endian mode that uses address swizzling
as implemented on the old 32-bit 603, 604, 740/750, 74xx CPUs.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:55 +10:00
Paul Mackerras b2543f7b20 powerpc: Emulate the dcbz instruction
This adds code to analyse_instr() and emulate_step() to understand the
dcbz (data cache block zero) instruction.  The emulate_dcbz() function
is made public so it can be used by the alignment handler in future.
(The apparently unnecessary cropping of the address to 32 bits is
there because it will be needed in that situation.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:54 +10:00
Paul Mackerras c22435a5f3 powerpc: Emulate FP/vector/VSX loads/stores correctly when regs not live
At present, the analyse_instr/emulate_step code checks for the
relevant MSR_FP/VEC/VSX bit being set when a FP/VMX/VSX load
or store is decoded, but doesn't recheck the bit before reading or
writing the relevant FP/VMX/VSX register in emulate_step().

Since we don't have preemption disabled, it is possible that we get
preempted between checking the MSR bit and doing the register access.
If that happened, then the registers would have been saved to the
thread_struct for the current process.  Accesses to the CPU registers
would then potentially read stale values, or write values that would
never be seen by the user process.

Another way that the registers can become non-live is if a page
fault occurs when accessing user memory, and the page fault code
calls a copy routine that wants to use the VMX or VSX registers.

To fix this, the code for all the FP/VMX/VSX loads gets restructured
so that it forms an image in a local variable of the desired register
contents, then disables preemption, checks the MSR bit and either
sets the CPU register or writes the value to the thread struct.
Similarly, the code for stores checks the MSR bit, copies either the
CPU register or the thread struct to a local variable, then reenables
preemption and then copies the register image to memory.

If the instruction being emulated is in the kernel, then we must not
use the register values in the thread_struct.  In this case, if the
relevant MSR enable bit is not set, then emulate_step refuses to
emulate the instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:52 +10:00
Paul Mackerras d120cdbce6 powerpc/64: Fix update forms of loads and stores to write 64-bit EA
When a 64-bit processor is executing in 32-bit mode, the update forms
of load and store instructions are required by the architecture to
write the full 64-bit effective address into the RA register, though
only the bottom 32 bits are used to address memory.  Currently,
the instruction emulation code writes the truncated address to the
RA register.  This fixes it by keeping the full 64-bit EA in the
instruction_op structure, truncating the address in emulate_step()
where it is used to address memory, rather than in the address
computations in analyse_instr().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:49 +10:00
Paul Mackerras 350779a29f powerpc: Handle most loads and stores in instruction emulation code
This extends the instruction emulation infrastructure in sstep.c to
handle all the load and store instructions defined in the Power ISA
v3.0, except for the atomic memory operations, ldmx (which was never
implemented), lfdp/stfdp, and the vector element load/stores.

The instructions added are:

Integer loads and stores: lbarx, lharx, lqarx, stbcx., sthcx., stqcx.,
lq, stq.

VSX loads and stores: lxsiwzx, lxsiwax, stxsiwx, lxvx, lxvl, lxvll,
lxvdsx, lxvwsx, stxvx, stxvl, stxvll, lxsspx, lxsdx, stxsspx, stxsdx,
lxvw4x, lxsibzx, lxvh8x, lxsihzx, lxvb16x, stxvw4x, stxsibx, stxvh8x,
stxsihx, stxvb16x, lxsd, lxssp, lxv, stxsd, stxssp, stxv.

These instructions are handled both in the analyse_instr phase and in
the emulate_step phase.

The code for lxvd2ux and stxvd2ux has been taken out, as those
instructions were never implemented in any processor and have been
taken out of the architecture, and their opcodes have been reused for
other instructions in POWER9 (lxvb16x and stxvb16x).

The emulation for the VSX loads and stores uses helper functions
which don't access registers or memory directly, which can hopefully
be reused by KVM later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:48 +10:00
Paul Mackerras 3cdfcbfd32 powerpc: Change analyse_instr so it doesn't modify *regs
The analyse_instr function currently doesn't just work out what an
instruction does, it also executes those instructions whose effect
is only to update CPU registers that are stored in struct pt_regs.
This is undesirable because optprobes uses analyse_instr to work out
if an instruction could be successfully emulated in future.

This changes analyse_instr so it doesn't modify *regs; instead it
stores information in the instruction_op structure to indicate what
registers (GPRs, CR, XER, LR) would be set and what value they would
be set to.  A companion function called emulate_update_regs() can
then use that information to update a pt_regs struct appropriately.

As a minor cleanup, this replaces inline asm using the cntlzw and
cntlzd instructions with calls to __builtin_clz() and __builtin_clzl().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-09-01 16:39:27 +10:00
Jérôme Glisse fb1522e099 KVM: update to new mmu_notifier semantic v2
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()

Remove now useless invalidate_page callback.

Changed since v1 (Linus Torvalds)
    - remove now useless kvm_arch_mmu_notifier_invalidate_page()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Adam Borowski <kilobyte@angband.pl>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:13:00 -07:00
Paul Mackerras 93b2d3cf37 powerpc: Correct instruction code for xxlor instruction
The instruction code for xxlor that commit 0016a4cf55 ("powerpc:
Emulate most Book I instructions in emulate_step()", 2010-06-15)
added is actually the code for xxlnor.  It is used in get_vsr()
and put_vsr() and the effect of the error is that if emulate_step
is used to emulate a VSX load or store from any register other
than vsr0, the bitwise complement of the correct value will be
loaded or stored.  This corrects the error.

Fixes: 0016a4cf55 ("powerpc: Emulate most Book I instructions in emulate_step()")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 22:07:19 +10:00
Oliver O'Halloran 2a636a56d2 powerpc/smp: Add cpu_l2_cache_map
We want to add an extra level to the CPU scheduler topology to account
for cores which share a cache. To do this we need to build a cpumask
for each CPU that indicates which CPUs share this cache to use as an
input to the scheduler.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:56 +10:00
Tobin C. Harding eb039161da powerpc/asm: Convert .llong directives to .8byte
.llong is an undocumented PPC specific directive. The generic
equivalent is .quad, but even better (because it's self describing) is
.8byte.

Convert all .llong directives to .8byte.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:47 +10:00
Balbir Singh d1e1b351f5 powerpc/xmon: Add ISA v3.0 SPRs to SPR dump
Add support for printing the PIDR/TIDR for ISA 300 and PSSCR and PTCR
in ISA 3.0 hypervisor mode.

SPRN_PSSCR_PR is the privileged mode access and is used when we are
not in hypervisor mode.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:45 +10:00
Sukadev Bhattiprolu 2392c8c8c0 powerpc/powernv/vas: Define copy/paste interfaces
Define interfaces (wrappers) to the 'copy' and 'paste'
instructions (which are new in PowerISA 3.0). These are intended to be
used to by NX driver(s) to submit Coprocessor Request Blocks (CRBs) to
the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:38 +10:00
Sukadev Bhattiprolu 5239af679a powerpc/powernv/vas: Define vas_tx_win_open()
Define an interface to open a VAS send window. This interface is
intended to be used the Nest Accelerator (NX) driver(s) to open
a send window and use it to submit compression/encryption requests
to a VAS receive window.

The receive window, identified by the [vasid, cop] parameters, must
already be open in VAS (i.e connected to an NX engine).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:37 +10:00
Sukadev Bhattiprolu 98271d4198 powerpc/powernv/vas: Define vas_win_close() interface
Define the vas_win_close() interface which should be used to close a
send or receive windows.

While the hardware configurations required to open send and receive
windows differ, the configuration to close a window is the same for
both. So we use a single interface to close the window.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:37 +10:00
Sukadev Bhattiprolu 62c4eda4fa powerpc/powernv/vas: Define vas_rx_win_open() interface
Define the vas_rx_win_open() interface. This interface is intended to
be used by the Nest Accelerator (NX) driver(s) to setup receive
windows for one or more NX engines (which implement compression &
encryption algorithms in the hardware).

Follow-on patches will provide an interface to close the window and to
open a send window that kernel subsystems can use to access the NX
engines.

The interface to open a receive window is expected to be invoked for
each instance of VAS in the system.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:36 +10:00
Sukadev Bhattiprolu b6622a339e powerpc/powernv: Move GET_FIELD/SET_FIELD to vas.h
Move the GET_FIELD and SET_FIELD macros to vas.h as VAS and other
users of VAS, including NX-842 can use those macros.

There is a lot of related code between the VAS/NX kernel drivers
and skiboot. For consistency, switch the order of parameters in
SET_FIELD to match the order in skiboot.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:20 +10:00
Sukadev Bhattiprolu 967689141e powerpc/powernv/vas: Define macros, register fields and structures
Define macros for the VAS hardware registers and bit-fields as well
as couple of data structures needed by the VAS driver.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[mpe: Fixup include guard to use _ASM_POWERPC_VAS_H]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:17 +10:00
Alexey Kardashevskiy f1e08232ed powerpc/pci: Remove OF node back pointer from pci_dn
The check_req() helper uses pci_get_pdn() to get an OF node pointer.
pci_get_pdn() returns a pci_dn pointer which either:
1) from the OF node returned by pci_device_to_OF_node();
2) from the parent child_list where entries don't have OF node pointers.
Since check_req() does not care about 2), it can call
pci_device_to_OF_node() directly, hence the change.

The find_pe_dn() helper uses embedded pci_dn to get an OF node which is
also stored in edev->pdev so let's take a shortcut and call
pci_device_to_OF_node() directly.

With these 2 changes, we can finally get rid of the OF node back pointer.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:12 +10:00
Alexey Kardashevskiy 405b33a76d powerpc/eeh: Remove unnecessary config_addr from eeh_dev
The eeh_dev struct hold a config space address of an associated node
and the very same address is also stored in the pci_dn struct which
is always present during the eeh_dev lifetime.

This uses bus:devfn directly from pci_dn instead of cached and packed
config_addr.

Since config_addr is made from device's bus:dev.fn, there is no point
in keeping it in the debugfs either so remove that too.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:09 +10:00
Alexey Kardashevskiy 69672bd748 powerpc/eeh: Remove unnecessary pointer to phb from eeh_dev
The eeh_dev struct already holds a pointer to pci_dn which it does not
exist without and pci_dn itself holds the very same pointer so just
use it.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:09 +10:00
Alexey Kardashevskiy 8bae6a2319 powerpc/eeh: Reduce to one the number of places where edev is allocated
arch/powerpc/kernel/eeh_dev.c:57 is the only legit place where edev
is allocated; other 2 places allocate it on stack and in the heap for
a very short period of time to use eeh_pe_get() as takes edev.

This changes eeh_pe_get() to receive required parameters explicitly.

This removes unnecessary temporary allocation of edev.

This uses the "pe_no" name instead of the "pe_config_addr" name as
it actually is a PE number and not a config space address as it seemed.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:08 +10:00
Nicholas Piggin 6fcd6baa90 powerpc/powernv: Use kernel crash path for machine checks
There are quite a few machine check exceptions that can be caused by
kernel bugs. To make debugging easier, use the kernel crash path in
cases of synchronous machine checks that occur in kernel mode, if that
would not result in the machine going straight to panic or crash dump.

There is a downside here that die()ing the process in kernel mode can
still leave the system unstable. panic_on_oops will always force the
system to fail-stop, so systems where that behaviour is important will
still do the right thing.

As a test, when triggering an i-side 0111b error (ifetch from foreign
address) in kernel mode process context on POWER9, the kernel currently
dies quickly like this:

  Severe Machine check interrupt [Not recovered]
    NIP [ffff000000000000]: 0xffff000000000000
    Initiator: CPU
    Error type: Real address [Instruction fetch (foreign)]
  [  127.426651616,0] OPAL: Reboot requested due to Platform error.
      Effective[  127.426693712,3] OPAL: Reboot requested due to Platform error. address: ffff000000000000
  opal: Reboot type 1 not supported
  Kernel panic - not syncing: PowerNV Unrecovered Machine Check
  CPU: 56 PID: 4425 Comm: syscall Tainted: G   M            4.12.0-rc1-13857-ga4700a261072-dirty #35
  Call Trace:
  [  128.017988928,4] IPMI: BUG: Dropping ESEL on the floor due to
    buggy/mising code in OPAL for this BMC
    Rebooting in 10 seconds..
  Trying to free IRQ 496 from IRQ context!

After this patch, the process is killed and the kernel continues with
this message, which gives enough information to identify the offending
branch (i.e., with CFAR):

  Severe Machine check interrupt [Not recovered]
    NIP [ffff000000000000]: 0xffff000000000000
    Initiator: CPU
    Error type: Real address [Instruction fetch (foreign)]
      Effective address: ffff000000000000
  Oops: Machine check, sig: 7 [#1]
  SMP NR_CPUS=2048
  NUMA
  PowerNV
  Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 ...
  CPU: 22 PID: 4436 Comm: syscall Tainted: G   M            4.12.0-rc1-13857-ga4700a261072-dirty #36
  task: c000000932300000 task.stack: c000000932380000
  NIP: ffff000000000000 LR: 00000000217706a4 CTR: ffff000000000000
  REGS: c00000000fc8fd80 TRAP: 0200   Tainted: G   M             (4.12.0-rc1-13857-ga4700a261072-dirty)
  MSR: 90000000001c1003 <SF,HV,ME,RI,LE>
    CR: 24000484  XER: 20000000
  CFAR: c000000000004c80 DAR: 0000000021770a90 DSISR: 0a000000 SOFTE: 1
  GPR00: 0000000000001ebe 00007fffce4818b0 0000000021797f00 0000000000000000
  GPR04: 00007fff8007ac24 0000000044000484 0000000000004000 00007fff801405e8
  GPR08: 900000000280f033 0000000024000484 0000000000000000 0000000000000030
  GPR12: 9000000000001003 00007fff801bc370 0000000000000000 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR28: 00007fff801b0000 0000000000000000 00000000217707a0 00007fffce481918
  NIP [ffff000000000000] 0xffff000000000000
  LR [00000000217706a4] 0x217706a4
  Call Trace:
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:04 +10:00
Nicholas Piggin b746e3e01e powerpc/powernv: Flush console before platform error reboot
Unrecovered MCE and HMI errors are sent through a special restart OPAL
call to log the platform error. The downside is that they don't go
through normal Linux crash paths, so they don't give much information
to the Linux console.

Change this by providing a special crash function which does some of
the console flushing from the panic() path before calling firmware to
reboot.

The downside of this is a little more code to execute before reaching
the firmware reboot. However in practice, it's critical to get the
Linux console messages output in order to debug a problem. So this is
a desirable tradeoff.

Note on the implementation: It is difficult to plumb a custom reboot
handler into the panic path, because panic does a little bit too much
work. For example, it will try to delay with the timebase, but that
may be corrupted in some cases resulting in a hang without reaching
the platform reboot. Another problem is that panic can invoke the
crash dump code which is not what we want in the case of a hardware
platform error. Long-term the best solution will be to rework the
panic path so it can be suitable for this kind of panic, but for now
we just duplicate a bit of the code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:03 +10:00
Nicholas Piggin a3b2cb30f2 powerpc: Do not call ppc_md.panic in fadump panic notifier
If fadump is not registered, and no other crash or debug handlers are
registered, the powerpc panic handler stops the guest before the
generic panic code can push out debug information to the console.

Currently, system reset injection causes the guest to silently stop.

Stop calling ppc_md.panic in the panic notifier. crash_fadump already
does rtas_os_term() to terminate the guest if fadump is registered.

Remove ppc_md.panic. Move fadump panic notifier into fadump code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:01 +10:00
Nicholas Piggin 70412c55d4 powerpc/64: Fix watchdog configuration regressions
This fixes a couple more bits of fallout from the new hard lockup watchdog
patch.

It restores the required hw_nmi_get_sample_period() function for the
perf watchdog, and removes some function declarations on 64e that are only
defined for 64s. This fixes the 64e build when the hardlockup detector is
enabled.

It restores the default behaviour of disabling the perf watchdog, and also
fixes disabling the 64s watchdog when running as a guest.

Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-31 14:26:00 +10:00
Paul Mackerras 4dafecde44 Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the 'ppc-kvm' topic branch from the powerpc tree in
order to bring in some fixes which touch both powerpc and KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:37:03 +10:00
Ram Pai d182b8fd60 KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
In handling a H_ENTER hypercall, the code in kvmppc_do_h_enter
clobbers the high-order two bits of the storage key, which is stored
in a split field in the second doubleword of the HPTE.  Any storage
key number above 7 hence fails to operate correctly.

This makes sure we preserve all the bits of the storage key.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31 12:36:44 +10:00
Nicholas Piggin aafc8a8300 powerpc/64s: Move IDLE_STATE_ENTER_SEQ[_NORET] into idle_book3s.S
This macro is only used in idle_book3s.S, move it in there and add a
more descriptive comment.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out of larger patch and write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-29 21:38:47 +10:00
Michael Ellerman 82b7fcc005 Merge branch 'topic/ppc-kvm' into next
Merge Nicks commit to rework the KVM thread management, shared with the
KVM tree via the ppc-kvm topic branch.
2017-08-29 21:26:30 +10:00
Nicholas Piggin 94a04bc25a KVM: PPC: Book3S HV: POWER9 does not require secondary thread management
POWER9 CPUs have independent MMU contexts per thread, so KVM does not
need to quiesce secondary threads, so the hwthread_req/hwthread_state
protocol does not have to be used. So patch it away on POWER9, and patch
away the branch from the Linux idle wakeup to kvm_start_guest that is
never used.

Add a warning and error out of kvmppc_grab_hwthread in case it is ever
called on POWER9.

This avoids a hwsync in the idle wakeup path on POWER9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-29 14:48:59 +10:00
Linus Torvalds 42e6d5e5ee powerpc fixes for 4.13 #8
Just one fix, to add a barrier in the switch_mm() code to make sure the mm
 cpumask update is ordered vs the MMU starting to load translations. As far as we
 know no one's actually hit the bug, but that's just luck.
 
 Thanks to:
   Benjamin Herrenschmidt, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZoAZDAAoJEFHr6jzI4aWAi3AQAJq4boEBqdmL042oNK4PWW0M
 uGfehNmtzCw9Hp8bPfzOf8NypJ51Kw7eDQELaeSaazKW+gffUCBeEsKGS7kmHvc+
 x1tHxkXxI7PXuNIRojJg9y7rlKXdRym5SecvPSo1cm/c46RRWOlNGZaIwiHyrXSh
 eBjyP5EHu1HXpRxkcUh+//PQp2b+7SmgUYzSf0hA9UCtzSZSJr19DuY8uhetI9Ws
 AfjkO1uvb2KETqBVegGBpAruZzQtxqdtffd2HToSaCHUnAKma2iqUZqkqBNjL6OQ
 gSXWpXVInng/7ktrrfEgSiwlHns7pgHkxYHS8thDZqQpIt3GNsUg2UwpHGf6oL7V
 L+GtRp36LM91Ueq6KdlU7bJkmoiJ798Hnp3FOjpkqo+j/MGuCQDDDK4Ge1popehJ
 a17K7lE/FKGqNaFINc1Q6hnXg4MPyawAOLDlV839Ap5+ISPS6WcHaa1AgKjdQNkH
 fIkZZsYT531FIf853AjUGFw8frSlVfrHmIx9/HJOhEa1KHQhBqGRV1sWYEjuN6IB
 av+tQDlleG5aT641qhHlA/hN5DGrGZXLp8e6cFRufF+CSsRayL27u0Qw9pP9VZ3S
 bgfdnmZZyP23+bzaq/m/bjhRiOf0snSQPxIKe56KmNCJ8buTrGWDw4IuiPKB7Y6V
 06vBFn7ZUP5aeHIZkS62
 =IClj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
 "Just one fix, to add a barrier in the switch_mm() code to make sure
  the mm cpumask update is ordered vs the MMU starting to load
  translations. As far as we know no one's actually hit the bug, but
  that's just luck.

  Thanks to Benjamin Herrenschmidt, Nicholas Piggin"

* tag 'powerpc-4.13-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Ensure cpumask update is ordered
2017-08-25 17:32:35 -07:00
Jiri Slaby 30d6e0a419 futex: Remove duplicated code and fix undefined behaviour
There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

This effectively distributes the Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause undefined behaviour report.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
2017-08-25 22:49:59 +02:00
Benjamin Herrenschmidt 3a2df3798d powerpc/mm: Make switch_mm_irqs_off() out of line
It's too big to be inline, there is no reason to keep it
that way.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework to incorporate the comment changes via fixes branch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23 22:48:51 +10:00
Benjamin Herrenschmidt a619e59c07 powerpc/mm: Optimize detection of thread local mm's
Instead of comparing the whole CPU mask every time, let's
keep a counter of how many bits are set in the mask. Thus
testing for a local mm only requires testing if that counter
is 1 and the current CPU bit is set in the mask.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23 22:28:38 +10:00
Benjamin Herrenschmidt 058ccc3465 powerpc/mm: Avoid double irq save/restore in activate_mm
It calls switch_mm() which already does the irq save/restore
these days.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23 22:27:44 +10:00
Benjamin Herrenschmidt 43ed84a891 powerpc/mm: Move pgdir setting into a helper
Makes switch_mm_irqs_off() a bit more readable

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23 22:27:42 +10:00
Michael Ellerman 15c659ff9d Merge branch 'fixes' into next
There's a non-trivial dependency between some commits we want to put in
next and the KVM prefetch work around that went into fixes. So merge
fixes into next.
2017-08-23 22:20:10 +10:00
Ingo Molnar 94edf6f3c2 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Removal of spin_unlock_wait()
 - SRCU updates
 - Torture-test updates
 - Documentation updates
 - Miscellaneous fixes
 - CPU-hotplug fixes
 - Miscellaneous non-RCU fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-21 09:45:19 +02:00
Benjamin Herrenschmidt 1a92a80ad3 powerpc/mm: Ensure cpumask update is ordered
There is no guarantee that the various isync's involved with
the context switch will order the update of the CPU mask with
the first TLB entry for the new context being loaded by the HW.

Be safe here and add a memory barrier to order any subsequent
load/store which may bring entries into the TLB.

The corresponding barrier on the other side already exists as
pte updates use pte_xchg() which uses __cmpxchg_u64 which has
a sync after the atomic operation.

Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add comments in the code]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-18 13:07:16 +10:00
Paul E. McKenney 952111d7db arch: Remove spin_unlock_wait() arch-specific definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait() for all architectures providing them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
2017-08-17 08:08:59 -07:00
Aneesh Kumar K.V fa4531f753 powerpc/mm: Don't send IPI to all cpus on THP updates
Now that we made sure that lockless walk of linux page table is mostly
limitted to current task(current->mm->pgdir) we can update the THP
update sequence to only send IPI to CPUs on which this task has run.
This helps in reducing the IPI overload on systems with large number
of CPUs.

WRT kvm even though kvm is walking page table with vpc->arch.pgdir,
it is done only on secondary CPUs and in that case we have primary CPU
added to task's mm cpumask. Sending an IPI to primary will force the
secondary to do a vm exit and hence this mm cpumask usage is safe
here.

WRT CAPI, we still end up walking linux page table with capi context
MM. For now the pte lookup serialization sends an IPI to all CPUs in
CPI is in use. We can further improve this by adding the CAPI
interrupt handling CPU to task mm cpumask. That will be done in a
later patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 23:31:13 +10:00
Michael Ellerman 8434f0892e Merge branch 'topic/ppc-kvm' into next
Bring in the commit to rename find_linux_pte_or_hugepte() which touches
arch and KVM code, and might need to be merged with the kvmppc tree to
avoid conflicts.
2017-08-17 23:14:17 +10:00
Aneesh Kumar K.V 94171b19c3 powerpc/mm: Rename find_linux_pte_or_hugepte()
Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.

For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 23:13:46 +10:00
Naveen N. Rao 694fc88ce2 powerpc/string: Implement optimized memset variants
Based on Matthew Wilcox's patches for other architectures.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-17 23:04:35 +10:00
Dou Liyang 7baebe54a6 powerpc/topology: Remove the unused parent_node() macro
Commit a7be6e5a7f ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in POWERPC platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16 21:12:41 +10:00
Aneesh Kumar K.V 4ae279c2c9 powerpc/mm/hugetlb: Allow runtime allocation of 16G.
Now that we have GIGANTIC_PAGE enabled on powerpc, use this for 16G hugepages
with hash translation mode. Depending on the total system memory we have, we may
be able to allocate 16G hugepages runtime. This also remove the hugetlb setup
difference between hash/radix translation mode.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16 14:56:13 +10:00
Aneesh Kumar K.V 79cc38ded1 powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line
With commit aa888a7497 ("hugetlb: support larger than MAX_ORDER") we added
support for allocating gigantic hugepages via kernel command line. Switch
ppc64 arch specific code to use that.

W.r.t FSL support, we now limit our allocation range using BOOTMEM_ALLOC_ACCESSIBLE.

We use the kernel command line to do reservation of hugetlb pages on powernv
platforms. On pseries hash mmu mode the supported gigantic huge page size is
16GB and that can only be allocated with hypervisor assist. For pseries the
command line option doesn't do the allocation. Instead pseries does gigantic
hugepage allocation based on hypervisor hint that is specified via
"ibm,expected#pages" property of the memory node.

Cc: Scott Wood <oss@buserror.net>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-16 14:56:12 +10:00
Christophe Leroy ca8afd4046 powerpc/hugetlb: fix page rights verification in gup_hugepte()
gup_hugepte() checks if pages are present and readable, and
when  'write' is set, also checks if the pages are writable.

Initially this was done by checking if _PAGE_PRESENT and
_PAGE_READ were set. In addition, _PAGE_WRITE was verified for write
accesses.

The problem is that we have to handle the three following cases:
1/ The target defines __PAGE_READ and __PAGE_WRITE
2/ The target defines __PAGE_RW
3/ The target defines __PAGE_RO

In case 1/, this is obvious
In case 2/, __PAGE_READ is defined as 0 and __PAGE_WRITE as __PAGE_RW
so it works as well.
But in case 3, __PAGE_RW is defined as 0, which means __PAGE_WRITE is 0
and then the test returns true (page writable) in all cases.

A first correction was attempted in commit 6b8cb66a6a ("powerpc: Fix
usage of _PAGE_RO in hugepage"), but that fix is wrong:
instead of checking that the page is writable when write is requested,
it checks that the page is NOT writable when write is NOT requested.

This patch adds a new pte_read() helper to check whether a page is
readable or not. This avoids handling all possible cases in
gup_hugepte().

Then gup_hugepte() is modified to use pte_present(), pte_read()
and pte_write() instead of the raw flags.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:58 +10:00
Christophe Leroy 4cfac2f9c7 powerpc/mm: Simplify __set_fixmap()
__set_fixmap() uses __fix_to_virt() then does the boundary checks
by it self. Instead, we can use fix_to_virt() which does the
verification at build time. For this, we need to use it inline
so that GCC can see the real value of idx at buildtime.

In the meantime, we remove the 'fixmaps' variable.
This variable is set but has never been used from the beginning
(commit 2c419bdeca ("[POWERPC] Port fixmap from x86 and use
for kmap_atomic"))

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:58 +10:00
Christophe Leroy 86b19520e7 powerpc/mm: declare some local functions static
get_pteptr() and __mapin_ram_chunk() are only used locally,
so define them static

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:57 +10:00
Christophe Leroy 3184cc4b6f powerpc/mm: Fix kernel RAM protection after freeing unused memory on PPC32
As seen below, allthough the init sections have been freed, the
associated memory area is still marked as executable in the
page tables.

~ dmesg
[    5.860093] Freeing unused kernel memory: 592K (c0570000 - c0604000)

~ cat /sys/kernel/debug/kernel_page_tables
---[ Start of kernel VM ]---
0xc0000000-0xc0497fff        4704K  rw  X  present dirty accessed shared
0xc0498000-0xc056ffff         864K  rw     present dirty accessed shared
0xc0570000-0xc059ffff         192K  rw  X  present dirty accessed shared
0xc05a0000-0xc7ffffff      125312K  rw     present dirty accessed shared
---[ vmalloc() Area ]---

This patch fixes that.

The implementation is done by reusing the change_page_attr()
function implemented for CONFIG_DEBUG_PAGEALLOC

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 22:55:56 +10:00
Michael Ellerman 5b6c133e08 powerpc/mm/nohash: Move definition of PGALLOC_GFP to fix build errors
In some obscure Book3E configs (randconfig) we can end up missing a
definition for PGALLOC_GFP in pgtable_64.c.

Fix it by moving the definition to asm/pgalloc.h.

Fixes: de3b87611d ("powerpc/mm/book(e)(3s)/64: Add page table accounting")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15 20:02:56 +10:00
Christophe Leroy 3ee87674e0 powerpc/8xx: Use symbolic PVR value
For the 8xx, PVR values defined in arch/powerpc/include/asm/reg.h
are nowhere used.

Remove all defines and add PVR_8xx

Use it in arch/powerpc/kernel/cputable.c

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:32:18 +10:00
Christophe Leroy 968159c003 powerpc/8xx: Getting rid of remaining use of CONFIG_8xx
Two config options exist to define powerpc MPC8xx:
* CONFIG_PPC_8xx
* CONFIG_8xx

arch/powerpc/platforms/Kconfig.cputype has contained the following
comment about CONFIG_8xx item for some years:
"# this is temp to handle compat with arch=ppc"

arch/powerpc is now the only place with remaining use of
CONFIG_8xx: get rid of them.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:32:12 +10:00
Christophe Leroy 72e4b2cdf0 powerpc/time: refactor MFTB() to limit number of ifdefs
The 8xx cannot access the TBL and TBU registers using mfspr/mtspr
It must be accessed using mftb/mftbu

Due to this, there is a number of places with #ifdef CONFIG_8xx

This patch defines new macros MFTBL(x) and MFTBU(x) on the same model
as MFTB(x) and tries to make use of them as much as possible.

In arch/powerpc/include/asm/timex.h, we also remove the ifdef
for the asm() operands as the compiler doesn't mind unused operands

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:32:09 +10:00
Michael Ellerman d30a5a5262 powerpc/traps: Use SRR1 defines for program check reasons
Currently we open code the reason codes for program checks. Instead use
the existing SRR1 defines.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 23:31:32 +10:00
Shilpasri G Bhat bf9571550f powerpc/powernv: Add support to clear sensor groups data
Adds support for clearing different sensor groups. OCC inband sensor
groups like CSM, Profiler, Job Scheduler can be cleared using this
driver. The min/max of all sensors belonging to these sensor groups
will be cleared.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:40:05 +10:00
Shilpasri G Bhat 8e84b2d1f0 powerpc/powernv: Add support to set power-shifting-ratio
This patch adds support to set power-shifting-ratio which hints the
firmware how to distribute/throttle power between different entities
in a system (e.g CPU v/s GPU). This ratio is used by OCC for power
capping algorithm.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:40:01 +10:00
Shilpasri G Bhat cb8b340de2 powerpc/powernv: Add support for powercap framework
Adds a generic powercap framework to change the system powercap
inband through OPAL-OCC command/response interface.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:39:53 +10:00
Nicholas Piggin 04019bf8eb powerpc: Add irq accounting for watchdog interrupts
This adds an irq counter for the watchdog soft-NMI. This interrupt
only fires when interrupts are soft-disabled, so it will not
increment much even when the watchdog is running. However it's
useful for debugging and sanity checking.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:30:02 +10:00
Nicholas Piggin ca41ad4377 powerpc: Add irq accounting for system reset interrupts
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:30:01 +10:00
Andreas Schwab 8a583c0a8d powerpc: Fix invalid use of register expressions
binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals, for example:

  arch/powerpc/kernel/entry_64.S:535: Warning: invalid register expression

In practice these are almost all uses of r0 that should just be a
literal 0.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
[mpe: Mention r0 is almost always the culprit, fold in purgatory change]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10 22:29:41 +10:00
Peter Zijlstra a9668cd6ee locking: Remove smp_mb__before_spinlock()
Now that there are no users of smp_mb__before_spinlock() left, remove
it entirely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:03 +02:00
Peter Zijlstra d89e588ca4 locking: Introduce smp_mb__after_spinlock()
Since its inception, our understanding of ACQUIRE, esp. as applied to
spinlocks, has changed somewhat. Also, I wonder if, with a simple
change, we cannot make it provide more.

The problem with the comment is that the STORE done by spin_lock isn't
itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
it and cross with any prior STORE, rendering the default WMB
insufficient (pointed out by Alan).

Now, this is only really a problem on PowerPC and ARM64, both of
which already defined smp_mb__before_spinlock() as a smp_mb().

At the same time, we can get a much stronger construct if we place
that same barrier _inside_ the spin_lock(). In that case we upgrade
the RCpc spinlock to an RCsc.  That would make all schedule() calls
fully transitive against one another.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:02 +02:00
Michael Ellerman 21a0e8c14b powerpc/mm/hash64: Make vmalloc 56T on hash
On 64-bit book3s, with the hash MMU, we currently define the kernel
virtual space (vmalloc, ioremap etc.), to be 16T in size. This is a
leftover from pre v3.7 when our user VM was also 16T.

Of that 16T we split it 50/50, with half used for PCI IO and ioremap
and the other 8T for vmalloc.

We never bothered to make it any bigger because 8T of vmalloc ought to
be enough for anybody. But it turns out that's not true, the per cpu
allocator wants large amounts of vmalloc space, not to make large
allocations, but to allow a large stride between allocations, because
we use pcpu_embed_first_chunk().

With a bit of juggling we can increase the entire kernel virtual space
to 64T. The only real complication is the check of the address in the
SLB miss handler, see the comment in the code.

Although we could continue to split virtual space 50/50 as we do now,
no one seems to be running out of PCI IO or ioremap space. So instead
keep that as 8T, and use the remaining 56T for vmalloc.

In future we should be able to increase the kernel virtual space to
512T, the code already supports that, it just needs testing on older
hardware.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2017-08-08 19:37:05 +10:00
Michael Ellerman 63ee9b2ff9 powerpc/mm/book3s64: Make KERN_IO_START a variable
Currently KERN_IO_START is defined as:

 #define KERN_IO_START  (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1))

Although it looks like a constant, both the components are actually
variables, to allow us to have a different value between Radix and
Hash with a single kernel.

However that still requires both Radix and Hash to place the kernel IO
region at the same location relative to the start and end of the
kernel virtual region (namely 1/2 way through it), and we'd like to
change that.

So split KERN_IO_START out into its own variable, and initialise it
for Radix and Hash. In the medium term we should be able to
reconsolidate this, by doing a more involved rearrangement of the
location of the regions.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 19:37:04 +10:00
Matt Brown e66ca3db59 powerpc/powernv: Use darn instruction for get_random_seed() on Power9
This adds powernv_get_random_darn() which utilises the darn instruction,
introduced in ISA v3.0/POWER9.

The darn instruction can potentially return an error, which is supported
by the get_random_seed() API, in normal usage if we see an error we just
return that to the caller.

However when detecting whether darn is functional at boot we try up to
10 times, before deciding that darn doesn't work and failing the
registration of get_random_seed(). That way an intermittent failure
at boot doesn't deprive the system of randomness until the next reboot.

Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
[mpe: Move init into a function, tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 19:37:03 +10:00
Frederic Barrat 2552910084 powerpc/powernv: Enable PCI peer-to-peer
P9 has support for PCI peer-to-peer, enabling a device to write in the
MMIO space of another device directly, without interrupting the CPU.

This patch adds support for it on powernv, by adding a new API to be
called by drivers. The pnv_pci_set_p2p(...) call configures an
'initiator', i.e the device which will issue the MMIO operation, and a
'target', i.e. the device on the receiving side.

P9 really only supports MMIO stores for the time being but that's
expected to change in the future, so the API allows to define both
load and store operations.

  /* PCI p2p descriptor */
  #define OPAL_PCI_P2P_ENABLE           0x1
  #define OPAL_PCI_P2P_LOAD             0x2
  #define OPAL_PCI_P2P_STORE            0x4

  int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
                      u64 desc)

It uses a new OPAL call, as the configuration magic is done on the
PHBs by skiboot.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
[mpe: Drop unrelated OPAL calls, s/uint64_t/u64/, minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-08 11:27:30 +10:00
Linus Torvalds 841fe95303 powerpc fixes for 4.13 #5
Fixes for recently merged code:
  - a fix for the _PAGE_DEVMAP support, which was breaking KVM on Power9 radix
  - avoid a (harmless) lockdep warning in the early SMP code
  - return failure for some uses of dma_set_mask() rather than falling back to 32-bits
  - fix stack setup in watchdog soft_nmi_common() to use emergency stack
  - fix of_irq_to_resource() error check in of_fsl_spi_probe()
 
 Two fixes going to stable:
  - fix saving of Transactional Memory SPRs in core dump
  - fix __check_irq_replay missing decrementer interrupt
 
 And two misc:
  - fix 64-bit boot wrapper build with non-biarch compiler
  - work around a POWER9 PMU hang after state-loss idle
 
 Thanks to:
   Alistair Popple, Aneesh Kumar K.V, Cyril Bur, Gustavo Romero, Jose Ricardo
   Ziviani, Laurent Vivier, Nicholas Piggin, Oliver O'Halloran, Sergei Shtylyov,
   Suraj Jitindar Singh, Thomas Gleixner.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZhEH/AAoJEFHr6jzI4aWA8GoP+wdtaBYZIElFMnPay/BY8lSQ
 MODF6FlaexdNuVFo5In2BnMAs12bIXDO69wMziqc+uZIA+vI3K7lqsi+8p1R38Ts
 FG5OFLzX/gx1ykoeDe0mC8GPUHGVBdMr6+rHzt/9M7PrQWApCoBJLakYIElOrN5B
 EiAxunwJQISKNQYYmyVhjMVFq8ydMHSgtR7JVDHTh1AoStaO9M5hoJegD4Dnytaf
 F0/0Y5U+NGGWwPIyDrjB4wE9L8NCk0JfvpBhjORMkFSpAOe4VhxZrAGDyaCzVKVh
 YkxSirVKuVwEmvV2prmq4SAJi61EMUEPx2WL7hRN3ol/dF+2lU9nl2lyL3yRnGPY
 9mtNHur07qoMqyXI8750ayTzbMG75aakJF2be9G+zwNc3Sw8dqJvs2PwpSVdnima
 jWmWnRikws4qmEOEBV6nxtWf8qYACHevyK7YzIHDU8SVXLjAkX/u6LZGHAfTQ8Xo
 jYcISpiA2ko8YnMR7yy9FZqs4KGqeOotMamgJRGUcSY/9Xn8WDIrtQ4tIuJtGjzW
 XFOFafZOzyt6s1dNz01C4hzP60J+/CxPNRxx6zWnbijY6puADu/Mv1g+ai3nYdKA
 e/CYt6O8tzooFHCHtVLUPgWA82heDwNtjmlnSikWTdT07KLRhP+0w9Qq8gBzztKT
 7B/vPMjCMPQaJrrnj6GX
 =Rdy8
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Fixes for recently merged code:
   - a fix for the _PAGE_DEVMAP support, which was breaking KVM on
     Power9 radix
   - avoid a (harmless) lockdep warning in the early SMP code
   - return failure for some uses of dma_set_mask() rather than falling
     back to 32-bits
   - fix stack setup in watchdog soft_nmi_common() to use emergency
     stack
   - fix of_irq_to_resource() error check in of_fsl_spi_probe()

  Two fixes going to stable:
   - fix saving of Transactional Memory SPRs in core dump
   - fix __check_irq_replay missing decrementer interrupt

  And two misc:
   - fix 64-bit boot wrapper build with non-biarch compiler
   - work around a POWER9 PMU hang after state-loss idle

  Thanks to: Alistair Popple, Aneesh Kumar K.V, Cyril Bur, Gustavo
  Romero, Jose Ricardo Ziviani, Laurent Vivier, Nicholas Piggin, Oliver
  O'Halloran, Sergei Shtylyov, Suraj Jitindar Singh, Thomas Gleixner"

* tag 'powerpc-4.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64: Fix __check_irq_replay missing decrementer interrupt
  powerpc/perf: POWER9 PMU stops after idle workaround
  powerpc/83xx/mpc832x_rdb: fix of_irq_to_resource() error check
  powerpc/64s: Fix stack setup in watchdog soft_nmi_common()
  powerpc/powernv/pci: Return failure for some uses of dma_set_mask()
  powerpc/boot: Fix 64-bit boot wrapper build with non-biarch compiler
  powerpc/smp: Call smp_ops->setup_cpu() directly on the boot CPU
  powerpc/tm: Fix saving of TM SPRs in core dump
  powerpc/mm: Fix pmd/pte_devmap() on non-leaf entries
2017-08-04 09:56:54 -07:00
Benjamin Herrenschmidt 6ff4d3e966 powerpc: Remove old unused icswx based coprocessor support
We have a whole pile of unused code to maintain the ACOP register,
allocate coprocessor PIDs and handle ACOP faults. This mechanism
was used for the HFI adapter on POWER7 which is dead and gone and
whose driver never went upstream. It was used on some A2 core based
stuff that also never saw the light of day.

Take out all that code.

There is still some POWER8 coprocessor code that uses icswx but it's
kernel only and thus doesn't use any of that infrastructure.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:52 +10:00
Benjamin Herrenschmidt 870cfe77a9 powerpc/mm: Update definitions of DSISR bits
This updates the definitions for the various DSISR bits to
match both some historical stuff and to match new bits on
POWER9.

In addition, we define some masks corresponding to the "bad"
faults on Book3S, and some masks corresponding to the bits
that match between DSISR and SRR1 for a DSI and an ISI.

This comes with a small code update to change the definition
of DSISR_PGDIRFAULT which becomes DSISR_PRTABLE_FAULT to
match architecture 3.0B

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-03 16:06:43 +10:00
Benjamin Herrenschmidt 424de9c6e3 powerpc/mm/radix: Avoid flushing the PWC on every flush_tlb_range
We do that because it's used by THP pmd collapsing, so use
instead a dedicated flush function.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:06 +10:00
Benjamin Herrenschmidt a46cc7a90f powerpc/mm/radix: Improve TLB/PWC flushes
At the moment we have to rather sub-optimal flushing behaviours:

 - flush_tlb_mm() will flush the PWC which is unnecessary (for example
   when doing a fork)

 - A large unmap will call flush_tlb_pwc() multiple times causing us
   to perform that fairly expensive operation repeatedly. This happens
   often in batches of 3 on every new process.

So we change flush_tlb_mm() to only flush the TLB, and we use the
existing "need_flush_all" flag in struct mmu_gather to indicate
that the PWC needs flushing.

Unfortunately, flush_tlb_range() still needs to do a full flush
for now as it's used by the THP collapsing. We will fix that later.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-02 13:11:06 +10:00
Gautham R. Shenoy e1c1cfed54 powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle
The stop4 idle state on POWER9 is a deep idle state which loses
hypervisor resources, but whose latency is low enough that it can be
exposed via cpuidle.

Until now, the deep idle states which lose hypervisor resources (eg:
winkle) were only exposed via CPU-Hotplug.  Hence currently on wakeup
from such states, barring a few SPRs which need to be restored to
their older value, rest of the SPRS are reinitialized to their values
corresponding to that at boot time.

When stop4 is used in the context of cpuidle, we want these additional
SPRs to be restored to their older value, to ensure that the context
on the CPU coming back from idle is same as it was before going idle.

In this patch, we define a SPR save area in PACA (since we have used
up the volatile register space in the stack) and on POWER9, we restore
SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
SPRN_MMCR2 to the values they had before entering stop.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-01 21:01:20 +10:00
Michael Ellerman bb272221e9 Linux v4.13-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZapWhAAoJEHm+PkMAQRiGKb0IAJM6b7SbWaw69Og7+qiFB+zZ
 xp29iXqbE9fPISC6a5BRQV1ONjeDM6opGixGHqGC8Hla6k2IYz25VDNoF8wd0MXN
 cz/Ih20vd3C5afxXGe5cTT8lsPAlV0mWXxForlu6j8jPeL62FPfq6RhEkw7AcrYL
 yfYy3k3qSdOrrvBdII0WAAUi46UfIs+we9BQgbsMbkHOiqV2K0MOrzKE84Xbgepq
 RAy2xg6P4b4+hTx8xTrYc1MXwpnqjRc0oJ08gdmiwW3AOOU7LxYFn7zDkLPWi9Rr
 g4x6r4YhBTGxT4wNvovLIiqd9QFs//dMCuPWYwEtTICG48umIqqq24beQ0mvCdg=
 =08Ic
 -----END PGP SIGNATURE-----

Merge tag 'v4.13-rc1' into fixes

The fixes branch is based off a random pre-rc1 commit, because we had
some fixes that needed to go in before rc1 was released.

However we now need to fix some code that went in after that point, but
before rc1, so merge rc1 to get that code into fixes so we can fix it!
2017-07-31 20:20:29 +10:00
Linus Torvalds 080012bad6 powerpc fixes for 4.13 #4
The highlight is Ben's patch to work around a host killing bug when running KVM
 guests with the Radix MMU on Power9. See the long change log of that commit for
 more detail.
 
 And then three fairly minor fixes:
  - Fix of_node_put() underflow during reconfig remove, using old DLPAR tools.
  - Fix recently introduced ld version check with 64-bit LE-only toolchain.
  - Free the subpage_prot_table correctly, avoiding a memory leak.
 
 Thanks to:
   Aneesh Kumar K.V, Benjamin Herrenschmidt, Laurent Vivier.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZezIGAAoJEFHr6jzI4aWAyukP/3mxlQ3WdQlYPByZ18cj6YL5
 L0kbRxDgAosD9HzcOPqku1um1l7D6Gk5KZFZfsol7SasSmCZEwV4MbdTrAiRxS6K
 tF0V14hP/BDQeIKfxlnUepzfL8PY7CkO6sDAa6BjHXvBk4+POI+37uw9+2GEV8DY
 tA45fHjA/Zq3eUXsK0WTHIcd09lJXXarf9Tlx+YNZ+3yJ1OMfOji3CXgTkjtwYM9
 XTtsKzsagY1zLwr5gXJu1P05+OGna2VmY6+Tn2lnf7scTFW3qYGF3eWRx71diiKS
 PpZCjqfzWF4+TDIGPoYIrkTE+ZKR0lyo6F38GYwae0cYZMs9pGPEpeNahd8Nun+v
 MLU6TnhNfOI40GEYgmOMNKHPJJLSx59Qr/GnrAi/h2nUEocuN76jzNbaeFBtj3jD
 /vrRTmVUtt1wGqORX7BK4YZFHqcHmZBCM7bQnxibJtLv7fMue0sk58fs2jAaZ1iD
 NacpzsXG7CWYgj6ApclVCYuF99dXTpjrw/WPxilXDg84Pxb7Dv1SpIvLb2T5Guq+
 iqqavViRHP1ng+5/giIsOvF9CnsCzbRYLb0zZTP91nckMmYI6wX2zc56lofjcI5j
 Qc5o/aJvBk4vSM9sibBGEdrZJ1Vt16gGorQ5NZUurZund/cVqvQFhm/4Tvnc0cVN
 yvLNZI8am35pI9CCJ2im
 =Z6uF
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "The highlight is Ben's patch to work around a host killing bug when
  running KVM guests with the Radix MMU on Power9. See the long change
  log of that commit for more detail.

  And then three fairly minor fixes:

   - fix of_node_put() underflow during reconfig remove, using old DLPAR
     tools.

   - fix recently introduced ld version check with 64-bit LE-only
     toolchain.

   - free the subpage_prot_table correctly, avoiding a memory leak.

  Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Laurent Vivier"

* tag 'powerpc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/hash: Free the subpage_prot_table correctly
  powerpc/Makefile: Fix ld version check with 64-bit LE-only toolchain
  powerpc/pseries: Fix of_node_put() underflow during reconfig remove
  powerpc/mm/radix: Workaround prefetch issue with KVM
2017-07-28 13:25:15 -07:00
Oliver O'Halloran c9c98bc5cc powerpc/mm: Fix pmd/pte_devmap() on non-leaf entries
The Radix MMU translation tree as defined in ISA v3.0 contains two
different types of entry, directories and leaves. Leaves are
identified by _PAGE_PTE being set.

The formats of the two entries are different, with the directory
entries containing no spare bits for use by software. In particular
the bit we use for _PAGE_DEVMAP is not reserved for software, and is
part of the NLB (Next Level Base) field, essentially the address of
the next level in the tree.

Note that the Linux pte_t is not == _PAGE_PTE. A huge page pmd
entry (or devmap!) is also a leaf and so has _PAGE_PTE set, even
though we use a pmd_t for it in Linux.

The fix is to ensure that the pmd/pte_devmap() confirm they are
looking at a leaf entry (_PAGE_PTE) as well as checking _PAGE_DEVMAP.

Fixes: ebd3119793 ("powerpc/mm: Add devmap support for ppc64")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Add a comment in the code and flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-28 15:55:30 +10:00
Benjamin Herrenschmidt a25bd72bad powerpc/mm/radix: Workaround prefetch issue with KVM
There's a somewhat architectural issue with Radix MMU and KVM.

When coming out of a guest with AIL (Alternate Interrupt Location, ie,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.

The problem is that the CPU can (and will) then start prefetching or
speculatively load from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.

This can cause stale translations and subsequent crashes.

Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms but that would hurt single threaded programs for
example.

We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only use 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20 bits space.

We additionally add support for a property to indicate to Linux the
size of the PID register which will be useful if we eventually have
processors with a larger PID space available.

There is still an issue with malicious guests purposefully setting the
PID register to a value in the hosts PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:

 - On the way out of a guest, before we clear the current VCPU in the
   PACA, we check the PID and if it's outside of the permitted range
   we flush the TLB for that PID.

 - When context switching, if the mm is "new" on that CPU (the
   corresponding bit was set for the first time in the mm cpumask), we
   check if any sibling thread is in KVM (has a non-NULL VCPU pointer
   in the PACA). If that is the case, we also flush the PID for that
   CPU (core).

This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.

A future optimization could be added by keeping track of whether the
PID has ever been used and avoid doing that for completely fresh PIDs.
We could similarily mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
      unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26 16:41:52 +10:00
Madhavan Srinivasan 8f95faaac5 powerpc/powernv: Detect and create IMC device
Code to create platform device for the In-Memory Collection (IMC)
counters. Platform devices are created based on the IMC compatibility.
New header file created to contain the data structures and macros
needed for In-Memory Collection (IMC) counter pmu devices.

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes for these IMC
PMUs. Device probe() parses the device to locate three possible IMC
device types (Nest/Core/Thread). Function then branch to parse each
unit nodes to populate vital information such as device memory sizes,
event nodes information, base address for reserve memory access (if
any) and so on. Simple bare-minimum shutdown function added which only
"stops" the engines.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix build with CONFIG_PERF_EVENTS=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-25 22:55:27 +10:00
Madhavan Srinivasan 28a5db0061 powerpc/powernv: Add IMC OPAL APIs
In-Memory Collection (IMC) counters are performance monitoring
infrastructure. These counters need special sequence of SCOMs to
init/start/stop which is handled by OPAL. And OPAL provides three APIs
to init and control these IMC engines.

OPAL API documentation:
  https://github.com/open-power/skiboot/blob/master/doc/opal-api/opal-imc-counters.rst

Patch updates the kernel side powernv platform code to support the new
OPAL APIs

Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24 23:00:22 +10:00
Laurentiu Tudor 8f36479d0e powerpc: allow compiling with GENERIC_MSI_IRQ_DOMAIN
This allows building powerpc with the GENERIC_MSI_IRQ_DOMAIN
Kconfig by enabling the asm-generic msi.h in Kbuild. Without
this, there's a compilation error [1] because powerpc, as most
arches, doesn't provide an asm/msi.h.

[1] In file included from ./include/linux/kvm_host.h:20:0,
                 from ./arch/powerpc/include/asm/kvm_ppc.h:30,
                 from arch/powerpc/kernel/dbell.c:20:
./include/linux/msi.h:195:21: fatal error: asm/msi.h: No such file or directory

Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24 21:19:32 +10:00
Linus Torvalds ae75d1aefe TTY/Serial fixes for 4.13-rc2
Here are some small tty and serial driver fixes for 4.13-rc2.  Nothing
 huge at all, a revert of a patch that turned out to break things, a fix
 up for a new tty ioctl we added in 4.13-rc1 to get the uapi definition
 correct, and a few minor serial driver fixes for reported issues.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWXMebA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn9MgCdGsIEQlTvTeG4QikxUPB7dkEJQacAoLWOKJzQ
 A65mBAeOfiJztczi8uo4
 =KEXp
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are some small tty and serial driver fixes for 4.13-rc2. Nothing
  huge at all, a revert of a patch that turned out to break things, a
  fix up for a new tty ioctl we added in 4.13-rc1 to get the uapi
  definition correct, and a few minor serial driver fixes for reported
  issues.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: Fix TIOCGPTPEER ioctl definition
  tty: hide unused pty_get_peer function
  tty: serial: lpuart: Fix the logic for detecting the 32-bit type UART
  serial: imx: Prevent TX buffer PIO write when a DMA has been started
  Revert "serial: imx-serial - move DMA buffer configuration to DT"
  serial: sh-sci: Uninitialized variables in sysfs files
  serial: st-asc: Potential error pointer dereference
2017-07-22 09:00:24 -07:00
Linus Torvalds 10fc95547f powerpc fixes for 4.13 #3
A handful of fixes, mostly for new code.
 
 Some reworking of the new STRICT_KERNEL_RWX support to make sure we also remove
 executable permission from __init memory before it's freed.
 
 A fix to some recent optimisations to the hypercall entry where we were
 clobbering r12, this was breaking nested guests (PR KVM).
 
 A fix for the recent patch to opal_configure_cores(). This could break booting
 on bare metal Power8 boxes if the kernel was built without
 CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.
 
 And finally a workaround for spurious PMU interrupts on Power9 DD2.
 
 Thanks to:
   Nicholas Piggin, Anton Blanchard, Balbir Singh.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZcctRAAoJEFHr6jzI4aWAji8P/RaG3/vGmhP4HGsAh7jHx/3v
 EIBsjXPozQ16AIs4235JcQ2Vg54LLms6IaFfw1w9oYZxi98sNC3b56Rtd3AGxLbq
 WHEGAD+mE9aXUOXkycLRkKZ+THiNzR0ZQKiMDqU+oJ9XbIAoof2gI4nc7UzUxGnt
 EqxERiL2RrYqRBvO3W2ErtL29yXhopUhv+HKdbuMlIcoH4uooSdn435MPleIJWxl
 WRFFv17DFw7PN6juwlff2YWvbTgrM1ND8czLfINzSZQDy0F+HEbKDFbtHlgAvOFN
 GjbODX9OuU/QRrPQv6MoY2UumZNtLSCUsw5BUg0y4Qaak8p0gAEb4D/gmvIZNp/r
 9ZqJDPfFKh/oA08apLTPJBKQmDUwCaYHyHVJNw10CU9GZ9CzW/2/B2/h2Egpgqao
 vt0TgR3FXYizhXGAEpmBZzadAg9VKZTXH/irN6X62wMxNfvRIDOS57829QFHr0ka
 wbsOK1gI1rE1g1GwsZafyURRSe95Sv84RDHW1fFQAtvFgpA3eHl26LUkpcvwOOFI
 53g9zMedYOyZXXIdBn05QojwxgXZ0ry1Wc93oDHOn3rgDL6XBeOXFfFI5jIwPJDL
 VZhOSvN1v+Ti3Q+I05zH+ll6BMhlU8RKBf+esq419DiLUeaNLHZR6GF0+UVJyJo+
 WcLsJ1x/GDa9pOBfMEJy
 =Oy4/
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "A handful of fixes, mostly for new code:

   - some reworking of the new STRICT_KERNEL_RWX support to make sure we
     also remove executable permission from __init memory before it's
     freed.

   - a fix to some recent optimisations to the hypercall entry where we
     were clobbering r12, this was breaking nested guests (PR KVM).

   - a fix for the recent patch to opal_configure_cores(). This could
     break booting on bare metal Power8 boxes if the kernel was built
     without CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG.

   - .. and finally a workaround for spurious PMU interrupts on Power9
     DD2.

  Thanks to: Nicholas Piggin, Anton Blanchard, Balbir Singh"

* tag 'powerpc-4.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y
  powerpc/mm/hash: Refactor hash__mark_rodata_ro()
  powerpc/mm/radix: Refactor radix__mark_rodata_ro()
  powerpc/64s: Fix hypercall entry clobbering r12 input
  powerpc/perf: Avoid spurious PMU interrupts after idle
  powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()
2017-07-21 13:54:37 -07:00
Linus Torvalds 0a6109fd1b Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Ingo Molnar:
 "A fix to WARN_ON_ONCE() done by modules, plus a MAINTAINERS update"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debug: Fix WARN_ON_ONCE() for modules
  MAINTAINERS: Update the PTRACE entry
2017-07-21 10:41:19 -07:00
Josh Poimboeuf 325cdacd03 debug: Fix WARN_ON_ONCE() for modules
Mike Galbraith reported a situation where a WARN_ON_ONCE() call in DRM
code turned into an oops.  As it turns out, WARN_ON_ONCE() seems to be
completely broken when called from a module.

The bug was introduced with the following commit:

  19d436268d ("debug: Add _ONCE() logic to report_bug()")

That commit changed WARN_ON_ONCE() to move its 'once' logic into the bug
trap handler.  It requires a writable bug table so that the BUGFLAG_DONE
bit can be written to the flags to indicate the first warning has
occurred.

The bug table was made writable for vmlinux, which relies on
vmlinux.lds.S and vmlinux.lds.h for laying out the sections.  However,
it wasn't made writable for modules, which rely on the ELF section
header flags.

Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 19d436268d ("debug: Add _ONCE() logic to report_bug()")
Link: http://lkml.kernel.org/r/a53b04235a65478dd9afc51f5b329fdc65c84364.1500095401.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-20 12:31:04 +02:00
Michael Ellerman 029d9252b1 powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y
Currently even with STRICT_KERNEL_RWX we leave the __init text marked
executable after init, which is bad.

Add a hook to mark it NX (no-execute) before we free it, and implement
it for radix and hash.

Note that we use __init_end as the end address, not _einittext,
because overlaps_kernel_text() uses __init_end, because there are
additional executable sections other than .init.text between
__init_begin and __init_end.

Tested on radix and hash with:

  0:mon> p $__init_begin
  *** 400 exception occurred

Fixes: 1e0fc9d1eb ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-18 19:54:24 +10:00
Gleb Fotengauer-Malinovskiy c632517923 tty: Fix TIOCGPTPEER ioctl definition
This ioctl does nothing to justify an _IOC_READ or _IOC_WRITE flag
because it doesn't copy anything from/to userspace to access the
argument.

Fixes: 54ebbfb160 ("tty: add TIOCGPTPEER ioctl")
Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Acked-by: Aleksa Sarai <asarai@suse.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 17:04:41 +02:00
Linus Torvalds 89cbec71fe Merge branch 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uacess-unaligned removal from Al Viro:
 "That stuff had just one user, and an exotic one, at that - binfmt_flat
  on arm and m68k"

* 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill {__,}{get,put}_user_unaligned()
  binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail
2017-07-15 11:17:52 -07:00
Linus Torvalds deed9deb62 powerpc fixes for 4.13 #2
Nothing that really stands out, just a bunch of fixes that have come in in the
 last couple of weeks.
 
 None of these are actually fixes for code that is new in 4.13. It's roughly half
 older bugs, with fixes going to stable, and half fixes/updates for Power9.
 
 Thanks to:
   Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
   Madhavan Srinivasan, Michael Neuling, Nicholas Piggin, Oliver O'Halloran.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZaJKuAAoJEFHr6jzI4aWAojsQAI/Mnu3T9XCPbcuWJVHiNZu2
 CubNKFb9H+HdjRZlgdT7dpx0XITCNWziQmBpQZO+w1kwa3s5uv5qNwj8I30HyTFH
 qAArMq2Yb7GIW4PR8IpbSdCEWZii4YcIiLYk3kiEMyC2UIWeznBq+j79jFyITd9T
 beevhLLq7dMoLO8T2VRSdp4YhdLKNYmgjQG1YFh3YIh8s/tyPQ0OsDaxNk5z8wgt
 /htWiVQ2X/xWRHBnVKT4olc97pXqTvtaCdBWjD/fKNZK50YIwAjr2cJVoXpJidAW
 POwwMto7rjSJDYpE6IgqQnhz0VqRTNB/4QkyaeHJkt6BPhAaafkrHZYtDgc2GaUJ
 G96J6xT9bwHn8Jz1tpsxSgSA9A2GfBhdOUKMpIUVn1l9Q1ELPA0dJzzagLhgRDRG
 w97UQ+CJqvRNfJkRHcuJbNqP5i++6yOHcHGB8QLkmkSYVcCm9Hj/fMXj/DJ5tHt/
 c9t6uw25eqI7raFcxfh09+QTX8VDV96fa+GTo1HnCH71nGG2GqCoJFXyb3HEW64s
 gD7uOPOlFlqfpuGDmV1kmdwH0R15EIJb2b6QsDssrcHEg5fA6z/xcAcV839c9y47
 U8tcBiqFF0vWgB1QljTDrKuNsjg8dIeZfkSekBGD0WuBVLSADBl7f3b2k3BZB5H3
 WxosyR4ufSVfw53HaDsb
 =6eIT
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Nothing that really stands out, just a bunch of fixes that have come
  in in the last couple of weeks.

  None of these are actually fixes for code that is new in 4.13. It's
  roughly half older bugs, with fixes going to stable, and half
  fixes/updates for Power9.

  Thanks to: Aneesh Kumar K.V, Anton Blanchard, Balbir Singh, Benjamin
  Herrenschmidt, Madhavan Srinivasan, Michael Neuling, Nicholas Piggin,
  Oliver O'Halloran"

* tag 'powerpc-4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64: Fix atomic64_inc_not_zero() to return an int
  powerpc: Fix emulation of mfocrf in emulate_step()
  powerpc: Fix emulation of mcrf in emulate_step()
  powerpc/perf: Add POWER9 alternate PM_RUN_CYC and PM_RUN_INST_CMPL events
  powerpc/perf: Fix SDAR_MODE value for continous sampling on Power9
  powerpc/asm: Mark cr0 as clobbered in mftb()
  powerpc/powernv: Fix local TLB flush for boot and MCE on POWER9
  powerpc/mm/radix: Synchronize updates to the process table
  powerpc/mm/radix: Properly clear process table entry
  powerpc/powernv: Tell OPAL about our MMU mode on POWER9
  powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR
2017-07-14 15:33:15 -07:00
Michal Hocko dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Nicholas Piggin 2104180a53 powerpc/64s: implement arch-specific hardlockup watchdog
Implement an arch-speicfic watchdog rather than use the perf-based
hardlockup detector.

The new watchdog takes the soft-NMI directly, rather than going through
perf.  Perf interrupts are to be made maskable in future, so that would
prevent the perf detector from working in those regions.

Additionally, implement a SMP based detector where all CPUs watch one
another by pinging a shared cpumask.  This is because powerpc Book3S
does not have a true periodic local NMI, but some platforms do implement
a true NMI IPI.

If a CPU is stuck with interrupts hard disabled, the soft-NMI watchdog
does not work, but the SMP watchdog will.  Even on platforms without a
true NMI IPI to get a good trace from the stuck CPU, other CPUs will
notice the lockup sufficiently to report it and panic.

[npiggin@gmail.com: honor watchdog disable at boot/hotplug]
  Link: http://lkml.kernel.org/r/20170621001346.5bb337c9@roar.ozlabs.ibm.com
[npiggin@gmail.com: fix false positive warning at CPU unplug]
  Link: http://lkml.kernel.org/r/20170630080740.20766-1-npiggin@gmail.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170616065715.18390-6-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Don Zickus <dzickus@redhat.com>
Tested-by: Babu Moger <babu.moger@oracle.com>	[sparc]
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:02 -07:00
Michael Ellerman 01e6a61ace powerpc/64: Fix atomic64_inc_not_zero() to return an int
Although it's not documented anywhere, there is an expectation that
atomic64_inc_not_zero() returns a result which fits in an int. This is
the behaviour implemented on all arches except powerpc.

This has caused at least one bug in practice, in the percpu-refcount
code, where the long result from our atomic64_inc_not_zero() was
truncated to an int leading to lost references and stuck systems. That
was worked around in that code in commit 966d2b04e0 ("percpu-refcount:
fix reference leak during percpu-atomic transition").

To the best of my grepping abilities there are no other callers
in-tree which truncate the value, but we should fix it anyway. Because
the breakage is subtle and potentially very harmful I'm also tagging
it for stable.

Code generation is largely unaffected because in most cases the
callers are just using the result for a test anyway. In particular the
case of fget() that was mentioned in commit a6cf7ed511
("powerpc/atomic: Implement atomic*_inc_not_zero") generates exactly
the same code.

Fixes: a6cf7ed511 ("powerpc/atomic: Implement atomic*_inc_not_zero")
Cc: stable@vger.kernel.org # v3.4
Noticed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-12 21:49:55 +10:00
Oliver O'Halloran 2400fd822f powerpc/asm: Mark cr0 as clobbered in mftb()
The workaround for the CELL timebase bug does not correctly mark cr0 as
being clobbered. This means GCC doesn't know that the asm block changes cr0 and
might leave the result of an unrelated comparison in cr0 across the block, which
we then trash, leading to basically random behaviour.

Fixes: 859deea949 ("[POWERPC] Cell timebase bug workaround")
Cc: stable@vger.kernel.org # v2.6.19+
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Tweak change log and flag for stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-11 12:53:53 +10:00
Kees Cook 47ebb09d54 powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
Now that explicitly executed loaders are loaded in the mmap region, we
have more freedom to decide where we position PIE binaries in the
address space to avoid possible collisions with mmap or stack regions.

For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
address space for 32-bit pointers.  On 32-bit use 4MB, which is the
traditional x86 minimum load location, likely to avoid historically
requiring a 4MB page table entry when only a portion of the first 4MB
would be used (since the NULL address is avoided).

Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Benjamin Herrenschmidt 1c0eaf0f56 powerpc/powernv: Tell OPAL about our MMU mode on POWER9
That will allow OPAL to configure the CPU in an optimal way.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-10 21:12:27 +10:00
Linus Torvalds d691b7e7d1 powerpc updates for 4.13
Highlights include:
 
  - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
 
  - Platform support for FSP2 (476fpe) board
 
  - Enable ZONE_DEVICE on 64-bit server CPUs.
 
  - Generic & powerpc spin loop primitives to optimise busy waiting
 
  - Convert VDSO update function to use new update_vsyscall() interface
 
  - Optimisations to hypercall/syscall/context-switch paths
 
  - Improvements to the CPU idle code on Power8 and Power9.
 
 As well as many other fixes and improvements.
 
 Thanks to:
   Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton
   Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
   Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian
   Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan,
   Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo
   Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul
   Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell,
   Thiago Jung Bauermann, Yang Li.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZXyPCAAoJEFHr6jzI4aWAI9QQAISf2x5y//cqCi4ISyQB5pTq
 KLS/yQajNkQOw7c0fzBZOaH5Xd/SJ6AcKWDg8yDlpDR3+sRRsr98iIRECgKS5I7/
 DxD9ywcbSoMXFQQo1ZMCp5CeuMUIJRtugBnUQM+JXCSUCPbznCY5DchDTLyTBTpO
 MeMVhI//JxthhoOMA9MudiEGaYCU9ho442Z4OJUSiLUv8WRbvQX9pTqoc4vx1fxA
 BWf2mflztBVcIfKIyxIIIlDLukkMzix6gSYPMCbC7lzkbnU7JSqKiheJXjo1gJS2
 ePHKDxeNR2/QP0g/j3aT/MR1uTt9MaNBSX3gANE1xQ9OoJ8m1sOtCO4gNbSdLWae
 eXhDnoiEp930DRZOeEioOItuWWoxFaMyYk3BMmRKV4QNdYL3y3TRVeL2/XmRGqYL
 Lxz4IY/x5TteFEJNGcRX90uizNSi8AaEXPF16pUk8Ctt6eH3ZSwPMv2fHeYVCMr0
 KFlKHyaPEKEoztyzLcUR6u9QB56yxDN58bvLpd32AeHvKhqyxFoySy59x9bZbatn
 B2y8mmDItg860e0tIG6jrtplpOVvL8i5jla5RWFVoQDuxxrLAds3vG9JZQs+eRzx
 Fiic93bqeUAS6RzhXbJ6QQJYIyhE7yqpcgv7ME1W87SPef3HPBk9xlp3yIDwdA2z
 bBDvrRnvupusz8qCWrxe
 =w8rj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.

   - Platform support for FSP2 (476fpe) board

   - Enable ZONE_DEVICE on 64-bit server CPUs.

   - Generic & powerpc spin loop primitives to optimise busy waiting

   - Convert VDSO update function to use new update_vsyscall() interface

   - Optimisations to hypercall/syscall/context-switch paths

   - Improvements to the CPU idle code on Power8 and Power9.

  As well as many other fixes and improvements.

  Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
  Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
  Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
  Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
  Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
  Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
  Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
  Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
  Bauermann, Yang Li"

* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
  powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
  powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
  powerpc/mm/hash: Implement mark_rodata_ro() for hash
  powerpc/vmlinux.lds: Align __init_begin to 16M
  powerpc/lib/code-patching: Use alternate map for patch_instruction()
  powerpc/xmon: Add patch_instruction() support for xmon
  powerpc/kprobes/optprobes: Use patch_instruction()
  powerpc/kprobes: Move kprobes over to patch_instruction()
  powerpc/mm/radix: Fix execute permissions for interrupt_vectors
  powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
  powerpc/64s: Blacklist rtas entry/exit from kprobes
  powerpc/64s: Blacklist functions invoked on a trap
  powerpc/64s: Un-blacklist system_call() from kprobes
  powerpc/64s: Move system_call() symbol to just after setting MSR_EE
  powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
  powerpc/64s: Convert .L__replay_interrupt_return to a local label
  powerpc64/elfv1: Only dereference function descriptor for non-text symbols
  cxl: Export library to support IBM XSL
  powerpc/dts: Use #include "..." to include local DT
  powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
  ...
2017-07-07 13:55:45 -07:00
Linus Torvalds 9f45efb928 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few hotfixes

 - various misc updates

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits)
  mm, memory_hotplug: move movable_node to the hotplug proper
  mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
  mm, memory_hotplug: drop artificial restriction on online/offline
  mm: memcontrol: account slab stats per lruvec
  mm: memcontrol: per-lruvec stats infrastructure
  mm: memcontrol: use generic mod_memcg_page_state for kmem pages
  mm: memcontrol: use the node-native slab memory counters
  mm: vmstat: move slab statistics from zone to node counters
  mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
  mm/zswap.c: improve a size determination in zswap_frontswap_init()
  mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
  mm/swapfile.c: sort swap entries before free
  mm/oom_kill: count global and memory cgroup oom kills
  mm: per-cgroup memory reclaim stats
  mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
  mm: kmemleak: factor object reference updating out of scan_block()
  mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
  mm, mempolicy: don't check cpuset seqlock where it doesn't matter
  mm, cpuset: always use seqlock when changing task's nodemask
  mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  ...
2017-07-06 22:27:08 -07:00
Linus Torvalds dc502142b6 Merge branch 'uaccess.strlen' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull user access str* updates from Al Viro:
 "uaccess str...() dead code removal"

* 'uaccess.strlen' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  s390 keyboard.c: don't open-code strndup_user()
  mips: get rid of unused __strnlen_user()
  get rid of unused __strncpy_from_user() instances
  kill strlen_user()
2017-07-06 22:07:44 -07:00
Linus Torvalds c856863988 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc compat stuff updates from Al Viro:
 "This part is basically untangling various compat stuff. Compat
  syscalls moved to their native counterparts, getting rid of quite a
  bit of double-copying and/or set_fs() uses. A lot of field-by-field
  copyin/copyout killed off.

   - kernel/compat.c is much closer to containing just the
     copyin/copyout of compat structs. Not all compat syscalls are gone
     from it yet, but it's getting there.

   - ipc/compat_mq.c killed off completely.

   - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
     drivers/block/floppy.c where they belong. Yes, there are several
     drivers that implement some of the same ioctls. Some are m68k and
     one is 32bit-only pmac. drivers/block/floppy.c is the only one in
     that bunch that can be built on biarch"

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mqueue: move compat syscalls to native ones
  usbdevfs: get rid of field-by-field copyin
  compat_hdio_ioctl: get rid of set_fs()
  take floppy compat ioctls to sodding floppy.c
  ipmi: get rid of field-by-field __get_user()
  ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
  rt_sigtimedwait(): move compat to native
  select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
  put_compat_rusage(): switch to copy_to_user()
  sigpending(): move compat to native
  getrlimit()/setrlimit(): move compat to native
  times(2): move compat to native
  compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
  fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
  do_sigaltstack(): lift copying to/from userland into callers
  take compat_sys_old_getrlimit() to native syscall
  trim __ARCH_WANT_SYS_OLD_GETRLIMIT
2017-07-06 20:57:13 -07:00
Linus Torvalds f72e24a124 This is the first pull request for the new dma-mapping subsystem
In this new subsystem we'll try to properly maintain all the generic
 code related to dma-mapping, and will further consolidate arch code
 into common helpers.
 
 This pull request contains:
 
  - removal of the DMA_ERROR_CODE macro, replacing it with calls
    to ->mapping_error so that the dma_map_ops instances are
    more self contained and can be shared across architectures (me)
  - removal of the ->set_dma_mask method, which duplicates the
    ->dma_capable one in terms of functionality, but requires more
    duplicate code.
  - various updates for the coherent dma pool and related arm code
    (Vladimir)
  - various smaller cleanups (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlldmw0LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOiKA/+Ln1mFLSf3nfTzIHa24Bbk8ZTGr0B8TD4Vmyyt8iG
 oO3AeaTLn3d6ugbH/uih/tPz8PuyXsdiTC1rI/ejDMiwMTSjW6phSiIHGcStSR9X
 VFNhmMFacp7QpUpvxceV0XZYKDViAoQgHeGdp3l+K5h/v4AYePV/v/5RjQPaEyOh
 YLbCzETO+24mRWdJxdAqtTW4ovYhzj6XsiJ+pAjlV0+SWU6m5L5E+VAPNi1vqv1H
 1O2KeCFvVYEpcnfL3qnkw2timcjmfCfeFAd9mCUAc8mSRBfs3QgDTKw3XdHdtRml
 LU2WuA5cpMrOdBO4mVra2plo8E2szvpB1OZZXoKKdCpK3VGwVpVHcTvClK2Ks/3B
 GDLieroEQNu2ZIUIdWXf/g2x6le3BcC9MmpkAhnGPqCZ7skaIBO5Cjpxm0zTJAPl
 PPY3CMBBEktAvys6DcudOYGixNjKUuAm5lnfpcfTEklFdG0AjhdK/jZOplAFA6w4
 LCiy0rGHM8ZbVAaFxbYoFCqgcjnv6EjSiqkJxVI4fu/Q7v9YXfdPnEmE0PJwCVo5
 +i7aCLgrYshTdHr/F3e5EuofHN3TDHwXNJKGh/x97t+6tt326QMvDKX059Kxst7R
 rFukGbrYvG8Y7yXwrSDbusl443ta0Ht7T1oL4YUoJTZp0nScAyEluDTmrH1JVCsT
 R4o=
 =0Fso
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-4.13' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping infrastructure from Christoph Hellwig:
 "This is the first pull request for the new dma-mapping subsystem

  In this new subsystem we'll try to properly maintain all the generic
  code related to dma-mapping, and will further consolidate arch code
  into common helpers.

  This pull request contains:

   - removal of the DMA_ERROR_CODE macro, replacing it with calls to
     ->mapping_error so that the dma_map_ops instances are more self
     contained and can be shared across architectures (me)

   - removal of the ->set_dma_mask method, which duplicates the
     ->dma_capable one in terms of functionality, but requires more
     duplicate code.

   - various updates for the coherent dma pool and related arm code
     (Vladimir)

   - various smaller cleanups (me)"

* tag 'dma-mapping-4.13' of git://git.infradead.org/users/hch/dma-mapping: (56 commits)
  ARM: dma-mapping: Remove traces of NOMMU code
  ARM: NOMMU: Set ARM_DMA_MEM_BUFFERABLE for M-class cpus
  ARM: NOMMU: Introduce dma operations for noMMU
  drivers: dma-mapping: allow dma_common_mmap() for NOMMU
  drivers: dma-coherent: Introduce default DMA pool
  drivers: dma-coherent: Account dma_pfn_offset when used with device tree
  dma: Take into account dma_pfn_offset
  dma-mapping: replace dmam_alloc_noncoherent with dmam_alloc_attrs
  dma-mapping: remove dmam_free_noncoherent
  crypto: qat - avoid an uninitialized variable warning
  au1100fb: remove a bogus dma_free_nonconsistent call
  MAINTAINERS: add entry for dma mapping helpers
  powerpc: merge __dma_set_mask into dma_set_mask
  dma-mapping: remove the set_dma_mask method
  powerpc/cell: use the dma_supported method for ops switching
  powerpc/cell: clean up fixed mapping dma_ops initialization
  tile: remove dma_supported and mapping_error methods
  xen-swiotlb: remove xen_swiotlb_set_dma_mask
  arm: implement ->dma_supported instead of ->set_dma_mask
  mips/loongson64: implement ->dma_supported instead of ->set_dma_mask
  ...
2017-07-06 19:20:54 -07:00
Linus Torvalds c136b84393 PPC:
- Better machine check handling for HV KVM
 - Ability to support guests with threads=2, 4 or 8 on POWER9
 - Fix for a race that could cause delayed recognition of signals
 - Fix for a bug where POWER9 guests could sleep with interrupts pending.
 
 ARM:
 - VCPU request overhaul
 - allow timer and PMU to have their interrupt number selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisonning
 - the usual crop of fixes and cleanups
 
 s390:
 - initial machine check forwarding
 - migration support for the CMMA page hinting information
 - cleanups and fixes
 
 x86:
 - nested VMX bugfixes and improvements
 - more reliable NMI window detection on AMD
 - APIC timer optimizations
 
 Generic:
 - VCPU request overhaul + documentation of common code patterns
 - kvm_stat improvements
 
 There is a small conflict in arch/s390 due to an arch-wide field rename.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJZW4XTAAoJEL/70l94x66DkhMH/izpk54KI17PtyQ9VYI2sYeZ
 BWK6Kl886g3ij4pFi3pECqjDJzWaa3ai+vFfzzpJJ8OkCJT5Rv4LxC5ERltVVmR8
 A3T1I/MRktSC0VJLv34daPC2z4Lco/6SPipUpPnL4bE2HATKed4vzoOjQ3tOeGTy
 dwi7TFjKwoVDiM7kPPDRnTHqCe5G5n13sZ49dBe9WeJ7ttJauWqoxhlYosCGNPEj
 g8ZX8+cvcAhVnz5uFL8roqZ8ygNEQq2mgkU18W8ZZKuiuwR0gdsG0gSBFNTdwIMK
 NoreRKMrw0+oLXTIB8SZsoieU6Qi7w3xMAMabe8AJsvYtoersugbOmdxGCr1lsA=
 =OD7H
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "PPC:
   - Better machine check handling for HV KVM
   - Ability to support guests with threads=2, 4 or 8 on POWER9
   - Fix for a race that could cause delayed recognition of signals
   - Fix for a bug where POWER9 guests could sleep with interrupts pending.

  ARM:
   - VCPU request overhaul
   - allow timer and PMU to have their interrupt number selected from userspace
   - workaround for Cavium erratum 30115
   - handling of memory poisonning
   - the usual crop of fixes and cleanups

  s390:
   - initial machine check forwarding
   - migration support for the CMMA page hinting information
   - cleanups and fixes

  x86:
   - nested VMX bugfixes and improvements
   - more reliable NMI window detection on AMD
   - APIC timer optimizations

  Generic:
   - VCPU request overhaul + documentation of common code patterns
   - kvm_stat improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  Update my email address
  kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
  x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
  kvm: x86: mmu: allow A/D bits to be disabled in an mmu
  x86: kvm: mmu: make spte mmio mask more explicit
  x86: kvm: mmu: dead code thanks to access tracking
  KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
  KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
  KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
  KVM: x86: remove ignored type attribute
  KVM: LAPIC: Fix lapic timer injection delay
  KVM: lapic: reorganize restart_apic_timer
  KVM: lapic: reorganize start_hv_timer
  kvm: nVMX: Check memory operand to INVVPID
  KVM: s390: Inject machine check into the nested guest
  KVM: s390: Inject machine check into the guest
  tools/kvm_stat: add new interactive command 'b'
  tools/kvm_stat: add new command line switch '-i'
  tools/kvm_stat: fix error on interactive command 'g'
  KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
  ...
2017-07-06 18:38:31 -07:00
Aneesh Kumar K.V 40692eb5ee powerpc/mm/hugetlb: add support for 1G huge pages
POWER9 supports hugepages of size 2M and 1G in radix MMU mode.  This
patch enables the usage of 1G page size for hugetlbfs.  This also update
the helper such we can do 1G page allocation at runtime.

We still don't enable 1G page size on DD1 version.  This is to avoid
doing workaround mentioned in commit 6d3a0379eb ("powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()").

Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Linus Torvalds 5518b69b76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
  merge window:

   1) Several optimizations for UDP processing under high load from
      Paolo Abeni.

   2) Support pacing internally in TCP when using the sch_fq packet
      scheduler for this is not practical. From Eric Dumazet.

   3) Support mutliple filter chains per qdisc, from Jiri Pirko.

   4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

   5) Add batch dequeueing to vhost_net, from Jason Wang.

   6) Flesh out more completely SCTP checksum offload support, from
      Davide Caratti.

   7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
      Neira Ayuso, and Matthias Schiffer.

   8) Add devlink support to nfp driver, from Simon Horman.

   9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
      Prabhu.

  10) Add stack depth tracking to BPF verifier and use this information
      in the various eBPF JITs. From Alexei Starovoitov.

  11) Support XDP on qed device VFs, from Yuval Mintz.

  12) Introduce BPF PROG ID for better introspection of installed BPF
      programs. From Martin KaFai Lau.

  13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

  14) For loads, allow narrower accesses in bpf verifier checking, from
      Yonghong Song.

  15) Support MIPS in the BPF selftests and samples infrastructure, the
      MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
      Daney.

  16) Support kernel based TLS, from Dave Watson and others.

  17) Remove completely DST garbage collection, from Wei Wang.

  18) Allow installing TCP MD5 rules using prefixes, from Ivan
      Delalande.

  19) Add XDP support to Intel i40e driver, from Björn Töpel

  20) Add support for TC flower offload in nfp driver, from Simon
      Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
      Kicinski, and Bert van Leeuwen.

  21) IPSEC offloading support in mlx5, from Ilan Tayari.

  22) Add HW PTP support to macb driver, from Rafal Ozieblo.

  23) Networking refcount_t conversions, From Elena Reshetova.

  24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
      for tuning the TCP sockopt settings of a group of applications,
      currently via CGROUPs"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
  net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
  dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
  cxgb4: Support for get_ts_info ethtool method
  cxgb4: Add PTP Hardware Clock (PHC) support
  cxgb4: time stamping interface for PTP
  nfp: default to chained metadata prepend format
  nfp: remove legacy MAC address lookup
  nfp: improve order of interfaces in breakout mode
  net: macb: remove extraneous return when MACB_EXT_DESC is defined
  bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
  bpf: fix return in load_bpf_file
  mpls: fix rtm policy in mpls_getroute
  net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
  net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
  net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
  net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
  ...
2017-07-05 12:31:59 -07:00
Linus Torvalds 9a715cd543 TTY/Serial patches for 4.13-rc1
Here is the large tty/serial patchset for 4.13-rc1.
 
 A lot of tty and serial driver updates are in here, along with some
 fixups for some __get/put_user usages that were reported.  Nothing huge,
 just lots of development by a number of different developers, full
 details in the shortlog.
 
 All of these have been in linux-next for a while.  There will be a merge
 issue with the arm-soc tree in the include/linux/platform_data/atmel.h
 file.  Stephen has sent out a fixup for it, so it shouldn't be that
 difficult to merge.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWVpZ9w8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylkTgCfV2HhbxIph/aEL1nJmwW64oCXFrMAoK59ZH65
 tBZIosv0d91K1A+mObBT
 =adPL
 -----END PGP SIGNATURE-----

Merge tag 'tty-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial updates from Greg KH:
 "Here is the large tty/serial patchset for 4.13-rc1.

  A lot of tty and serial driver updates are in here, along with some
  fixups for some __get/put_user usages that were reported. Nothing
  huge, just lots of development by a number of different developers,
  full details in the shortlog.

  All of these have been in linux-next for a while"

* tag 'tty-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (71 commits)
  tty: serial: lpuart: add a more accurate baud rate calculation method
  tty: serial: lpuart: add earlycon support for imx7ulp
  tty: serial: lpuart: add imx7ulp support
  dt-bindings: serial: fsl-lpuart: add i.MX7ULP support
  tty: serial: lpuart: add little endian 32 bit register support
  tty: serial: lpuart: refactor lpuart32_{read|write} prototype
  tty: serial: lpuart: introduce lpuart_soc_data to represent SoC property
  serial: imx-serial - move DMA buffer configuration to DT
  serial: imx: Enable RTSD only when needed
  serial: imx: Remove unused members from imx_port struct
  serial: 8250: 8250_omap: Fix race b/w dma completion and RX timeout
  serial: 8250: Fix THRE flag usage for CAP_MINI
  tty/serial: meson_uart: update to stable bindings
  dt-bindings: serial: Add bindings for the Amlogic Meson UARTs
  serial: Delete dead code for CIR serial ports
  serial: sirf: make of_device_ids const
  serial/mpsc: switch to dma_alloc_attrs
  tty: serial: Add Actions Semi Owl UART earlycon
  dt-bindings: serial: Document Actions Semi Owl UARTs
  tty/serial: atmel: make the driver DT only
  ...
2017-07-03 20:04:16 -07:00
Balbir Singh cd65d69713 powerpc/mm/hash: Implement mark_rodata_ro() for hash
With hash we update the bolted pte to mark it read-only. We rely
on the MMU_FTR_KERNEL_RO to generate the correct permissions
for read-only text. The radix implementation just prints a warning
in this implementation

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Make the warning louder when we don't have MMU_FTR_KERNEL_RO]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-04 11:35:16 +10:00
Linus Torvalds 9a9594efe5 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP hotplug updates from Thomas Gleixner:
 "This update is primarily a cleanup of the CPU hotplug locking code.

  The hotplug locking mechanism is an open coded RWSEM, which allows
  recursive locking. The main problem with that is the recursive nature
  as it evades the full lockdep coverage and hides potential deadlocks.

  The rework replaces the open coded RWSEM with a percpu RWSEM and
  establishes full lockdep coverage that way.

  The bulk of the changes fix up recursive locking issues and address
  the now fully reported potential deadlocks all over the place. Some of
  these deadlocks have been observed in the RT tree, but on mainline the
  probability was low enough to hide them away."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  cpu/hotplug: Constify attribute_group structures
  powerpc: Only obtain cpu_hotplug_lock if called by rtasd
  ARM/hw_breakpoint: Fix possible recursive locking for arch_hw_breakpoint_init
  cpu/hotplug: Remove unused check_for_tasks() function
  perf/core: Don't release cred_guard_mutex if not taken
  cpuhotplug: Link lock stacks for hotplug callbacks
  acpi/processor: Prevent cpu hotplug deadlock
  sched: Provide is_percpu_thread() helper
  cpu/hotplug: Convert hotplug locking to percpu rwsem
  s390: Prevent hotplug rwsem recursion
  arm: Prevent hotplug rwsem recursion
  arm64: Prevent cpu hotplug rwsem recursion
  kprobes: Cure hotplug lock ordering issues
  jump_label: Reorder hotplug lock and jump_label_lock
  perf/tracing/cpuhotplug: Fix locking order
  ACPI/processor: Use cpu_hotplug_disable() instead of get_online_cpus()
  PCI: Replace the racy recursion prevention
  PCI: Use cpu_hotplug_disable() instead of get_online_cpus()
  perf/x86/intel: Drop get_online_cpus() in intel_snb_check_microcode()
  x86/perf: Drop EXPORT of perf_check_microcode
  ...
2017-07-03 18:08:06 -07:00
Al Viro 3170d8d226 kill {__,}{get,put}_user_unaligned()
no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-03 18:44:22 -04:00
Naveen N. Rao 83e840c770 powerpc64/elfv1: Only dereference function descriptor for non-text symbols
Currently, we assume that the function pointer we receive in
ppc_function_entry() points to a function descriptor. However, this is
not always the case. In particular, assembly symbols without the right
annotation do not have an associated function descriptor. Some of these
symbols are added to the kprobe blacklist using _ASM_NOKPROBE_SYMBOL().

When such addresses are subsequently processed through
arch_deref_entry_point() in populate_kprobe_blacklist(), we see the
below errors during bootup:
    [    0.663963] Failed to find blacklist at 7d9b02a648029b6c
    [    0.663970] Failed to find blacklist at a14d03d0394a0001
    [    0.663972] Failed to find blacklist at 7d5302a6f94d0388
    [    0.663973] Failed to find blacklist at 48027d11e8610178
    [    0.663974] Failed to find blacklist at f8010070f8410080
    [    0.663976] Failed to find blacklist at 386100704801f89d
    [    0.663977] Failed to find blacklist at 7d5302a6f94d00b0

Fix this by checking if the function pointer we receive in
ppc_function_entry() already points to kernel text. If so, we just
return it as is. If not, we assume that this is a function descriptor
and proceed to dereference it.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03 23:08:50 +10:00
Christophe Lombard 3ced8d7300 cxl: Export library to support IBM XSL
This patch exports a in-kernel 'library' API which can be called by
other drivers to help interacting with an IBM XSL on a POWER9 system.

The XSL (Translation Service Layer) is a stripped down version of the
PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
Like the PSL, it implements the CAIA architecture, but has a number
of differences, mostly in it's implementation dependent registers.

The XSL also uses a special DMA cxl mode, which uses a slightly
different init sequence for the CAPP and PHB.

Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03 23:07:03 +10:00
Michael Ellerman 218ea31039 Merge branch 'fixes' into next
Merge our fixes branch, a few of them are tripping people up while
working on top of next, and we also have a dependency between the CXL
fixes and new CXL code we want to merge into next.
2017-07-03 23:05:43 +10:00
Paolo Bonzini 8a53e7e572 Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
- Better machine check handling for HV KVM
- Ability to support guests with threads=2, 4 or 8 on POWER9
- Fix for a race that could cause delayed recognition of signals
- Fix for a bug where POWER9 guests could sleep with interrupts
  pending.
2017-07-03 10:41:59 +02:00
Oliver O'Halloran ebd3119793 powerpc/mm: Add devmap support for ppc64
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S.  This
is used to differentiate device backed memory from transparent huge
pages since they are handled in more or less the same manner by the core
mm code.

Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:28 +10:00
Nicholas Piggin 4e287e655e powerpc: use spin loop primitives in some functions
Use the different spin loop primitives in some simple powerpc
spin loops, including those which will spin as a common case.

This will help to test the spin loop primitives before more
conversions are done.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add some includes of <linux/processor.h>]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:24 +10:00
Nicholas Piggin ede8e2bbb0 powerpc/64: implement spin loop primitives
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-02 20:40:17 +10:00
Paul Mackerras 898b25b202 KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
Since commit b009031f74 ("KVM: PPC: Book3S HV: Take out virtual
core piggybacking code", 2016-09-15), we only have at most one
vcore per subcore.  Previously, the fact that there might be more
than one vcore per subcore meant that we had the notion of a
"master vcore", which was the vcore that controlled thread 0 of
the subcore.  We also needed a list per subcore in the core_info
struct to record which vcores belonged to each subcore.  Now that
there can only be one vcore in the subcore, we can replace the
list with a simple pointer and get rid of the notion of the
master vcore (and in fact treat every vcore as a master vcore).

We can also get rid of the subcore_vm[] field in the core_info
struct since it is never read.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-07-01 18:59:01 +10:00
Linus Torvalds b4df2e3537 powerpc fixes for 4.12 #8
Two fixes for code we merged this cycle:
 
  - cxl: Fixes for Coherent Accelerator Interface Architecture 2.0
  - Avoid miscompilation w/GCC 4.6.3 on 32-bit - don't inline copy_to/from_user()
 
 Thanks to:
   Al Viro, Larry Finger, Christophe Lombard.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZViCDAAoJEFHr6jzI4aWA5eMQAMBGbDU3k+OHT2kuZg1Obnyo
 HADdBg1ZcCZ4MI0xOTiFb4ETsUcXcazGle6N1z/RjNYLA0KJobV5b+t/i+ybGtz2
 0a+35j7G7i+rxBMkWFfGUgZewwWPZkOry4BmXyQHHHeVnEOyF6jj/pbm22oedf1o
 NCogUbWKhxm2YqYzftfur09dG00T59mAKQ7BeHMkhR3p6lbOD/sMZPiquXO2cV2C
 78buxYCl1SqAx2yyPrmSBbVxUF5+PKvANaniQL+jYe7fC9GVNUoJJ5Dh0NCgvqKJ
 r9u8/1K9hSCAZDGhOWePPCFnqLH4hnyFN8m8S94tMNFnK3VDhoy+45GJ+7x6RCGH
 7Xvi6qef6n2jqrj7pggsPu3NKGtd8mmBVcPOxjdyPI6R2QZeRbdrx7NyvNB3xDDF
 rUsju/aHjJJPKDIq4hbDJTMSWQMe5+Bb8aEKOYupEQ/X//MFqz8gukVcQCJNU6Pn
 0TbOE+FUSgICY8IB2rI7UBa+rKKM8VDcg1rz0YYSCGfDOccMfq9IxAlihe4y3fpz
 KzuKnkCQBVT6+Q6AayqZlqVttWU+eIG/cm9dHS9bPXDKb0XyoOSl0ZcytflmlFR9
 xsZxD7/69DoRpdV0t0kpiLK9lWd3QhPaSukhn/aoUGXsFcMeJTYpsinuvVNi3hFh
 ldhIKrQbvY7k0s7xGOCi
 =Yq9i
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.12-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Hopefully the last two powerpc fixes for 4.12.

  The CXL one is larger than I'd usually send at rc7, but it fixes new
  code this cycle, so better to have it working for the release. It was
  actually sent a few weeks back but got blocked in testing behind
  another fix that was causing issues.

  We are still tracking one crash in v4.12-rc7, but only one person has
  reproduced it and the commit identified by bisect doesn't touch any of
  the relevant code, so I think it's 50/50 whether that commit is
  actually the problem or it's some code layout / toolchain issue.

  Two fixes for code we merged this cycle:

   - cxl: Fixes for Coherent Accelerator Interface Architecture 2.0

   - Avoid miscompilation w/GCC 4.6.3 on 32-bit - don't inline
     copy_to/from_user()

  Thanks to Al Viro, Larry Finger, Christophe Lombard"

* tag 'powerpc-4.12-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user()
  cxl: Fixes for Coherent Accelerator Interface Architecture 2.0
2017-06-30 10:55:34 -07:00
David S. Miller b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
Paolo Bonzini 04a7ea04d5 KVM/ARM updates for 4.13
- vcpu request overhaul
 - allow timer and PMU to have their interrupt number
   selected from userspace
 - workaround for Cavium erratum 30115
 - handling of memory poisonning
 - the usual crop of fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAllWCM0VHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDjJ0QAI16x6+trKhH31lTSYekYfqm4hZ2
 Fp7IbALW9KNCaY35tZov2Zuh99qGRduxTh7ewqhKpON8kkU+UKj0F7zH22+vfN4m
 yas/+uNr8R9VLyvea4ysPsgx8Q8v1Ix9setohHYNZIL9/klVqtaHpYvArHVF/mzq
 p2j/NxRS2dlp9r2TtoMRMhA05u6r0wolhUuh+z9v2ipib0gfOBIG24jsqCTEcD9n
 5A/cVd+ztYshkrV95h3y9peahwt3zOA4QBGzrQ2K25jp0s54nqhmC7JTNSa8dtar
 YGW2MuAMoIFTwCFAlpwCzrwpOJFzF3Q6A8bOxei2fjclzjPMgT1xQxuhOoe4ntFa
 lTPxSHalm5W6dFTW90YSo2DBcPe+N7sQkhjR0cCeY3GYsOFhXMLTlOl5Pt1YK1or
 +3FAI74tFRKvVmb9mhZeGTvuzhDgRvtf3Qq5rjwlGzKc2BBOEgtMyj/Wgwo4N6Dz
 IjOnoRaUGELoBCWoTorMxLpsPBdPVSUxNyJTdAhqZ/ZtT1xqjhFNLZcrVWmOTzDM
 1cav+jZkla4sLmJSNDD54aCSvvtPHis0nZn9PRlh12xgOyYiAVx4K++MNuWP0P37
 hbh1gbPT+FcoVxPurUsX/pjNlTucPZcBwFytZDQlpwtPBpEFzJiImLYe/PldRb0f
 9WQOH1Y1+q14MF+N
 =6hNK
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.13

- vcpu request overhaul
- allow timer and PMU to have their interrupt number
  selected from userspace
- workaround for Cavium erratum 30115
- handling of memory poisonning
- the usual crop of fixes and cleanups

Conflicts:
	arch/s390/include/asm/kvm_host.h
2017-06-30 12:38:26 +02:00
Tobias Klauser 6474924e2b arch: remove unused macro/function thread_saved_pc()
The only user of thread_saved_pc() in non-arch-specific code was removed
in commit 8243d55977 ("sched/core: Remove pointless printout in
sched_show_task()").  Remove the implementations as well.

Some architectures use thread_saved_pc() in their arch-specific code.
Leave their thread_saved_pc() intact.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-28 16:13:57 -07:00
Christoph Hellwig a9a7b06f58 powerpc: merge __dma_set_mask into dma_set_mask
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-28 06:54:55 -07:00
Christoph Hellwig 6009faa43f powerpc: implement ->mapping_error
DMA_ERROR_CODE is going to go away, so don't rely on it.  Instead
define a ->mapping_error method for all IOMMU based dma operation
instances.  The direct ops don't ever return an error and don't
need a ->mapping_error method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-28 06:54:33 -07:00
Hari Bathini eae0dfcc44 powerpc/fadump: avoid holes in boot memory area when fadump is registered
To register fadump, boot memory area - the size of low memory chunk that
is required for a kernel to boot successfully when booted with restricted
memory, is assumed to have no holes. But this memory area is currently
not protected from hot-remove operations. So, fadump could fail to
re-register after a memory hot-remove operation, if memory is removed
from boot memory area. To avoid this, ensure that memory from boot
memory area is not hot-removed when fadump is registered.

Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-28 13:08:09 +10:00
Russell Currey a4b48ba904 powerpc/powernv/pci: Add support for PHB4 diagnostics
As with P7IOC and PHB3, add kernel-side support for decoding and printing
diagnostic data for PHB4.

Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-27 12:14:27 +10:00
Michael Ellerman d6bd8194e2 powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user()
Larry Finger reported that his Powerbook G4 was no longer booting with v4.12-rc,
userspace was up but giving weird errors such as:

  udevd[64]: starting version 175
  udevd[64]: Unable to receive ctrl message: Bad address.
  modprobe: chdir(4.12-rc1): No such file or directory

He bisected the problem to commit 3448890c32 ("powerpc: get rid of zeroing,
switch to RAW_COPY_USER").

Al identified that the problem is actually a miscompilation by GCC 4.6.3, which
is exposed by the above commit.

Al also pointed out that inlining copy_to/from_user() is probably of little or
no benefit, which is correct. Using Anton's copy_to_user benchmark, with a
pathological single byte copy, we see a small increase in performance
by *removing* inlining:

  Before (inlined):
  # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
  real	0m22.063s
  real	0m22.059s
  real	0m22.076s

  After:
  # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
  real	0m21.325s
  real	0m21.299s
  real	0m21.364s

So as a small performance improvement and to avoid the miscompilation, drop
inlining copy_to/from_user() on 32-bit.

Fixes: 3448890c32 ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-26 23:25:08 +10:00
Balbir Singh 0428491cba powerpc/mm: Trace tlbie(l) instructions
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
Entry (Local)) instructions.

The tlbie instruction has changed over the years, so not all versions
accept the same operands. Use the ISA v3 field operands because they are
the most verbose, we may change them in future.

Example output:

  qemu-system-ppc-5371  [016]  1412.369519: tlbie:
  	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Add some missing trace_tlbie()s, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-23 21:14:49 +10:00
Thiago Jung Bauermann 3e401f7a2e powerpc: Only obtain cpu_hotplug_lock if called by rtasd
Calling arch_update_cpu_topology from a CPU hotplug state machine callback
hits a deadlock because the function tries to get a read lock on
cpu_hotplug_lock while the state machine still holds a write lock on it.

Since all callers of arch_update_cpu_topology except rtasd already hold
cpu_hotplug_lock, this patch changes the function to use
stop_machine_cpuslocked and creates a separate function for rtasd which
still tries to obtain the lock.

Michael Bringmann investigated the bug and provided a detailed analysis
of the deadlock on this previous RFC for an alternate solution:

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1497996510-4032-1-git-send-email-bauerman@linux.vnet.ibm.com
Link: https://patchwork.ozlabs.org/patch/771293/
2017-06-23 09:32:11 +02:00
Aravinda Prasad e20bbd3d8d KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled
Enhance KVM to cause a guest exit with KVM_EXIT_NMI
exit reason upon a machine check exception (MCE) in
the guest address space if the KVM_CAP_PPC_FWNMI
capability is enabled (instead of delivering a 0x200
interrupt to guest). This enables QEMU to build error
log and deliver machine check exception to guest via
guest registered machine check handler.

This approach simplifies the delivery of machine
check exception to guest OS compared to the earlier
approach of KVM directly invoking 0x200 guest interrupt
vector.

This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:

https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html

Note:

This patch now directly invokes machine_check_print_event_info()
from kvmppc_handle_exit_hv() to print the event to host console
at the time of guest exit before the exception is passed on to the
guest. Hence, the host-side handling which was performed earlier
via machine_check_fwnmi is removed.

The reasons for this approach is (i) it is not possible
to distinguish whether the exception occurred in the
guest or the host from the pt_regs passed on the
machine_check_exception(). Hence machine_check_exception()
calls panic, instead of passing on the exception to
the guest, if the machine check exception is not
recoverable. (ii) the approach introduced in this
patch gives opportunity to the host kernel to perform
actions in virtual mode before passing on the exception
to the guest. This approach does not require complex
tweaks to machine_check_fwnmi and friends.

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-22 11:24:57 +10:00
David S. Miller 3d09198243 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-21 17:35:22 -04:00
Aravinda Prasad 134764ed6e KVM: PPC: Book3S HV: Add new capability to control MCE behaviour
This introduces a new KVM capability to control how KVM behaves
on machine check exception (MCE) in HV KVM guests.

If this capability has not been enabled, KVM redirects machine check
exceptions to guest's 0x200 vector, if the address in error belongs to
the guest. With this capability enabled, KVM will cause a guest exit
with the exit reason indicating an NMI.

The new capability is required to avoid problems if a new kernel/KVM
is used with an old QEMU, running a guest that doesn't issue
"ibm,nmi-register".  As old QEMU does not understand the NMI exit
type, it treats it as a fatal error.  However, the guest could have
handled the machine check error if the exception was delivered to
guest's 0x200 interrupt vector instead of NMI exit in case of old
QEMU.

[paulus@ozlabs.org - Reworded the commit message to be clearer,
 enable only on HV KVM.]

Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-21 13:37:08 +10:00
Nicholas Piggin 8568f1e026 powerpc/64s/paca: EX_CTR is not used with RELOCATABLE=n, remove it
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:22:02 +10:00
Nicholas Piggin 635942ae53 powerpc/64s/paca: EX_R3 can be merged with EX_DAR
EX_R3 is used only for a small section of the bad stack handler.
Merge it with EX_DAR.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:22:01 +10:00
Nicholas Piggin dbeea1d6b4 powerpc/64s/paca: EX_LR can be merged with EX_DAR
EX_LR is used only for a small section of the SLB miss handler.
Merge it with EX_DAR.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:22:01 +10:00
Nicholas Piggin 36670fcf01 powerpc/64s/paca: EX_SRR0 is unused, remove it
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:22:00 +10:00
Nicholas Piggin 8c38851415 powerpc/64s: Add EX_SIZE definition for paca exception save areas
Rather than open-coding it 4 times.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Move __ASSEMBLY__ guards into head-64.h where they're really needed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-20 22:22:00 +10:00
Nicholas Piggin b51351e264 powerpc/64s/idle: Branch to handler with virtual mode offset
Have the system reset idle wakeup handlers branched to in real mode
with the 0xc... kernel address applied. This allows simplifications of
avoiding rfid when switching to virtual mode in the wakeup handler.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-19 19:46:28 +10:00
Nicholas Piggin a9af97aa0a powerpc/64s: msgclr when handling doorbell exceptions from system reset
msgsnd doorbell exceptions are cleared when the doorbell interrupt is
taken. However if a doorbell exception causes a system reset interrupt
wake from power saving state, the message is not cleared. Processing
the doorbell from the system reset interrupt requires msgclr to avoid
taking the exception again.

Testing this plus the previous wakup direct patch gives:

                                original         wakeup direct     msgclr
Different threads, same core:   315k/s           264k/s            345k/s
Different cores:                235k/s           242k/s            242k/s

Net speedup is +10% for same core, and +3% for different core.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-19 19:46:27 +10:00
Nicholas Piggin 771d4304d0 powerpc/64s/idle: Process interrupts from system reset wakeup
When the CPU wakes from low power state, it begins at the system reset
interrupt with the exception that caused the wakeup encoded in SRR1.

Today, powernv idle wakeup ignores the wakeup reason (except a special
case for HMI), and the regular interrupt corresponding to the
exception will fire after the idle wakeup exits.

Change this to replay the interrupt from the idle wakeup before
interrupts are hard-enabled.

Test on POWER8 of context_switch selftests benchmark with polling idle
disabled (e.g., always nap, giving cross-CPU IPIs) gives the following
results:

                                original         wakeup direct
Different threads, same core:   315k/s           264k/s
Different cores:                235k/s           242k/s

There is a slowdown for doorbell IPI (same core) case because system
reset wakeup does not clear the message and the doorbell interrupt
fires again needlessly.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-19 19:46:27 +10:00
Nicholas Piggin 2201f994a5 powerpc/64s/idle: Move soft interrupt mask logic into C code
This simplifies the asm and fixes irq-off tracing over sleep
instructions.

Also move powersave_nap check for POWER8 into C code, and move
PSSCR register value calculation for POWER9 into C.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-19 19:46:26 +10:00
Paul Mackerras 579006944e KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9
On POWER9, we no longer have the restriction that we had on POWER8
where all threads in a core have to be in the same partition, so
the CPU threads are now independent.  However, we still want to be
able to run guests with a virtual SMT topology, if only to allow
migration of guests from POWER8 systems to POWER9.

A guest that has a virtual SMT mode greater than 1 will expect to
be able to use the doorbell facility; it will expect the msgsndp
and msgclrp instructions to work appropriately and to be able to read
sensible values from the TIR (thread identification register) and
DPDES (directed privileged doorbell exception status) special-purpose
registers.  However, since each CPU thread is a separate sub-processor
in POWER9, these instructions and registers can only be used within
a single CPU thread.

In order for these instructions to appear to act correctly according
to the guest's virtual SMT mode, we have to trap and emulate them.
We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
register.  The emulation is triggered by the hypervisor facility
unavailable interrupt that occurs when the guest uses them.

To cause a doorbell interrupt to occur within the guest, we set the
DPDES register to 1.  If the guest has interrupts enabled, the CPU
will generate a doorbell interrupt and clear the DPDES register in
hardware.  The DPDES hardware register for the guest is saved in the
vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
exit code, other VCPUs wishing to cause a doorbell interrupt don't
write that field directly, but instead set a vcpu->arch.doorbell_request
flag.  This is consumed and set to 0 by the guest entry code, which
then sets DPDES to 1.

Emulating reads of the DPDES register is somewhat involved, because
it requires reading the doorbell pending interrupt status of all of the
VCPU threads in the virtual core, and if any of those VCPUs are
running, their doorbell status is only up-to-date in the hardware
DPDES registers of the CPUs where they are running.  In order to get
a reasonable approximation of the current doorbell status, we send
those CPUs an IPI, causing an exit from the guest which will update
the vcpu->arch.vcore->dpdes field.  We then use that value in
constructing the emulated DPDES register value.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:34:37 +10:00
Paul Mackerras 3c31352460 KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode
This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core.  Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering.  This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core.  With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.

On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore.  On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-06-19 14:34:20 +10:00