1
0
Fork 0
Commit Graph

213 Commits (ffed15d3ce3f710b94e6f402e1ca2318f7d7c0e2)

Author SHA1 Message Date
Gavin Shan caa58f8088 powerpc/powernv: Fix corrupted PE allocation bitmap on releasing PE
In pnv_ioda_free_pe(), the PE object (including the associated PE
number) is cleared before resetting the corresponding bit in the
PE allocation bitmap. It means PE#0 is always released to the bitmap
wrongly.

This fixes above issue by caching the PE number before the PE object
is cleared.

Fixes: 1e9167726c ("powerpc/powernv: Use PE instead of number during setup and release"
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-08 13:12:52 +10:00
Gavin Shan b314427a52 powerpc/powernv: Fix crash on releasing compound PE
The compound PE is created to accommodate the devices attached to
one specific PCI bus that consume multiple M64 segments. The compound
PE is made up of one master PE and possibly multiple slave PEs. The
slave PEs should be destroyed when releasing the master PE. A kernel
crash happens when derferencing @pe->pdev on releasing the slave PE
in pnv_ioda_deconfigure_pe().

  # echo 0 > /sys/bus/pci/slots/C7/power
  iommu: Removing device 0000:01:00.1 from group 0
  iommu: Removing device 0000:01:00.0 from group 0
  Unable to handle kernel paging request for data at address 0x00000010
  Faulting instruction address: 0xc00000000005d898
  cpu 0x1: Vector: 300 (Data Access) at [c000000fe8217620]
      pc: c00000000005d898: pnv_ioda_release_pe+0x288/0x610
      lr: c00000000005dbdc: pnv_ioda_release_pe+0x5cc/0x610
      sp: c000000fe82178a0
     msr: 9000000000009033
     dar: 10
   dsisr: 40000000
    current = 0xc000000fe815ab80
    paca    = 0xc00000000ff00400	 softe: 0	 irq_happened: 0x01
      pid   = 2709, comm = sh
  Linux version 4.8.0-rc5-gavin-00006-g745efdb (gwshan@gwshan) \
  (gcc version 4.9.3 (Buildroot 2016.02-rc2-00093-g5ea3bce) ) #586 SMP \
  Tue Sep 6 13:37:29 AEST 2016
  enter ? for help
  [c000000fe8217940] c00000000005d684 pnv_ioda_release_pe+0x74/0x610
  [c000000fe82179e0] c000000000034460 pcibios_release_device+0x50/0x70
  [c000000fe8217a10] c0000000004aba80 pci_release_dev+0x50/0xa0
  [c000000fe8217a40] c000000000704898 device_release+0x58/0xf0
  [c000000fe8217ac0] c000000000470510 kobject_release+0x80/0xf0
  [c000000fe8217b00] c000000000704dd4 put_device+0x24/0x40
  [c000000fe8217b20] c0000000004af94c pci_remove_bus_device+0x12c/0x150
  [c000000fe8217b60] c000000000034244 pci_hp_remove_devices+0x94/0xd0
  [c000000fe8217ba0] c0000000004ca444 pnv_php_disable_slot+0x64/0xb0
  [c000000fe8217bd0] c0000000004c88c0 power_write_file+0xa0/0x190
  [c000000fe8217c50] c0000000004c248c pci_slot_attr_store+0x3c/0x60
  [c000000fe8217c70] c0000000002d6494 sysfs_kf_write+0x94/0xc0
  [c000000fe8217cb0] c0000000002d50f0 kernfs_fop_write+0x180/0x260
  [c000000fe8217d00] c0000000002334a0 __vfs_write+0x40/0x190
  [c000000fe8217d90] c000000000234738 vfs_write+0xc8/0x240
  [c000000fe8217de0] c000000000236250 SyS_write+0x60/0x110
  [c000000fe8217e30] c000000000009524 system_call+0x38/0x108

It fixes the kernel crash by bypassing releasing resources (DMA,
IO and memory segments, PELTM) because there are no resources assigned
to the slave PE.

Fixes: c5f7700bbd ("powerpc/powernv: Dynamically release PE")
Reported-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-09-06 14:54:46 +10:00
Andrzej Hajda 6096481649 powerpc/powernv/pci: fix iterator signedness
Unsigned type is always non-negative, so the loop could not end in case
condition is never true.

The problem has been detected using semantic patch
scripts/coccinelle/tests/unsigned_lesser_than_zero.cocci

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2016-08-22 11:09:33 +10:00
Benjamin Herrenschmidt 5958d19a14 powerpc/pnv/pci: Fix incorrect PE reservation attempt on some 64-bit BARs
The generic allocation code may sometimes decide to assign a prefetchable
64-bit BAR to the M32 window. In fact it may also decide to allocate
a 64-bit non-prefetchable BAR to the M64 one ! So using the resource
flags as a test to decide which window was used for PE allocation is
just wrong and leads to insane PE numbers.

Instead, compare the addresses to figure it out.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rename the function as agreed by Ben & Gavin]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-09 19:51:47 +10:00
Alexey Kardashevskiy 4d9021957b powerpc/powernv/ioda: Fix TCE invalidate to work in real mode again
Commit fd141d1a99 ("powerpc/powernv/pci: Rework accessing the TCE
invalidate register") broke TCE invalidation on IODA2/PHB3 for real
mode.

This makes invalidate work again.

Fixes: fd141d1a99 ("powerpc/powernv/pci: Rework accessing the TCE invalidate register")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-09 14:50:19 +10:00
Krzysztof Kozlowski 00085f1efa dma-mapping: use unsigned long for dma_attrs
The dma-mapping core and the implementations do not change the DMA
attributes passed by pointer.  Thus the pointer can point to const data.
However the attributes do not have to be a bitfield.  Instead unsigned
long will do fine:

1. This is just simpler.  Both in terms of reading the code and setting
   attributes.  Instead of initializing local attributes on the stack
   and passing pointer to it to dma_set_attr(), just set the bits.

2. It brings safeness and checking for const correctness because the
   attributes are passed by value.

Semantic patches for this change (at least most of them):

    virtual patch
    virtual context

    @r@
    identifier f, attrs;

    @@
    f(...,
    - struct dma_attrs *attrs
    + unsigned long attrs
    , ...)
    {
    ...
    }

    @@
    identifier r.f;
    @@
    f(...,
    - NULL
    + 0
     )

and

    // Options: --all-includes
    virtual patch
    virtual context

    @r@
    identifier f, attrs;
    type t;

    @@
    t f(..., struct dma_attrs *attrs);

    @@
    identifier r.f;
    @@
    f(...,
    - NULL
    + 0
     )

Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Acked-by: Mark Salter <msalter@redhat.com> [c6x]
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 08:50:07 -04:00
Benjamin Herrenschmidt 08a45b320a powerpc/powernv/pci: Check status of a PHB before using it
If the firmware encounters an error (internal or HW) during initialization
of a PHB, it might leave the device-node in the tree but mark it disabled
using the "status" property. We should check it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:49 +10:00
Benjamin Herrenschmidt a1339faf72 powerpc/powernv/pci: Use the device-tree to get available range of M64's
M64's are the configurable 64-bit windows that cover the 64-bit MMIO
space. We used to hard code 16 windows. Newer chips might have a
variable number and might need to reserve some as well (for example
on PHB4/POWER9, M32 and M64 are actually unified and we use M64#0
to map the 32-bit space).

So newer OPALs will provide a property we can use to know what range
of windows is available. The property is named so that it can
eventually support multiple ranges but we only use the first one for
now.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:48 +10:00
Benjamin Herrenschmidt f0228c4130 powerpc/powernv/pci: Fallback to OPAL for TCE invalidations
If we don't find registers for the PHB or don't know the model
specific invalidation method, use OPAL calls instead.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:48 +10:00
Benjamin Herrenschmidt fd141d1a99 powerpc/powernv/pci: Rework accessing the TCE invalidate register
It's architected, always in a known place, so there is no need
to keep a separate pointer to it, we use the existing "regs",
and we complement it with a real mode variant.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

# Conflicts:
#	arch/powerpc/platforms/powernv/pci-ioda.c
#	arch/powerpc/platforms/powernv/pci.h
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:47 +10:00
Benjamin Herrenschmidt 08acce1cab powerpc/powernv/pci: Remove SWINV constants and obsolete TCE code
We have some obsolete code in pnv_pci_p7ioc_tce_invalidate()
to handle some internal lab tools that have stopped being
useful a long time ago. Remove that along with the definition
and test for the TCE_PCI_SWINV_* flags whose value is basically
always the same.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:47 +10:00
Benjamin Herrenschmidt a34ab7c328 powerpc/powernv/pci: Rename TCE invalidation calls
The TCE invalidation functions are fairly implementation specific,
and while the IODA specs more/less describe the register, in practice
various implementation workarounds may be required. So name the
functions after the target PHB.

Note today and for the foreseeable future, there's a 1:1 relationship
between an IODA version and a PHB implementation. There exist another
variant of IODA1 (Torrent) but we never supported in with OPAL and
never will.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:46 +10:00
Benjamin Herrenschmidt fb111334e4 powerpc/powernv: Discover IODA3 PHBs
We instanciate them as IODA2. We also change the MSI EOI hack
to only kick on PHB3 since it will not be needed on any new
implementation.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17 16:42:45 +10:00
Ian Munsie a2f67d5ee8 cxl: Add support for interrupts on the Mellanox CX4
The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where
interrupts are routed from the networking hardware to the XSL using the
MSIX table, and from there will be transformed back into an MSIX
interrupt using the cxl style interrupts (i.e. using IVTE entries and
ranges to map a PE and AFU interrupt number to an MSIX address).

We want to hide the implementation details of cxl interrupts as much as
possible. To this end, we use a special version of the MSI setup &
teardown routines in the PHB while in cxl mode to allocate the cxl
interrupts and configure the IVTE entries in the process element.

This function does not configure the MSIX table - the CX4 card uses a
custom format in that table and it would not be appropriate to fill that
out in generic code. The rest of the functionality is similar to the
"Full MSI-X mode" described in the CAIA, and this could be easily
extended to support other adapters that use that mode in the future.

The interrupts will be associated with the default context. If the
maximum number of interrupts per context has been limited (e.g. by the
mlx5 driver), it will automatically allocate additional kernel contexts
to associate extra interrupts as required. These contexts will be
started using the same WED that was used to start the default context.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-14 20:27:08 +10:00
Ian Munsie 4361b03430 powerpc/powernv: Add support for the cxl kernel api on the real phb
This adds support for the peer model of the cxl kernel api to the
PowerNV PHB, in which physical function 0 represents the cxl function on
the card (an XSL in the case of the CX4), which other physical functions
will use for memory access and interrupt services. It is referred to as
the peer model as these functions are peers of one another, as opposed
to the Virtual PHB model which forms a hierarchy.

This patch exports APIs to enable the peer mode, check if a PCI device
is attached to a PHB in this mode, and to set and get the peer AFU for
this mode.

The cxl driver will enable this mode for supported cards by calling
pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note
that this mode is enabled, and switch out it's controller_ops for the
cxl version.

The cxl version of the controller_ops struct implements it's own
versions of the enable_device_hook and release_device to handle
refcounting on the peer AFU and to allocate a default context for the
device.

Once enabled, the cxl kernel API may not be disabled on a PHB. Currently
there is no safe way to disable cxl mode short of a reboot, so until
that changes there is no reason to support the disable path.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-14 20:26:54 +10:00
Ian Munsie f456834a6c powerpc/powernv: Split cxl code out into a separate file
The support for using the Mellanox CX4 in cxl mode will require
additions to the PHB code. In preparation for this, move the existing
cxl code out of pci-ioda.c into a separate pci-cxl.c file to keep things
more organised.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-14 20:26:31 +10:00
Gavin Shan 9497a1c1c5 powerpc/powernv: Print correct PHB type names
We're initializing "IODA1" and "IODA2" PHBs though they are IODA2
and NPU PHBs as below kernel log indicates.

   Initializing IODA1 OPAL PHB /pciex@3fffe40700000
   Initializing IODA2 OPAL PHB /pciex@3fff000400000

This fixes the PHB names. After it's applied, we get:

   Initializing IODA2 PHB (/pciex@3fffe40700000)
   Initializing NPU PHB (/pciex@3fff000400000)

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:59 +10:00
Gavin Shan c5f7700bbd powerpc/powernv: Dynamically release PE
This supports releasing PEs dynamically. A reference count is
introduced to PE representing number of PCI devices associated
with the PE. The reference count is increased when PCI device
joins the PE and decreased when PCI device leaves the PE in
pnv_pci_release_device(). When the count becomes zero, the PE
and its consumed resources are released. Note that the count
is accessed concurrently. So a counter with "int" type is enough
here.

In order to release the sources consumed by the PE, couple of
helper functions are introduced as below:

   * pnv_pci_ioda1_unset_window() - Unset IODA1 DMA32 window
   * pnv_pci_ioda1_release_dma_pe() - Release IODA1 DMA32 segments
   * pnv_pci_ioda2_release_dma_pe() - Release IODA2 DMA resource
   * pnv_ioda_release_pe_seg() - Unmap IO/M32/M64 segments

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:55 +10:00
Gavin Shan 93e01a5039 powerpc/powernv: Make pnv_ioda_deconfigure_pe() visible
pnv_ioda_deconfigure_pe() is visible only when CONFIG_PCI_IOV is
enabled. The function will be used to tear down PE's associated
mapping in PCI hotplug path that doesn't depend on CONFIG_PCI_IOV.

This makes pnv_ioda_deconfigure_pe() visible and not depend on
CONFIG_PCI_IOV.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:55 +10:00
Gavin Shan 40e2a47e62 powerpc/powernv: Extend PCI bridge resources
The PCI slots are associated with root port or downstream ports
of the PCIe switch connected to root port. When adapter is hot
added to the PCI slot, it usually requests more IO or memory
resource from the directly connected parent bridge (port) and
update the bridge's windows accordingly. The resource windows
of upstream bridges can't be updated automatically. It possibly
leads to unbalanced resource across the bridges: The window of
downstream bridge is overruning that of upstream bridge. The
IO or MMIO path won't work.

This resolves the above issue by extending bridge windows of
root port and upstream port of the PCIe switch connected to
the root port to PHB's windows.

The windows of root port and bridge behind that are extended to
the PHB's windows to accomodate the PCI hotplug happening in
future. The PHB's 64KB 32-bits MSI region is included in bridge's
M32 windows (in hardware) though it's excluded in the corresponding
resource, as the bridge's M32 windows have 1MB as their minimal
alignment. We observed EEH error during system boot when the MSI
region is included in bridge's M32 window.

This excludes top 1MB (including 64KB 32-bits MSI region) region
from bridge's M32 windows when extending them.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:55 +10:00
Gavin Shan 63803c39c8 powerpc/powernv: Setup PE for root bus
There is no parent bridge for root bus, meaning pcibios_setup_bridge()
isn't invoked for root bus. The PE for root bus is the ancestor of
other PEs in PELTV. It means we need PE for root bus populated before
all others.

This populates the PE for root bus in pcibios_setup_bridge() path
if it's not populated yet. The PE number next to the reserved one
is used as the PE# to avoid holes in continuous M64 space.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:54 +10:00
Gavin Shan ccd1c1911a powerpc/powernv: Create PEs in pcibios_setup_bridge()
Currently, the PEs and their associated resources are assigned in
ppc_md.pcibios_fixup() except those used by SRIOV VFs. The function
is called for once after PCI probing and resources assignment is
completed. So it's obviously not hotplug friendly.

This creates PEs dynamically in pcibios_setup_bridge() that is
called for the event during system bootup and PCI hotplug: updating
PCI bridge's windows after resource assignment/reassignment are done.
In partial hotplug case, not all PCI devices included to one particular
PE are unplugged and plugged again, we just need unbinding/binding the
hot added PCI devices with the corresponding PE without creating new
one. The change is applied to IODA1 and IODA2 PHBs only. The behaviour
on NPU PHBs aren't changed. There are no PCI bridges on NPU PHBs,
meaning pcibios_setup_bridge() won't be invoked there. We have to use
old path (pnv_pci_ioda_fixup()) to setup PEs on NPU PHBs.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:54 +10:00
Gavin Shan 9fcd6f4a2b powerpc/powernv: Allocate PE# in reverse order
PE number for one particular PE can be allocated dynamically or
reserved according to the consumed M64 (64-bits prefetchable)
segments of the PE. The M64 segment can't be remapped to arbitrary
PE, meaning the PE number is determined according to the index
of the consumed M64 segment. As below figure shows, M64 resource
grows from low to high end, meaning the PE (number) reserved
according to M64 segment grows from low to high end as well,
so does the dynamically allocated PE number. It will lead to
conflict: PE number (M64 segment) reserved by dynamic allocation
is required by hot added PCI adapter at later point. It fails
the PCI hotplug because of the PE number can't be reserved
based on the index of the consumed M64 segment.

  +---+---+---+---+---+--------------------------------+-----+
  | 0 | 1 | 2 | 3 | 4 |      .......                   | 255 |
  +---+---+---+---+---+--------------------------------+-----+

  PE number for dynamic allocation          ----------------->
  PE number reserved for M64 segment        ----------------->

To resolve above conflicts, this forces the PE number to be
allocated dynamically in reverse order. With this patch applied,
the PE numbers are reserved in ascending order, but allocated
dynamically in reverse order.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:53 +10:00
Gavin Shan c127562ae1 powerpc/powernv: Increase PE# capacity
Each PHB maintains an array helping to translate 2-bytes Request
ID (RID) to PE# with the assumption that PE# takes one byte, meaning
that we can't have more than 256 PEs. However, pci_dn->pe_number
already had 4-bytes for the PE#.

This extends the PE# capacity for every PHB. After that, the PE number
is represented by 4-bytes value. Then we can reuse IODA_INVALID_PE to
check the PE# in phb->pe_rmap[] is valid or not.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:53 +10:00
Gavin Shan 577c8c8868 powerpc/powernv: Move pnv_pci_ioda_setup_opal_tce_kill() around
pnv_pci_ioda_setup_opal_tce_kill() called by pnv_ioda_setup_dma()
to remap the TCE kill regiter. What's done in pnv_ioda_setup_dma()
will be covered in pcibios_setup_bridge() which is invoked on each
PCI bridge. It means we will possibly remap the TCE kill register
for multiple times and it's unnecessary.

This moves pnv_pci_ioda_setup_opal_tce_kill() to where the PHB is
initialized (pnv_pci_init_ioda_phb()) to avoid above issue.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-21 15:30:53 +10:00
Ian Munsie b385c9e971 cxl: Add support for CAPP DMA mode
This adds support for using CAPP DMA mode, which is required for XSL
based cards such as the Mellanox CX4 to function.

This is currently an RFC as it depends on the corresponding support to
be merged into skiboot first, which was submitted here:
http://patchwork.ozlabs.org/patch/625582/

In the event that the skiboot on the system does not have the above
support, it will indicate as such in the kernel log and abort the init
process.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-16 23:10:26 +10:00
Michael Ellerman 027dfac694 powerpc: Various typo fixes
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-14 13:58:26 +10:00
Alexey Kardashevskiy 1d4e89cf0a powerpc/powernv/npu: Add PE to PHB's list
Before commit 3e68dc57 "powerpc/powernv: Remove DMA32 PE list", NPU PEs
were linked to the NPU PHB via phb->ioda.pe_dma_list; after that fix,
the phb->ioda.pe_list is used.

During the pe_dma_list removal, list_add_tail(&phb->ioda.pe_dma_list)
was removed, however no list_add() was added so does this patch.

Fixes: 3e68dc57219a ("powerpc/powernv: Remove DMA32 PE list")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-12 19:56:25 +10:00
Alexey Kardashevskiy 92a8675690 powerpc/powernv: Fix insufficient memory allocation
The pnv_pci_init_ioda_phb() helper allocates a blob to store auxilary
data such PE and M32/M64 segment allocation maps; this single blob has
few partitions, size of each is derived from the PE number -
phb->ioda.total_pe_num.

It was assumed that the minimum PE number is 8, however it is 4 for NPU
so the pe_alloc part was missing in the allocated blob. It was invisible
till recently as we were not tracking used M64 segments and NPUs do not
use M32 segments so the phb->ioda.m32_segmap (which was pointing to the
same address as phb->ioda.pe_alloc) has never been written to leaving
the pe_alloc memory intact.

After commit 401203ac2d "powerpc/powernv: Track M64 segment consumption"
the pe_alloc gets corrupted and PE allocation cannot work. This fixes
the issue by enforcing the minimum PE number to 8.

Fixes: 401203ac2d15 ("powerpc/powernv: Track M64 segment consumption")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-12 19:55:42 +10:00
Alexey Kardashevskiy b5cb9ab1a0 powerpc/powernv/npu: Enable NVLink pass through
IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of fast speed links (NVLink). The interface to links
is exposed as an emulated PCI bridge which is included into the same
IOMMU group as the corresponding GPU.

In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behave pretty much as the standard IODA2 PHB except NPU PHB has
just a single TVE in the hardware which means it can have either
32bit window or 64bit window or DMA bypass but never two of these.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.

This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.

This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
table_group, it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops.

This does not define NPU-specific .release_ownership() so after
VFIO is finished, DMA on NPU is disabled which is ok as the nvidia
driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU.

This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any found. The helper uses helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

As pnv_npu_set_window() and pnv_npu_unset_window() are started being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds the DMA window number parameter.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
[mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy 85674868ce powerpc/powernv/npu: Rework TCE Kill handling
The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previos patch by removing use of it as DMA setup is not what
the kernel would constantly do.

2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.

This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.

While we are here, add TCE cache invalidation after enabling bypass.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:31 +10:00
Alexey Kardashevskiy 7d623e4256 powerpc/powernv/ioda2: Export debug helper pe_level_printk()
This exports debugging helper pe_level_printk() and corresponding macroses
so they can be used in npu-dma.c.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:30 +10:00
Alexey Kardashevskiy f9f8345674 powerpc/powernv/npu: Simplify DMA setup
NPU devices are emulated in firmware and mainly used for NPU NVLink
training; one NPU device is per a hardware link. Their DMA/TCE setup
must match the GPU which is connected via PCIe and NVLink so any changes
to the DMA/TCE setup on the GPU PCIe device need to be propagated to
the NVLink device as this is what device drivers expect and it doesn't
make much sense to do anything else.

This makes NPU DMA setup explicit.
pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
made static and prints warning as dma_set_mask() should never be called
on this function as in any case it will not configure GPU; so we make
this explicit.

Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
remove), we test every PCI device if there are corresponding NVLink
devices. If there are any, we propagate bypass mode to just found NPU
devices by calling the setup helper directly (which takes @bypass) and
avoid guessing (i.e. calculating from DMA mask) whether we need bypass
or not on NPU devices. Since DMA setup happens in very rare occasion,
this will not slow down booting or VFIO start/stop much.

This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
more clear what the function really does which is programming 32bit
table address to the TVT ("disabling bypass" means writing zeroes to
the TVT).

This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
the DMA configuration on NPU does not matter until dma_set_mask() is
called on GPU and that will do the NPU DMA configuration.

This removes phb->dma_dev_setup initialization for NPU as
pnv_pci_ioda_dma_dev_setup is no-op for it anyway.

This stops using npe->tce_bypass_base as it never changes and values
other than zero are not supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:29 +10:00
Alexey Kardashevskiy 0bbcdb437d powerpc/powernv/npu: TCE Kill helpers cleanup
NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
so let's reuse the existing code for NPU. The only bit missing is
a helper to reset the entire TCE cache so this moves such a helper
from NPU code and renames it.

Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
This adds an explicit comment for workaround for invalidating NPU TCE
cache.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:29 +10:00
Alexey Kardashevskiy bef9253f55 powerpc/powernv: Define TCE Kill flags
This replaces magic constants for TCE Kill IODA2 register with macros.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:28 +10:00
Alexey Kardashevskiy a7cf13caad powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire
As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
the specific PE rather than the entire cache, rename it to
pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
a proper pnv_pci_ioda2_tce_invalidate_entire().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:28 +10:00
Gavin Shan 1e9167726c powerpc/powernv: Use PE instead of number during setup and release
In current implementation, the PEs that are allocated or picked
from the reserved list are identified by PE number. The PE instance
has to be picked according to the PE number eventually. We have
same issue when PE is released.

For pnv_ioda_pick_m64_pe() and pnv_ioda_alloc_pe(), this returns
PE instance so that pnv_ioda_setup_bus_PE() can use the allocated
or reserved PE instance directly. Also, pnv_ioda_setup_bus_PE()
returns the reserved/allocated PE instance to be used in subsequent
patches. On the other hand, pnv_ioda_free_pe() uses PE instance
(not number) as its argument. No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:23 +10:00
Gavin Shan 2b923ed1bd powerpc/powernv/ioda1: Improve DMA32 segment track
In current implementation, the DMA32 segments required by one specific
PE isn't calculated with the information hold in the PE independently.
It conflicts with the PCI hotplug design: PE centralized, meaning the
PE's DMA32 segments should be calculated from the information hold in
the PE independently.

This introduces an array (@dma32_segmap) for every PHB to track the
DMA32 segmeng usage. Besides, this moves the logic calculating PE's
consumed DMA32 segments to pnv_pci_ioda1_setup_dma_pe() so that PE's
DMA32 segments are calculated/allocated from the information hold in
the PE (DMA32 weight). Also the logic is improved: we try to allocate
as much DMA32 segments as we can. It's acceptable that number of DMA32
segments less than the expected number are allocated.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:22 +10:00
Gavin Shan 801846d1de powerpc/powernv: Remove DMA32 PE list
PEs are put into PHB DMA32 list (phb->ioda.pe_dma_list) according
to their DMA32 weight. The PEs on the list are iterated to setup
their TCE32 tables at system booting time. The list is used for
once at boot time and no need to keep it.

This moves the logic calculating DMA32 weight of PHB and PE to
pnv_ioda_setup_dma() to drop PHB's DMA32 list. Also, every PE
traces the consumed DMA32 segment by @tce32_seg and @tce32_segcount
are useless and they're removed.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:22 +10:00
Gavin Shan acce971c0e powerpc/powernv/ioda1: Introduce PNV_IODA1_DMA32_SEGSIZE
Currently, there is one macro (TCE32_TABLE_SIZE) representing the
TCE table size for one DMA32 segment. The constant representing
the DMA32 segment size (1 << 28) is still used in the code.

This defines PNV_IODA1_DMA32_SEGSIZE representing one DMA32
segment size. the TCE table size can be calcualted when the page
has fixed 4KB size. So all the related calculation depends on one
macro (PNV_IODA1_DMA32_SEGSIZE). No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:21 +10:00
Gavin Shan b30d936f6f powerpc/powernv/ioda1: Rename pnv_pci_ioda_setup_dma_pe()
This renames pnv_pci_ioda_setup_dma_pe() to pnv_pci_ioda1_setup_dma_pe()
as it's the counter-part of IODA2's pnv_pci_ioda2_setup_dma_pe().
No logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:21 +10:00
Gavin Shan 9945155143 powerpc/powernv/ioda1: M64 support on P7IOC
This enables M64 window on P7IOC, which has been enabled on PHB3.
Different from PHB3 where 16 M64 BARs are supported and each of
them can be owned by one particular PE# exclusively or divided
evenly to 256 segments, every P7IOC PHB has 16 M64 BARs and each
of them are divided to 8 segments. So every P7IOC PHB supports
128 M64 segments in total. P7IOC has M64DT, which helps mapping
one particular M64 segment# to arbitrary PE#. PHB3 doesn't have
M64DT, indicating that one M64 segment can only be pinned to the
fixed PE#.

In order to unified M64 support M64 on P7IOC and PHB3, we just
provide 128 M64 segments on every P7IOC PHB and each of them is
pinned to the fixed PE# by bypassing the function of M64DT. In
turn, we just need different phb->init_m64() for P7IOC and PHB3
and maps M64 segment in pnv_ioda_reserve_m64_pe() for P7IOC, most
of the code are shared by them.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:20 +10:00
Gavin Shan c430670ad1 powerpc/powernv: Rename M64 related functions
This renames those functions picking PE number based on consumed
M64 segments, mapping M64 segments to PEs as those functions are
going to be shared by IODA1/IODA2 in next patch. No logical changes
introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:20 +10:00
Gavin Shan 93289d8c08 powerpc/powernv: Track M64 segment consumption
When unplugging PCI devices, their parent PEs might be offline.
The consumed M64 resource by the PEs should be released at that
time. As we track M32 segment consumption, this introduces an
array to the PHB to track the mapping between M64 segment and
PE number.

Note: M64 mapping isn't covered by pnv_ioda_setup_pe_seg() as
IODA2 doesn't support the mapping explicitly while it's supported
on IODA1. Until now, no M64 is supported on IODA1 in software.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:19 +10:00
Gavin Shan 69d733e72d powerpc/powernv: IO and M32 mapping based on PCI device resources
Currently, the IO and M32 segments are mapped to the corresponding
PE based on the windows of the parent bridge of PE's primary bus.
It's not going to work when the windows of root port or upstream
port of the PCIe switch behind root port are extended to PHB's
apertures in order to support hotplug in subsequent patch.

This fixes the issue by mapping IO and M32 segments based on the
resources of the PCI devices included in the PE, instead of the
windows of the parent bridge of the PE's primary bus.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:19 +10:00
Gavin Shan 23e79425fe powerpc/powernv: Simplify pnv_ioda_setup_pe_seg()
pnv_ioda_setup_pe_seg() associates the IO and M32 segments with the
owner PE. The code mapping segments should be fixed and immune from
logic changes introduced to pnv_ioda_setup_pe_seg().

This moves the code mapping segments to helper pnv_ioda_setup_pe_res().
The data type for @rc is changed to "int64_t". Also, argument @hose is
removed from pnv_ioda_setup_pe() as it can be got from @pe. No functional
changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:18 +10:00
Gavin Shan 3fa23ff8ff powerpc/powernv: Fix initial IO and M32 segmap
There are two arrays for IO and M32 segment maps on every PHB.
The index of the arrays are segment number and the value stored
in the corresponding element is PE number, indicating the segment
is assigned to the PE. Initially, all elements in those two arrays
are zeroes, meaning all segments are assigned to PE#0. It's wrong.

This fixes the initial values in the elements of those two arrays
to IODA_INVALID_PE, meaning all segments aren't assigned to any
PE.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:18 +10:00
Gavin Shan 689ee8c95f powerpc/powernv: Data type unsigned int for PE number
This changes the data type of PE number from "int" to "unsigned int"
in order to match the fact PE number is never negative:

   * The number of PE to which the specified PCI device is attached.
   * The PE number map for SRIOV VFs.
   * The returned PE number from pnv_ioda_alloc_pe().
   * The returned PE number from pnv_ioda2_pick_m64_pe().

Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:17 +10:00
Gavin Shan 92b8f137b3 powerpc/powernv: Rename PE# fields in struct pnv_phb
This renames the fields related to PE number in "struct pnv_phb"
for better reflecting of their usages as Alexey suggested. No
logical changes introduced.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:17 +10:00
Gavin Shan 475d92c27f powerpc/powernv: Drop phb->bdfn_to_pe()
The last usage of pnv_phb::bdfn_to_pe() was removed in
ff57b454dd ("powerpc/eeh: Do probe on pci_dn"), so drop it.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:16 +10:00