1
0
Fork 0
Commit Graph

479 Commits (cc18b2f4f3f1d7ed3125ac1840794f9feab0325c)

Author SHA1 Message Date
Linus Torvalds de399813b5 powerpc updates for 4.10
Highlights include:
 
  - Support for the kexec_file_load() syscall, which is a prereq for secure and
    trusted boot.
 
  - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
 
  - Sort the exception tables at build time, to save time at boot, and store
    them as relative offsets to save space in the kernel image & memory.
 
  - Allow building the kernel with thin archives, which should allow us to build
    an allyesconfig once some other fixes land.
 
  - Build fixes to allow us to correctly rebuild when changing the kernel endian
    from big to little or vice versa.
 
  - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
 
  - Initial stack protector support (-fstack-protector).
 
  - Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
 
  - Fix an oops in cxl coredump generation when cxl_get_fd() is used.
 
  - Freescale updates from Scott: "Highlights include 8xx hugepage support,
    qbman fixes/cleanup, device tree updates, and some misc cleanup."
 
  - Many and varied fixes and minor enhancements as always.
 
 Thanks to:
   Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
   Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
   Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
   Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
   Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
   Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
   Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
   Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
 mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
 JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
 WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
 rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
 kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
 9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
 Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
 Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
 vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
 snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
 Z2W/m3gxafnOeGgBqvyv
 =xOzf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights include:

   - Support for the kexec_file_load() syscall, which is a prereq for
     secure and trusted boot.

   - Prevent kernel execution of userspace on P9 Radix (similar to
     SMEP/PXN).

   - Sort the exception tables at build time, to save time at boot, and
     store them as relative offsets to save space in the kernel image &
     memory.

   - Allow building the kernel with thin archives, which should allow us
     to build an allyesconfig once some other fixes land.

   - Build fixes to allow us to correctly rebuild when changing the
     kernel endian from big to little or vice versa.

   - Plumbing so that we can avoid doing a full mm TLB flush on P9
     Radix.

   - Initial stack protector support (-fstack-protector).

   - Support for dumping the radix (aka. Linux) and hash page tables via
     debugfs.

   - Fix an oops in cxl coredump generation when cxl_get_fd() is used.

   - Freescale updates from Scott: "Highlights include 8xx hugepage
     support, qbman fixes/cleanup, device tree updates, and some misc
     cleanup."

   - Many and varied fixes and minor enhancements as always.

  Thanks to:
    Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
    Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
    Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
    Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
    Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
    Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
    Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
    Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"

[ And thanks to Michael, who took time off from a new baby to get this
  pull request done.   - Linus ]

* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
  powerpc/fsl/dts: add FMan node for t1042d4rdb
  powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
  powerpc/fsl/dts: add QMan and BMan nodes on t1024
  powerpc/fsl/dts: add QMan and BMan nodes on t1023
  soc/fsl/qman: test: use DEFINE_SPINLOCK()
  powerpc/fsl-lbc: use DEFINE_SPINLOCK()
  powerpc/8xx: Implement support of hugepages
  powerpc: get hugetlbpage handling more generic
  powerpc: port 64 bits pgtable_cache to 32 bits
  powerpc/boot: Request no dynamic linker for boot wrapper
  soc/fsl/bman: Use resource_size instead of computation
  soc/fsl/qe: use builtin_platform_driver
  powerpc/fsl_pmc: use builtin_platform_driver
  powerpc/83xx/suspend: use builtin_platform_driver
  powerpc/ftrace: Fix the comments for ftrace_modify_code
  powerpc/perf: macros for power9 format encoding
  powerpc/perf: power9 raw event format encoding
  powerpc/perf: update attribute_group data structure
  powerpc/perf: factor out the event format field
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
  ...
2016-12-16 09:26:42 -08:00
Linus Torvalds 0ab7b12c49 pci-v4.10-changes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
 mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
 fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
 v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
 iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
 kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
 eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
 1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
 wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
 juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
 84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
 tcfJPEcz1019PKZNSvWy
 =ut0k
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes:

   - add support for PCI on ARM64 boxes with ACPI. We already had this
     for theoretical spec-compliant hardware; now we're adding quirks
     for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)

   - add runtime PM support for hotplug ports

   - enable runtime suspend for Intel UHCI that uses platform-specific
     wakeup signaling

   - add yet another host bridge registration interface. We hope this is
     extensible enough to subsume the others

   - expose device revision in sysfs for DRM

   - to avoid device conflicts, make sure any VF BAR updates are done
     before enabling the VF

   - avoid unnecessary link retrains for ASPM

   - allow INTx masking on Mellanox devices that support it

   - allow access to non-standard VPD for Chelsio devices

   - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
     etc

   - update Rockchip support for max-link-speed

   - add NVIDIA Tegra210 support

   - add Layerscape LS1046a support

   - update R-Car compatibility strings

   - add Qualcomm MSM8996 support

   - remove some uninformative bootup messages"

* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
  PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
  PCI: Expand "VPD access disabled" quirk message
  PCI: pciehp: Remove loading message
  PCI: hotplug: Remove hotplug core message
  PCI: Remove service driver load/unload messages
  PCI/AER: Log AER IRQ when claiming Root Port
  PCI/AER: Log errors with PCI device, not PCIe service device
  PCI/AER: Remove unused version macros
  PCI/PME: Log PME IRQ when claiming Root Port
  PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
  PCI: Move config space size macros to pci_regs.h
  x86/platform/intel-mid: Constify mid_pci_platform_pm
  PCI/ASPM: Don't retrain link if ASPM not possible
  PCI: iproc: Skip check for legacy IRQ on PAXC buses
  PCI: pciehp: Leave power indicator on when enabling already-enabled slot
  PCI: pciehp: Prioritize data-link event over presence detect
  PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
  PCI: rcar: Use gen2 fallback compatibility last
  PCI: rcar-gen2: Use gen2 fallback compatibility last
  PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
  ..
2016-12-15 12:46:48 -08:00
Lorenzo Stoakes 5b56d49fc3 mm: add locked parameter to get_user_pages_remote()
Patch series "mm: unexport __get_user_pages_unlocked()".

This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.

It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality.  This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.

Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)

This patch (of 2):

Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().

[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
  Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:08 -08:00
Wang Sheng-Hui cc10385b6f PCI: Move config space size macros to pci_regs.h
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.

[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2016-12-12 10:05:22 -06:00
Kirti Wankhede 2b8bb1d771 vfio iommu type1: Fix size argument to vfio_find_dma() in pin_pages/unpin_pages
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges.
In functions pin_pages and unpin_pages, the iova which is being searched
is base address of page to be pinned or unpinned. So here size should be
set to PAGE_SIZE, as argument to vfio_find_dma().

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-06 12:35:53 -07:00
Kirti Wankhede 7c03f42846 vfio iommu type1: Fix size argument to vfio_find_dma() during DMA UNMAP.
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(),
while checking boundary conditions, size should be set to 1 for verifying
start address of unmap range.
vfio_find_dma() is also used to verify last address in unmap range with
size = 0, but in that case address to be searched is calculated with
start + size - 1 and so it works correctly.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
[aw: changelog tweak]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-06 12:28:04 -07:00
Kirti Wankhede 3cedd7d75f vfio iommu type1: WARN_ON if notifier block is not unregistered
mdev vendor driver should unregister the iommu notifier since the vfio
iommu can persist beyond the attachment of the mdev group. WARN_ON will
show warning if vendor driver doesn't unregister the notifier and is
forced to follow the implementations steps.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-05 16:04:33 -07:00
Alexey Kardashevskiy 4b6fad7097 powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.

This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:34 +11:00
Alexey Kardashevskiy bc82d122ae vfio/spapr: Reference mm in tce_container
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.

DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.

This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:33 +11:00
Alexey Kardashevskiy d9c728949d vfio/spapr: Postpone default window creation
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.

At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.

This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:32 +11:00
Alexey Kardashevskiy 6f01cc692a vfio/spapr: Add a helper to create default DMA window
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.

Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.

This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:31 +11:00
Alexey Kardashevskiy 39701e56f5 vfio/spapr: Postpone allocation of userspace version of TCE table
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.

As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.

This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:30 +11:00
Alexey Kardashevskiy d7baee6901 powerpc/iommu: Stop using @current in mm_iommu_xxx
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02 14:38:29 +11:00
Jike Song ccd46dbae7 vfio: support notifier chain in vfio_group
Beyond vfio_iommu events, users might also be interested in
vfio_group events. For example, if a vfio_group is used along
with Qemu/KVM, whenever kvm pointer is set to/cleared from the
vfio_group, users could be notified.

Currently only VFIO_GROUP_NOTIFY_SET_KVM supported.

Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: remove use of new typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 10:40:05 -07:00
Jike Song 22195cbd34 vfio: vfio_register_notifier: classify iommu notifier
Currently vfio_register_notifier assumes that there is only one
notifier chain, which is in vfio_iommu. However, the user might
also be interested in events other than vfio_iommu, for example,
vfio_group. Refactor vfio_{un}register_notifier implementation
to make it feasible.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: merge with commit 816ca69ea9c7 ("vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'"), remove typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 09:38:47 -07:00
Christophe JAILLET d256459fae vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'
'vfio_group_get_from_dev()' seems to return only NULL on error, not an
error pointer.

Fixes: 2169037dc3 ("vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops")
Fixes: c086de818d ("vfio iommu: Add blocking notifier to notify DMA_UNMAP")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-01 08:45:49 -07:00
Eric Auger 5ba6de98c7 vfio: fix vfio_info_cap_add/shift
Capability header next field is an offset relative to the start of
the INFO buffer. tmp->next is assigned the proper value but iterations
implemented in vfio_info_cap_add and vfio_info_cap_shift use next
as an offset between the headers. When coping with multiple capabilities
this leads to an Oops.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-21 11:51:53 -07:00
Cao jin f4cb410019 vfio/pci: Drop unnecessary pcibios_err_to_errno()
As of commit d97ffe2368 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.

pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.

[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-18 11:06:42 -07:00
Kirti Wankhede 8e1c5a4048 docs: Add Documentation for Mediated devices
Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede a1e03e9bcc vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
Updated vfio_platform_common.c file to use
vfio_set_irqs_validate_and_prepare()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede ef198aaa16 vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
Updated vfio_pci.c file to use vfio_set_irqs_validate_and_prepare()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede c747f08aea vfio: Introduce vfio_set_irqs_validate_and_prepare()
Vendor driver using mediated device framework would use same mechnism to
validate and prepare IRQs. Introducing this function to reduce code
replication in multiple drivers.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede c535d34569 vfio_pci: Update vfio_pci to use vfio_info_add_capability()
Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
Update region type capability to use vfio_info_add_capability()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede b3c0a866f1 vfio: Introduce common function to add capabilities
Vendor driver using mediated device framework should use
vfio_info_add_capability() to add capabilities.
Introduced this function to reduce code duplication in vendor drivers.

vfio_info_cap_shift() manipulated a data buffer to add an offset to each
element in a chain. This data buffer is documented in a uapi header.
Changing vfio_info_cap_shift symbol to be available to all drivers.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:20 -07:00
Kirti Wankhede c086de818d vfio iommu: Add blocking notifier to notify DMA_UNMAP
Added blocking notifier to IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs vfio_register_notifier() and vfio_unregister_notifier().
Notifier should be registered, if external user wants to use
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
Vendor driver should use VFIO_IOMMU_NOTIFY_DMA_UNMAP action to invalidate
mappings.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:33:07 -07:00
Kirti Wankhede a54eb55045 vfio iommu type1: Add support for mediated devices
VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
Mediated device only uses IOMMU APIs, the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- Domain for external user is tracked separately in vfio_iommu structure.
  It is allocated when group for first mdev device is attached.
- Pages pinned for external domain are tracked in each vfio_dma structure
  for that iova range.
- Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
  rb-tree is iova, but it actually aims to track pfns.
- On external pin request for an iova, page is pinned once, if iova is
  already pinned and tracked, ref_count is incremented.
- External unpin request unpins pages only when ref_count is 0.
- Pinned pages list is used to find pfn from iova and then unpin it.
  WARN_ON is added if there are entires in pfn_list while detaching the
  group and releasing the domain.
- Page accounting is updated to account in its address space where the
  pages are pinned/unpinned, i.e dma->task
-  Accouting for mdev device is only done if there is no iommu capable
  domain in the container. When there is a direct device assigned to the
  container and that domain is iommu capable, all pages are already pinned
  during DMA_MAP.
- Page accouting is updated on hot plug and unplug mdev device and pass
  through device.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
  exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
  exist

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:11 -07:00
Kirti Wankhede 8f0d5bb95f vfio iommu type1: Add task structure to vfio_dma
Add task structure to vfio_dma structure. Task structure is used for:
- During DMA_UNMAP, same task who mapped it or other task who shares same
address space is allowed to unmap, otherwise unmap fails.
QEMU maps few iova ranges initially, then fork threads and from the child
thread calls DMA_UNMAP on previously mapped iova. Since child shares same
address space, DMA_UNMAP is successful.
- Avoid accessing struct mm while process is exiting by acquiring
reference of task's mm during page accounting.
- It is also used to get task mlock capability and rlimit for mlock.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:09 -07:00
Kirti Wankhede 7896c998f0 vfio iommu type1: Add find_iommu_group() function
Add find_iommu_group()

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:06 -07:00
Kirti Wankhede ea85cf353e vfio iommu type1: Update argument of vaddr_get_pfn()
Update arguments of vaddr_get_pfn() to take struct mm_struct *mm as input
argument.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:03 -07:00
Kirti Wankhede 3624a2486c vfio iommu type1: Update arguments of vfio_lock_acct
Added task structure as input argument to vfio_lock_acct() function.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:25:01 -07:00
Kirti Wankhede 2169037dc3 vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops
Added APIs for pining and unpining set of pages. These call back into
backend iommu module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. Backend
IOMMU module that supports pining and unpinning pages for mdev devices
should provide these functions.

Renamed static functions in vfio_type1_iommu.c to resolve conflicts

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:58 -07:00
Kirti Wankhede 32f55d835b vfio: Common function to increment container_users
This change rearrange functions to have common function to increment
container_users

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:55 -07:00
Kirti Wankhede 7ed3ea8a71 vfio: Rearrange functions to get vfio_group from dev
This patch rearranges functions to get vfio_group from device

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:52 -07:00
Kirti Wankhede fa3da00cb8 vfio: VFIO based driver for Mediated devices
vfio_mdev driver registers with mdev core driver.
mdev core driver creates mediated device and calls probe routine of
vfio_mdev driver for each device.
Probe routine of vfio_mdev driver adds mediated device to VFIO core module

This driver forms a shim layer that pass through VFIO devices operations
to vendor driver for mediated devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:50 -07:00
Kirti Wankhede 7b96953bc6 vfio: Mediated device Core driver
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

Mediated bus driver for mdev device should use this interface to register
and unregister with core driver respectively:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

Mediated bus driver is responsible to add/delete mediated devices to/from
VFIO group when devices are bound and unbound to the driver.

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in its driver. APIs are :

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported type. This is
			 mandatory field.
* create: to allocate basic resources in vendor driver for a mediated
         device. This is mandatory to be provided by vendor driver.
* remove: to free resources in vendor driver when mediated device is
         destroyed. This is mandatory to be provided by vendor driver.
* open: open callback of mediated device
* release: release callback of mediated device
* read : read emulation callback.
* write: write emulation callback.
* ioctl: ioctl callback.
* mmap: mmap emulation callback.

Drivers should use these interfaces to register and unregister device to
mdev core driver respectively:

extern int  mdev_register_device(struct device *dev,
                                 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks to serialize above callbacks in mdev driver and
vfio_mdev driver. If required, vendor driver can have locks to serialize
above APIs in their driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17 08:24:48 -07:00
Vlad Tsyrklevich 05692d7005 vfio/pci: Fix integer overflows, bitmask check
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().

Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.

Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-10-26 13:49:29 -06:00
Christoph Hellwig 61771468e0 vfio_pci: use pci_alloc_irq_vectors
Simplify the interrupt setup by using the new PCI layer helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-29 13:36:38 -06:00
Alex Williamson c93a97ee05 vfio-pci: Disable INTx after MSI/X teardown
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-26 13:52:19 -06:00
Alex Williamson ddf9dc0eb5 vfio-pci: Virtualize PCIe & AF FLR
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state.  This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR.  Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit.  We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup.  We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.

This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
2016-09-26 13:52:16 -06:00
Baoyou Xie 2e06285655 vfio: platform: mark symbols static where possible
We get a few warnings when building kernel with W=1:
drivers/vfio/platform/vfio_platform_common.c:76:5: warning: no previous prototype for 'vfio_platform_acpi_call_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:98:6: warning: no previous prototype for 'vfio_platform_acpi_has_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:640:5: warning: no previous prototype for 'vfio_platform_of_probe' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_amdxgbe.c:59:5: warning: no previous prototype for 'vfio_platform_amdxgbe_reset' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c:60:5: warning: no previous prototype for 'vfio_platform_calxedaxgmac_reset' [-Wmissing-prototypes]
....

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.

Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-13 16:11:37 -06:00
Wei Jiangang 8138dabbab vfio/pci: Fix typos in comments
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-08-29 12:39:09 -06:00
Alex Williamson c8952a7075 vfio/pci: Fix NULL pointer oops in error interrupt setup handling
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type.  The other data types assume the data pointer contains
one element of their type as well.

In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users.  Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely.  For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
2016-08-08 16:16:23 -06:00
Sinan Kaya 0991bbdbf5 vfio: platform: check reset call return code during release
Release call is ignoring the return code from reset call and can
potentially continue even though reset call failed.

If reset_required module parameter is set, this patch is going
to validate the return code and will cause stack dump with
WARN_ON and warn the user of failure.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:54:45 -06:00
Sinan Kaya e99442323f vfio: platform: check reset call return code during open
Open call is ignoring the return code from reset call and can
potentially continue even though reset call failed.

If reset_required module parameter is set, this patch is going
to validate the return code and will abort open if reset fails.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:54:45 -06:00
Sinan Kaya b5add544d6 vfio, platform: make reset driver a requirement by default
The code was allowing platform devices to be used without a supporting
VFIO reset driver. The hardware can be left in some inconsistent state
after a guest machine abort.

The reset driver will put the hardware back to safe state and disable
interrupts before returning the control back to the host machine.

Adding a new reset_required kernel module option to platform VFIO drivers.
The default value is true for the DT and ACPI based drivers.
The reset requirement value for AMBA drivers is set to false and is
unchangeable to maintain the existing functionality.

New requirements are:
1. A reset function needs to be implemented by the corresponding driver
via DT/ACPI.
2. The reset function needs to be discovered via DT/ACPI.

The probe of the driver will fail if any of the above conditions are
not satisfied.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:46 -06:00
Sinan Kaya d30daa33ec vfio: platform: call _RST method when using ACPI
The device tree code checks for the presence of a reset driver and calls
the of_reset function pointer by looking up the reset driver as a module.

ACPI defines _RST method to perform device level reset. After the _RST
method is executed, the OS can resume using the device. _RST method is
expected to stop DMA transfers and IRQs.

This patch introduces two functions as vfio_platform_acpi_has_reset and
vfio_platform_acpi_call_reset. The has reset method is used to declare
reset capability via the ioctl flag VFIO_DEVICE_FLAGS_RESET. The call
reset function is used to execute the _RST ACPI method.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:44 -06:00
Sinan Kaya 5afec27474 vfio: platform: add extra debug info argument to call reset
Getting ready to bring out extra debug information to the caller
so that more verbose information can be printed when an error is
observed.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:43 -06:00
Sinan Kaya a12a9368e1 vfio: platform: add support for ACPI probe
The code is using the compatible DT string to associate a reset driver
with the actual device itself. The compatible string does not exist on
ACPI based systems. HID is the unique identifier for a device driver
instead.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:41 -06:00
Sinan Kaya dc5542fb11 vfio: platform: determine reset capability
Creating a new function to determine if this driver supports reset
function or not. This is an attempt to abstract device tree calls
from the rest of the code.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:40 -06:00
Sinan Kaya f084aa7495 vfio: platform: move reset call to a common function
The reset call sequence seems to replicate itself multiple times
across the file. Grouping them together for maintenance reasons.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:38 -06:00
Sinan Kaya 7aef80cf31 vfio: platform: rename reset function
Renaming the reset function to of_reset as it is only used
by the device tree based platforms.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-19 10:26:36 -06:00
Ilya Lesokhin d370c917b9 vfio: fix possible use after free of vfio group
The vfio group should be released after
the vfio_group_try_dissolve_container call.
The code should not rely on someone else to hold
a reference on the group.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-14 14:28:16 -06:00
Yongji Xie 05f0c03fba vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs. This will cause some
performance issues when we passthrough a PCI device with
this kind of BARs. Guest will be not able to handle the mmio
accesses to the BARs which leads to mmio emulations in host.

However, not all sub-page BARs will share page with other BARs.
We should allow to mmap the sub-page MMIO BARs which we can
make sure will not share page with other BARs.

This patch adds support for this case. And we try to add a
dummy resource to reserve the remainder of the page which
hot-add device's BAR might be assigned into. But it's not
necessary to handle the case when the BAR is not page aligned.
Because we can't expect the BAR will be assigned into the same
location in a page in guest when we passthrough the BAR. And
it's hard to access this BAR in userspace because we have
no way to get the BAR's location in a page.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-07-08 10:06:04 -06:00
Peng Fan 9698cbf0be vfio: platform: support No-IOMMU mode
The vfio No-IOMMU mode was supported by this
'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
but it only support vfio-pci.

Using vfio_iommu_group_get/put, but not iommu_group_get/put,
the platform devices can be exposed to userspace with
CONFIG_VFIO_NOIOMMU and the "enable_unsafe_noiommu_mode"
option enabled.

From 'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
"This should make it very clear that this mode is not safe.
Additionally, CAP_SYS_RAWIO privileges are necessary to work
with groups and containers using this mode.  Groups making
use of this support are named /dev/vfio/noiommu-$GROUP and
can only make use of the special VFIO_NOIOMMU_IOMMU for the
container.  Use of this mode, specifically binding a device
without a native IOMMU group to a VFIO bus driver will taint
the kernel and should therefore not be considered supported."

Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Eric Auger <eric.auger@linaro.org>
Cc: Baptiste Reynal <b.reynal@virtualopensystems.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-06-23 09:37:17 -06:00
Alex Williamson ce7585f3c4 vfio/pci: Allow VPD short read
The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes.  Zero our buffer and
accept anything other than an error.  Intel X710 NICs exercise this.

Fixes: 4e1a635552 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-31 21:25:52 -06:00
Alex Williamson 089f1c6b2d vfio/type1: Fix build warning
This function cannot actually be called with npage = 0, so in practice
this doesn't return an uninitialized value.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-30 07:58:10 -06:00
Alex Williamson 956b56a984 vfio/pci: Fix ordering of eventfd vs virqfd shutdown
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the
trigger eventfd before calling vfio_virqfd_disable() any potential
mask and unmask eventfds.  This opens a use-after-free race where an
inopportune irqfd can reference the freed signalling eventfd.  Reorder
to avoid this possibility.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-30 07:50:10 -06:00
Linus Torvalds 48dd7cefa0 VFIO updates for v4.7-rc1
- Hide INTx on certain known broken devices (Alex Williamson)
  - Additional backdoor reset detection (Alex Williamson)
  - Remove unused iommudata reference (Alexey Kardashevskiy)
  - Use cfg_size to avoid probing extended config space (Alexey Kardashevskiy)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJXRc8hAAoJECObm247sIsigWAP/R+q+UnOHXd7cvSKpI33p2IP
 ur09xwV9GktQViWz4tclFdfk8cScq86UX8I5e/jkeFVCXT7e5FU7FhKqJi7dPbp1
 Yh3g5fGUkk6a0B3gebIq3qY02aLwJF17WZg0R/KHkQT0B/yQuTQqdF37Xh9rGv0S
 c7sgQn1X5nmO7jVDYYKO3SwEgmZWAcBz4Ht2XpqCdSrmhsse9OxlnmOkieM5BNUz
 rQhnziaeJE/Ya+y/A74XicbGforyThvNyJs/anJnPEE89773SGVQy/Jdlz4Lwji7
 XImPuj4AT9duTgX9HaD38xIpFOsAKDfZ6sClsICkIvhbs232UXuiMxcPszDA97c0
 7MxSJcLVr9fKB6zatq2JWhGDp7C6ylUxapEI9PFCV6gE5OYZRM7KD+hm32oVnuPv
 rSzPoqnm0Sudu8SO6n46QUZAifp+mX9MNhqzkXGR/YlBHhB1L3/QcczyekY0eBbj
 vJ2htebrg0qNQn4G9n4ygMm19r53ew/Q+pO2y7y4TdOr+gZNW1Wj7uOezMpvDOOB
 hiy+HkJ24MCTfAGNgjpjjCot/o608+QT5H+y8SR7vT1IK4shOSE4rYfvo6jlzRQp
 9FRTolGhYqrtih+zB5R7eLghtvlDp4lN0gDCSgWGHM7e2rMLxaomzoF1VNQ8eKJZ
 iUJZ8jE2QrdGDpKssRuF
 =fGnT
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Hide INTx on certain known broken devices (Alex Williamson)

 - Additional backdoor reset detection (Alex Williamson)

 - Remove unused iommudata reference (Alexey Kardashevskiy)

 - Use cfg_size to avoid probing extended config space (Alexey
   Kardashevskiy)

* tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio:
  vfio_pci: Test for extended capabilities if config space > 256 bytes
  vfio_iommu_spapr_tce: Remove unneeded iommu_group_get_iommudata
  vfio/pci: Add test for BAR restore
  vfio/pci: Hide broken INTx support from user
2016-05-25 09:47:26 -07:00
Linus Torvalds c04a588029 powerpc updates for 4.7
Highlights:
  - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
  - Live patching support for ppc64le (also merged via livepatching.git)
 
 Various cleanups & minor fixes from:
  - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
    Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
    Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
    Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
    Rothberg, Vipin K Parashar.
 
 General:
  - Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
  - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
  - Add support for userspace Power9 copy/paste from Chris Smart
  - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
  - Add mask of possible MMU features from Michael Ellerman
 
 PCI:
  - Enable pass through of NVLink to guests from Alexey Kardashevskiy
  - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
  - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
  - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
  - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
  - Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
 
 selftests:
  - Test cp_abort during context switch from Chris Smart
  - Add several tests for transactional memory support from Rashmica Gupta
 
 perf:
  - Add support for sampling interrupt register state from Anju T
  - Add support for unwinding perf-stackdump from Chandan Kumar
 
 cxl:
  - Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
  - Allow initialization on timebase sync failures from Frederic Barrat
  - Increase timeout for detection of AFU mmio hang from Frederic Barrat
  - Handle num_of_processes larger than can fit in the SPA from Ian Munsie
  - Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
  - Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
  - Check periodically the coherent platform function's state from Christophe Lombard
 
 Freescale:
  - Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
    workaround, and a kconfig dependency fix."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
 lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
 tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
 /AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
 iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
 /ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
 DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
 YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
 B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
 zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
 19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
 udQES+t3F/9gvtxgxtDe
 =YvyQ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
   - Live patching support for ppc64le (also merged via livepatching.git)

  Various cleanups & minor fixes from:
   - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
     Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
     Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
     Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
     Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
     Bauermann, Valentin Rothberg, Vipin K Parashar.

  General:
   - Update LMB associativity index during DLPAR add/remove from Nathan
     Fontenot
   - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
   - Add support for userspace Power9 copy/paste from Chris Smart
   - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
   - Add mask of possible MMU features from Michael Ellerman

  PCI:
   - Enable pass through of NVLink to guests from Alexey Kardashevskiy
   - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
   - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
   - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
   - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
     from Guilherme G Piccoli
   - Remove the dependency on EEH struct in DDW mechanism from Guilherme
     G Piccoli

  selftests:
   - Test cp_abort during context switch from Chris Smart
   - Add several tests for transactional memory support from Rashmica
     Gupta

  perf:
   - Add support for sampling interrupt register state from Anju T
   - Add support for unwinding perf-stackdump from Chandan Kumar

  cxl:
   - Configure the PSL for two CAPI ports on POWER8NVL from Philippe
     Bergheaud
   - Allow initialization on timebase sync failures from Frederic Barrat
   - Increase timeout for detection of AFU mmio hang from Frederic
     Barrat
   - Handle num_of_processes larger than can fit in the SPA from Ian
     Munsie
   - Ensure PSL interrupt is configured for contexts with no AFU IRQs
     from Ian Munsie
   - Add kernel API to allow a context to operate with relocate disabled
     from Ian Munsie
   - Check periodically the coherent platform function's state from
     Christophe Lombard

  Freescale:
   - Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
     an erratum workaround, and a kconfig dependency fix."

* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
  powerpc/86xx: Fix PCI interrupt map definition
  powerpc/86xx: Move pci1 definition to the include file
  powerpc/fsl: Fix build of the dtb embedded kernel images
  powerpc/fsl: Fix rcpm compatible string
  powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
  powerpc/fsl-pci: Add a workaround for PCI 5 errata
  powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
  powerpc/powernv/npu: Add PE to PHB's list
  powerpc/powernv: Fix insufficient memory allocation
  powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
  Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
  powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
  powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
  powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
  powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
  Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
  powerpc/powernv/npu: Enable NVLink pass through
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Add set/unset window helpers
  powerpc/powernv/ioda2: Export debug helper pe_level_printk()
  ...
2016-05-20 10:12:41 -07:00
Alexey Kardashevskiy f705528094 vfio_pci: Test for extended capabilities if config space > 256 bytes
PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.

However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.

VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.

This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-19 15:04:40 -06:00
Alexey Kardashevskiy 54de285beb vfio/spapr: Relax the IOMMU compatibility check
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11 21:54:27 +10:00
Robin Murphy d16e0faab9 iommu: Allow selecting page sizes per domain
Many IOMMUs support multiple page table formats, meaning that any given
domain may only support a subset of the hardware page sizes presented in
iommu_ops->pgsize_bitmap. There are also certain use-cases where the
creator of a domain may want to control which page sizes are used, for
example to force the use of hugepage mappings to reduce pagetable walk
depth.

To this end, add a per-domain pgsize_bitmap to represent the subset of
page sizes actually in use, to make it possible for domains with
different requirements to coexist.

Signed-off-by: Will Deacon <will.deacon@arm.com>
[rm: hijacked and rebased original patch with new commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2016-05-09 15:33:29 +02:00
Alexey Kardashevskiy 5ed4aba126 vfio_iommu_spapr_tce: Remove unneeded iommu_group_get_iommudata
This removes iommu_group_get_iommudata() as the result is never used.
As this is a minor cleanup, no change in behavior is expected.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-04-28 11:12:41 -06:00
Alex Williamson dc92810997 vfio/pci: Add test for BAR restore
If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming.  Add an additional test to check the
BARs on each write to the command register.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-04-28 11:12:33 -06:00
Alex Williamson 450744051d vfio/pci: Hide broken INTx support from user
INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx.  This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported.  The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register.  With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.

Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration.  This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.

One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion.  In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source.  For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support.  By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.

Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way.  We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
2016-04-28 11:12:27 -06:00
Linus Torvalds 45cb5230f8 VFIO updates for v4.6-rc1
Various enablers for assignment of Intel graphics devices and future
 support of vGPU devices (Alex Williamson).  This includes
 
  - Handling the vfio type1 interface as an API rather than a specific
    implementation, allowing multiple type1 providers.
 
  - Capability chains, similar to PCI device capabilities, that allow
    extending ioctls.  Extensions here include device specific regions
    and sparse mmap descriptions.  The former is used to expose non-PCI
    regions for IGD, including the OpRegion (particularly the Video
    BIOS Table), and read only PCI config access to the host and LPC
    bridge as drivers often depend on identifying those devices.
    Sparse mmaps here are used to describe the MSIx vector table,
    which vfio has always protected from mmap, but never had an API to
    explicitly define that protection.  In future vGPU support this is
    expected to allow the description of PCI BARs that may mix direct
    access and emulated access within a single region.
 
  - The ability to expose the shadow ROM as an option ROM as IGD use
    cases may rely on the ROM even though the physical device does not
    make use of a PCI option ROM BAR.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJW6aT1AAoJECObm247sIsiiP4P/1xf7Z08/2QWVFQzex9CLcZk
 +/iJlyb/fTpPVQE+NTKPz3Qh5h6ZhSd/57s85IUqq0T6tgVPkoGx8kkyCjBaw2y1
 yMezXZlQqJdZqGzQNI4OiHWvO+/vGxYKjQMfUnMlDM6dJgz4lGncGFoSouFPa3Vp
 mB12hGxrlk1cfIdb+C1KbfZcEdS0WhtigQtz8flBKgOfO+hYWmUO+CClJBhVw8Z4
 RNcWNAxFfLuwUPVsPb6uOLG2g65SC2vmQ9k0Tnknf1znV3PFFVjITf0aM6uChLNP
 S3SgqtPX+6yOFyCuSEs8UKhhmCbeQmAyKgt5BpxV3Rw3OMP4PsVAehr82vQmSj6g
 2o96pR2s8MDPBr8eG7gdRe4DQe3PonpLkpDfaghcpYqhkGEqNVeW5/GjiOzGQqD3
 xMshzxJ1Iz7DOHkQRUVqOfupDB0TusJmTVKwvXe6yIYL9pjkUS/sbN9U563HYSES
 JTV68TMj0VKfKwD3XKYXvGH3km1sL4i5NMlAUrsDtsMkGlXEswoGbj82Mjc8+jUo
 BvWQTJb+kouJQ88VhsO2abg1UrO9E6u82iHFHy9fEObxE8KH7pvROlS93ihMT1Wv
 WQNuUcltdpHMRVX0BDknaPs3YtC3/TGgm3RcU5SZPbv/ys1471ZmJxMlAAKcfITr
 SuvkMTYElF5b1pigv46c
 =/lJn
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 "Various enablers for assignment of Intel graphics devices and future
  support of vGPU devices (Alex Williamson).  This includes

   - Handling the vfio type1 interface as an API rather than a specific
     implementation, allowing multiple type1 providers.

   - Capability chains, similar to PCI device capabilities, that allow
     extending ioctls.  Extensions here include device specific regions
     and sparse mmap descriptions.  The former is used to expose non-PCI
     regions for IGD, including the OpRegion (particularly the Video
     BIOS Table), and read only PCI config access to the host and LPC
     bridge as drivers often depend on identifying those devices.

     Sparse mmaps here are used to describe the MSIx vector table, which
     vfio has always protected from mmap, but never had an API to
     explicitly define that protection.  In future vGPU support this is
     expected to allow the description of PCI BARs that may mix direct
     access and emulated access within a single region.

   - The ability to expose the shadow ROM as an option ROM as IGD use
     cases may rely on the ROM even though the physical device does not
     make use of a PCI option ROM BAR"

* tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio:
  vfio/pci: return -EFAULT if copy_to_user fails
  vfio/pci: Expose shadow ROM as PCI option ROM
  vfio/pci: Intel IGD host and LCP bridge config space access
  vfio/pci: Intel IGD OpRegion support
  vfio/pci: Enable virtual register in PCI config space
  vfio/pci: Add infrastructure for additional device specific regions
  vfio: Define device specific region type capability
  vfio/pci: Include sparse mmap capability for MSI-X table regions
  vfio: Define sparse mmap capability for regions
  vfio: Add capability chain helpers
  vfio: Define capability chains
  vfio: If an IOMMU backend fails, keep looking
  vfio/pci: Fix unsigned comparison overflow
2016-03-17 13:05:09 -07:00
Michael S. Tsirkin 8160c4e455 vfio: fix ioctl error handling
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.

Fix up vfio to do
	return copy_to_user(...)) ?
		-EFAULT : 0;

everywhere.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-28 07:38:52 -07:00
Dan Carpenter c4aec31013 vfio/pci: return -EFAULT if copy_to_user fails
The copy_to_user() function returns the number of bytes that were not
copied but we want to return -EFAULT on error here.

Fixes: 188ad9d6cb ('vfio/pci: Include sparse mmap capability for MSI-X table regions')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-25 21:48:42 -07:00
Alex Williamson a13b645917 vfio/pci: Expose shadow ROM as PCI option ROM
Integrated graphics may have their ROM shadowed at 0xc0000 rather than
implement a PCI option ROM.  Make this ROM appear to the user using
the ROM BAR.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson f572a960a1 vfio/pci: Intel IGD host and LCP bridge config space access
Provide read-only access to PCI config space of the PCI host bridge
and LPC bridge through device specific regions.  This may be used to
configure a VM with matching register contents to satisfy driver
requirements.  Providing this through the vfio file descriptor removes
an additional userspace requirement for access through pci-sysfs and
removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to
the specific devices we're accessing.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson 5846ff54e8 vfio/pci: Intel IGD OpRegion support
This is the first consumer of vfio device specific resource support,
providing read-only access to the OpRegion for Intel graphics devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson 345d710491 vfio/pci: Enable virtual register in PCI config space
Typically config space for a device is mapped out into capability
specific handlers and unassigned space.  The latter allows direct
read/write access to config space.  Sometimes we know about registers
living in this void space and would like an easy way to virtualize
them, similar to how BAR registers are managed.  To do this, create
one more pseudo (fake) PCI capability to be handled as purely virtual
space.  Reads and writes are serviced entirely from virtual config
space.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson 28541d41c9 vfio/pci: Add infrastructure for additional device specific regions
Add support for additional regions with indexes started after the
already defined fixed regions.  Device specific code can register
these regions with the new vfio_pci_register_dev_region() function.
The ops structure per region currently only includes read/write
access and a release function, allowing automatic cleanup when the
device is closed.  mmap support is only missing here because it's
not needed by the first user queued for this support.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson 188ad9d6cb vfio/pci: Include sparse mmap capability for MSI-X table regions
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this.  Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:09 -07:00
Alex Williamson d7a8d5ed87 vfio: Add capability chain helpers
Allow sub-modules to easily reallocate a buffer for managing
capability chains for info ioctls.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:08 -07:00
Alex Williamson 7c435b46c2 vfio: If an IOMMU backend fails, keep looking
Consider an IOMMU to be an API rather than an implementation, we might
have multiple implementations supporting the same API, so try another
if one fails.  The expectation here is that we'll really only have
one implementation per device type.  For instance the existing type1
driver works with any PCI device where the IOMMU API is available.  A
vGPU vendor may have a virtual PCI device which provides DMA isolation
and mapping through other mechanisms, but can re-use userspaces that
make use of the type1 VFIO IOMMU API.  This allows that to work.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:10:08 -07:00
Alex Williamson b95d9305e8 vfio/pci: Fix unsigned comparison overflow
Signed versus unsigned comparisons are implicitly cast to unsigned,
which result in a couple possible overflows.  For instance (start +
count) might overflow and wrap, getting through our validation test.
Also when unwinding setup, -1 being compared as unsigned doesn't
produce the intended stop condition.  Fix both of these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.

Reported-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22 16:03:54 -07:00
Alex Williamson 16ab8a5cbe vfio/noiommu: Don't use iommu_present() to track fake groups
Using iommu_present() to determine whether an IOMMU group is real or
fake has some problems.  First, apparently Power systems don't
register an IOMMU on the device bus, so the groups and containers get
marked as noiommu and then won't bind to their actual IOMMU driver.
Second, I expect we'll run into the same issue as we try to support
vGPUs through vfio, since they're likely to emulate this behavior of
creating an IOMMU group on a virtual device and then providing a vfio
IOMMU backend tailored to the sort of isolation they provide, which
won't necessarily be fully compatible with the IOMMU API.

The solution here is to use the existing iommudata interface to IOMMU
groups, which allows us to easily identify the fake groups we've
created for noiommu purposes.  The iommudata we set is purely
arbitrary since we're only comparing the address, so we use the
address of the noiommu switch itself.

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <sshukla@mvista.com>
Fixes: 03a76b60f8 ("vfio: Include No-IOMMU mode")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-01-27 11:22:25 -07:00
Pierre Morel d4f50ee2f5 vfio/iommu_type1: make use of info.flags
The flags entry is there to tell the user that some
optional information is available.

Since we report the iova_pgsizes signal it to the user
by setting the flags to VFIO_IOMMU_INFO_PGSIZES.

Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-01-04 12:55:44 -07:00
Alex Williamson 03a76b60f8 vfio: Include No-IOMMU mode
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-12-21 15:28:11 -07:00
Dan Carpenter 967628827f VFIO: platform: reset: fix a warning message condition
This loop ends with count set to -1 and not zero so the warning message
isn't printed when it should be.  I've fixed this by change the postop
to a preop.

Fixes: 0990822c98 ('VFIO: platform: reset: AMD xgbe reset module')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-21 15:28:11 -07:00
Alex Williamson ae5515d663 Revert: "vfio: Include No-IOMMU mode"
Revert commit 033291eccb ("vfio: Include No-IOMMU mode") due to lack
of a user.  This was originally intended to fill a need for the DPDK
driver, but uptake has been slow so rather than support an unproven
kernel interface revert it and revisit when userspace catches up.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-04 08:38:42 -07:00
Dan Carpenter 049af1060b vfio: fix a warning message
The first argument to the WARN() macro has to be a condition.  I'm sort
of disappointed that this code doesn't generate a compiler warning.  I
guess -Wformat-extra-args doesn't work in the kernel.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-21 06:55:58 -07:00
Kees Cook 7200be7c81 vfio: platform: remove needless stack usage
request_module already takes format strings, so no need to duplicate
the effort.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-20 09:00:10 -07:00
Julia Lawall 7d10f4e079 vfio-pci: constify pci_error_handlers structures
This pci_error_handlers structure is never modified, like all the other
pci_error_handlers structures, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-19 16:53:07 -07:00
Krzysztof Kozlowski 7fe9414223 vfio: Drop owner assignment from platform_driver
platform_driver does not need to set an owner because
platform_driver_register() will set it.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-19 15:52:08 -07:00
Linus Torvalds 934f98d7e8 VFIO updates for v4.4-rc1
- Use kernel interfaces for VPD emulation (Alex Williamson)
  - Platform fix for releasing IRQs (Eric Auger)
  - Type1 IOMMU always advertises PAGE_SIZE support when smaller
    mapping sizes are available (Eric Auger)
  - Platform fixes for incorrectly using copies of structures rather
    than pointers to structures (James Morse)
  - Rework platform reset modules, fix leak, and add AMD xgbe reset
    module (Eric Auger)
  - Fix vfio_device_get_from_name() return value (Joerg Roedel)
  - No-IOMMU interface (Alex Williamson)
  - Fix potential out of bounds array access in PCI config handling
    (Dan Carpenter)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWRg8+AAoJECObm247sIsitkwP+wc6cRzBpeGxiufZl9Ci7JV9
 G0qNHBm8tAYDwm1uJATcyZC303ad2B3gaYSO6msTFUXTg3d9ZDtUOWhoTNPs/px1
 5dF38DREzqYHpyC9HT2Qj3i9G+ejvgg+SoyxBOTOIw/dmq3tGZhUz1Sj+wuvQTwr
 XXBKsWltZ+8wwXmQXmOWI4L3m7Xhs8NwAl5iLJ3UiltpW9zZzuPtoKnQCfMYUcmh
 hIJg52t0WPSLyn47UvecUQcqxaO+QYELa7UN84fnQAihk+ewMpzg5blTebayFdu3
 f2WC3ivxbamebw74LaRfQFjx4mT+DI0aXYtraC600PVe7gdVXB66QMNNpPhBwAy5
 wpfeFpTKU5gC+LHmrIMUS2/A4sdNfUBw44CS8+Lm2D6bQAblPv/C5xQV1rz9HADv
 f4/D3Y0TUKSYArewtBHTC0mnXdkZwetttBoy6/zQBl8vkelhoJ3GPcVa8FEZCIuT
 2MSS17I3ftJ1enfynicF+Wstn/H5lWcuRBdg5wTLHIuhFn6MiEVxfIuSEx9JfjIb
 NGZO7y5JiJ0b5QRCG0tFznwceU/cql/3oRqOGXqaf1cQ1Ag3JOIAUzxknFoJQUj1
 XYe+Im1eMaugjj39J3+m5EYNKT3nh/bBLD/V3iWYpgoZtQrmQQm5nu0JsQo88/JR
 je0BuJioCuPlO/Wj/KYw
 =bI62
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - Use kernel interfaces for VPD emulation (Alex Williamson)
 - Platform fix for releasing IRQs (Eric Auger)
 - Type1 IOMMU always advertises PAGE_SIZE support when smaller mapping
   sizes are available (Eric Auger)
 - Platform fixes for incorrectly using copies of structures rather than
   pointers to structures (James Morse)
 - Rework platform reset modules, fix leak, and add AMD xgbe reset
   module (Eric Auger)
 - Fix vfio_device_get_from_name() return value (Joerg Roedel)
 - No-IOMMU interface (Alex Williamson)
 - Fix potential out of bounds array access in PCI config handling (Dan
   Carpenter)

* tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio:
  vfio/pci: make an array larger
  vfio: Include No-IOMMU mode
  vfio: Fix bug in vfio_device_get_from_name()
  VFIO: platform: reset: AMD xgbe reset module
  vfio: platform: reset: calxedaxgmac: fix ioaddr leak
  vfio: platform: add dev_info on device reset
  vfio: platform: use list of registered reset function
  vfio: platform: add compat in vfio_platform_device
  vfio: platform: reset: calxedaxgmac: add reset function registration
  vfio: platform: introduce module_vfio_reset_handler macro
  vfio: platform: add capability to register a reset function
  vfio: platform: introduce vfio-platform-base module
  vfio/platform: store mapped memory in region, instead of an on-stack copy
  vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
  VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
  vfio/pci: Use kernel VPD access functions
  vfio: Whitelist PCI bridges
2015-11-13 17:05:32 -08:00
Dan Carpenter 222e684ca7 vfio/pci: make an array larger
Smatch complains about a possible out of bounds error:

	drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
	error: buffer overflow 'pci_cap_length' 20 <= 20

The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements.  The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements.  Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba959 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.

Let's fix this by making the array size PCI_CAP_ID_MAX + 1.  And let's
make a similar change to pci_ext_cap_length[] for consistency.  Also
both these arrays can be made const.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-09 08:59:11 -07:00
Alex Williamson 033291eccb vfio: Include No-IOMMU mode
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-11-04 09:56:16 -07:00
Joerg Roedel e324fc82ea vfio: Fix bug in vfio_device_get_from_name()
The vfio_device_get_from_name() function might return a
non-NULL pointer, when called with a device name that is not
found in the list. This causes undefined behavior, in my
case calling an invalid function pointer later on:

 kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
 BUG: unable to handle kernel paging request at ffff8800cb3ddc08

[...]

 Call Trace:
  [<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
  [<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0
  [<ffffffff811f9657>] ? __fget+0x77/0xb0
  [<ffffffff811efeb9>] SyS_ioctl+0x79/0x90
  [<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130
  [<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75

Fix the issue by returning NULL when there is no device with
the requested name in the list.

Cc: stable@vger.kernel.org # v4.2+
Fixes: 4bc94d5dc9 ("vfio: Fix lockdep issue")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-04 09:27:39 -07:00
Eric Auger 0990822c98 VFIO: platform: reset: AMD xgbe reset module
This patch introduces a module that registers and implements a low-level
reset function for the AMD XGBE device.

it performs the following actions:
- reset the PHY
- disable auto-negotiation
- disable & clear auto-negotiation IRQ
- soft-reset the MAC

Those tiny pieces of code are inherited from the native xgbe driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:21 -07:00
Eric Auger daac3bbedb vfio: platform: reset: calxedaxgmac: fix ioaddr leak
In the current code the vfio_platform_region is copied on the stack.
As a consequence the ioaddr address is not iounmapped in the vfio
platform driver (vfio_platform_regions_cleanup). The patch uses the
pointer to the region instead.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:06 -07:00
Eric Auger 705e60bae3 vfio: platform: add dev_info on device reset
It might be helpful for the end-user to check the device reset
function was found by the vfio platform reset framework.

Lets store a pointer to the struct device in vfio_platform_device
and trace when the reset function is called or not found.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:05 -07:00
Eric Auger e9e0506ee6 vfio: platform: use list of registered reset function
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.

We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.

This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.

Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:03 -07:00
Eric Auger 0628c4dfd3 vfio: platform: add compat in vfio_platform_device
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:02 -07:00
Eric Auger 680742f644 vfio: platform: reset: calxedaxgmac: add reset function registration
This patch adds the reset function registration/unregistration.
This is handled through the module_vfio_reset_handler macro. This
latter also defines a MODULE_ALIAS which simplifies the load from
vfio-platform.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:00 -07:00
Eric Auger 588646529f vfio: platform: introduce module_vfio_reset_handler macro
The module_vfio_reset_handler macro
- define a module alias
- implement module init/exit function which respectively registers
  and unregisters the reset function.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:59 -07:00
Eric Auger e086497d31 vfio: platform: add capability to register a reset function
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
vfio_platform_register/unregister_reset. Those are not yet used in
this patch.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:57 -07:00
Eric Auger 32a2d71c4e vfio: platform: introduce vfio-platform-base module
To prepare for vfio platform reset rework let's build
vfio_platform_common.c and vfio_platform_irq.c in a separate
module from vfio-platform and vfio-amba. This makes possible
to have separate module inits and works around a race between
platform driver init and vfio reset module init: that way we
make sure symbols exported by base are available when vfio-platform
driver gets probed.

The open/release being implemented in the base module, the ref
count is applied to the parent module instead.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:56 -07:00
James Morse 1b4bb2eaa9 vfio/platform: store mapped memory in region, instead of an on-stack copy
vfio_platform_{read,write}_mmio() call ioremap_nocache() to map
a region of io memory, which they store in struct vfio_platform_region to
be eventually re-used, or unmapped by vfio_platform_regions_cleanup().

These functions receive a copy of their struct vfio_platform_region
argument on the stack - so these mapped areas are always allocated, and
always leaked.

Pass this argument as a pointer instead.

Fixes: 6e3f264560 "vfio/platform: read and write support for the device fd"
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:00 -07:00
Eric Auger 4644321fd3 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.

This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:53:53 -07:00
Eric Auger 1276ece32c VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
The vfio platform driver currently sets the IRQ_NOAUTOEN before
doing the request_irq to properly handle the user masking. However
it does not clear it when de-assigning the IRQ. This brings issues
when loading the native driver again which may not explicitly enable
the IRQ. This problem was observed with xgbe driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-10-27 15:02:53 -06:00
Alex Williamson 4e1a635552 vfio/pci: Use kernel VPD access functions
The PCI VPD capability operates on a set of window registers in PCI
config space.  Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register.  The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers.  Additionally, commits like 932c435cab ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel.  We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
2015-10-27 14:53:05 -06:00
Alex Williamson 5f096b14d4 vfio: Whitelist PCI bridges
When determining whether a group is viable, we already allow devices
bound to pcieport.  Generalize this to include any PCI bridge device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-10-27 14:53:04 -06:00
Feng Wu 6d7425f109 vfio: Register/unregister irq_bypass_producer
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:50 +02:00
Alex Williamson 4bc94d5dc9 vfio: Fix lockdep issue
When we open a device file descriptor, we currently have the
following:

vfio_group_get_device_fd()
  mutex_lock(&group->device_lock);
    open()
    ...
    if (ret)
      release()

If we hit that error case, we call the backend driver release path,
which for vfio-pci looks like this:

vfio_pci_release()
  vfio_pci_disable()
    vfio_pci_try_bus_reset()
      vfio_pci_get_devs()
        vfio_device_get_from_dev()
          vfio_group_get_device()
            mutex_lock(&group->device_lock);

Whoops, we've stumbled back onto group.device_lock and created a
deadlock.  There's a low likelihood of ever seeing this play out, but
obviously it needs to be fixed.  To do that we can use a reference to
the vfio_device for vfio_group_get_device_fd() rather than holding the
lock.  There was a loop in this function, theoretically allowing
multiple devices with the same name, but in practice we don't expect
such a thing to happen and the code is already aborting from the loop
with break on any sort of error rather than continuing and only
parsing the first match anyway, so the loop was effectively unused
already.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 20f300175a ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
Reported-by: Joerg Roedel <joro@8bytes.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
2015-07-24 15:14:04 -06:00
Linus Torvalds b779157dd3 VFIO updates for v4.2
- Fix race with device reference versus driver release (Alex Williamson)
 - Add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)
 - Enable vfio-platform for ARM64 (Eric Auger)
 - Tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVjaiUAAoJECObm247sIsiRJQP/RiqcYg9uDGbBt8Gi+Qi2BmG
 KrVqG/ty18CZgyis01GZa5O8bL2ViQFAwfpnIJ6l5v4IQAYUUPlfy6Gy9ZgotQir
 U5ot4ABT9KGqJyjkDixvfUmg5M1gj1euO/xC+KVMpnxN0tC1wbymtAahzd3/RXyq
 5cef/cran5bwnh2RtAN/rFkbjZBuqppVZJJV7uyOd69HXhktrZw5skQssL/61+vT
 znww9H6WT/Qk60haQrcD+SVqAaPv6Tx8bu1LVZ1Zc/rWVxJ2HisOp5dTxGeRnaSu
 PtMfOndgtYwe5fBaYyNjWluJLpBzY7b5UvHtsZ52hPXvlgkKSMrfCJxM4BQQSekW
 txmeV6LpQsm+Nq0+sIOFph8h/LzZaebvMsc5YnLamixGb/mAxbbr31yGjokTuS4F
 ndKUR+9/mFrdcqeBT4UD1Yt45YHl7bRZ9AWKJIC1TstrTW+1rEZ49r71PLea6y4f
 SJJ+eCvhYr/eebvrsfePqzeo1X+S8HNTgCXrSlqZdohS21qtnsh4g/CaOxI2UVt8
 zaZosxNQLzpNK/Z8s7MZpCMenWZsq2frcckROeEgNKdPntwkbVT4MG3LBrWvtuBU
 HGAKfWCGAjfIIN8ZBUSFtIqua/kXCcec99CcYYQOtv16UbmhadAPoUlMur3HI9T9
 S0vWtIKTsRa8ymyrGBDp
 =/gWA
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - fix race with device reference versus driver release (Alex Williamson)

 - add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)

 - enable vfio-platform for ARM64 (Eric Auger)

 - tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)

* tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio:
  MAINTAINERS: Add vfio-platform sub-maintainer
  VFIO: platform: enable ARM64 build
  VFIO: platform: Calxeda xgmac reset module
  VFIO: platform: populate the reset function on probe
  VFIO: platform: add reset callback
  VFIO: platform: add reset struct and lookup table
  vfio/pci: Fix racy vfio_device_get_from_dev() call
2015-06-28 12:32:13 -07:00
Linus Torvalds 08d183e3c1 powerpc updates for 4.2
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
    toolchain.
  - EEH fixes from Gavin & Richard.
  - Enable the sys_kcmp syscall from Laurent.
  - Sysfs control for fastsleep workaround from Shreyas.
  - Expose OPAL events as an irq chip by Alistair.
  - MSI ops moved to pci_controller_ops by Daniel.
  - Fix for kernel to userspace backtraces for perf from Anton.
  - Merge pseries and pseries_le defconfigs from Cyril.
  - CXL in-kernel API from Mikey.
  - OPAL prd driver from Jeremy.
  - Fix for DSCR handling & tests from Anshuman.
  - Powernv flash mtd driver from Cyril.
  - Dynamic DMA Window support on powernv from Alexey.
  - LLVM clang fixes & workarounds from Anton.
  - Reworked version of the patch to abort syscalls when transactional.
  - Fix the swap encoding to support 4TB, from Aneesh.
  - Various fixes as usual.
  - Freescale updates from Scott: Highlights include more 8xx optimizations, an
    e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
    various fixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
 vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
 uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
 Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
 RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
 r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
 lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
 9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
 oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
 PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
 Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
 WID2bc5yVtIL6p6x5ICH
 =d9Wh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - disable the 32-bit vdso when building LE, so we can build with a
   64-bit only toolchain.

 - EEH fixes from Gavin & Richard.

 - enable the sys_kcmp syscall from Laurent.

 - sysfs control for fastsleep workaround from Shreyas.

 - expose OPAL events as an irq chip by Alistair.

 - MSI ops moved to pci_controller_ops by Daniel.

 - fix for kernel to userspace backtraces for perf from Anton.

 - merge pseries and pseries_le defconfigs from Cyril.

 - CXL in-kernel API from Mikey.

 - OPAL prd driver from Jeremy.

 - fix for DSCR handling & tests from Anshuman.

 - Powernv flash mtd driver from Cyril.

 - dynamic DMA Window support on powernv from Alexey.

 - LLVM clang fixes & workarounds from Anton.

 - reworked version of the patch to abort syscalls when transactional.

 - fix the swap encoding to support 4TB, from Aneesh.

 - various fixes as usual.

 - Freescale updates from Scott: Highlights include more 8xx
   optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
   t1024/t1023 support, and various fixes and cleanup.

* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
  cxl: Fix typo in debug print
  cxl: Add CXL_KERNEL_API config option
  powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
  powerpc/mm: Change the swap encoding in pte.
  powerpc/mm: PTE_RPN_MAX is not used, remove the same
  powerpc/tm: Abort syscalls in active transactions
  powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
  powerpc/include: Add opal-prd to installed uapi headers
  powerpc/powernv: fix construction of opal PRD messages
  powerpc/powernv: Increase opal-irqchip initcall priority
  powerpc: Make doorbell check preemption safe
  powerpc/powernv: pnv_init_idle_states() should only run on powernv
  macintosh/nvram: Remove as unused
  powerpc: Don't use gcc specific options on clang
  powerpc: Don't use -mno-strict-align on clang
  powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
  powerpc: Only use -mabi=altivec if toolchain supports it
  powerpc: Fix duplicate const clang warning in user access code
  vfio: powerpc/spapr: Support Dynamic DMA windows
  vfio: powerpc/spapr: Register memory and define IOMMU v2
  ...
2015-06-24 08:46:32 -07:00
Eric Auger e6bcd47f8f VFIO: platform: enable ARM64 build
This patch enables building VFIO platform and derivatives on ARM64.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:47 -06:00
Eric Auger 713cc334a6 VFIO: platform: Calxeda xgmac reset module
This patch introduces a module that registers and implements a basic
reset function for the Calxeda xgmac device. This latter basically disables
interrupts and stops DMA transfers.

The reset function code is inherited from the native calxeda xgmac driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:38 -06:00
Eric Auger 3eeb0d510c VFIO: platform: populate the reset function on probe
The reset function lookup happens on vfio-platform probe. The reset
module load is requested  and a reference to the function symbol is
hold. The reference is released on vfio-platform remove.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:30 -06:00
Eric Auger 813ae66008 VFIO: platform: add reset callback
A new reset callback is introduced. If this callback is populated,
the reset is invoked on device first open/last close or upon userspace
ioctl.  The modality is exposed on VFIO_DEVICE_GET_INFO.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:22 -06:00
Eric Auger 9f85d8f9fa VFIO: platform: add reset struct and lookup table
This patch introduces the vfio_platform_reset_combo struct that
stores all the information useful to handle the reset modality:
compat string, name of the reset function, name of the module that
implements the reset function. A lookup table of such structures
is added, currently void.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-17 15:39:09 -06:00
Alexey Kardashevskiy e633bc86a9 vfio: powerpc/spapr: Support Dynamic DMA windows
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.

This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.

This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.

DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 2157e7b82f vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 46d3e1e162 vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.

This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:54 +10:00
Alexey Kardashevskiy 4793d65d1a vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:52 +10:00
Alexey Kardashevskiy 05c6cfb9dc powerpc/iommu/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy f87a88642e vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.

At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.

The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.

As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:47 +10:00
Alexey Kardashevskiy 0eaf4defc7 powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.

This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.

This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.

This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:15 +10:00
Alexey Kardashevskiy b348aa6529 powerpc/spapr: vfio: Replace iommu_table with iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.

This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;

This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.

For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.

For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.

For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.

This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:57 +10:00
Alexey Kardashevskiy 22af48596e vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 649354b75d vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make next patches simpler.

New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.

As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.

This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.

This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 3c56e822f8 vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 2d270df8f7 vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

This reworks debug messages to show the current value and the limit.

This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does not have an effect
now but it will with the multiple tables per container as then we will
allow attaching/detaching groups on fly and we may end up having
a container with no group attached but with the counter incremented.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
pid of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy 00663d4ee0 vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct
as page size can differ.

This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy e432bc7e15 vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page
This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy 9b14a1ff86 vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
to as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alex Williamson 20f300175a vfio/pci: Fix racy vfio_device_get_from_dev() call
Testing the driver for a PCI device is racy, it can be all but
complete in the release path and still report the driver as ours.
Therefore we can't trust drvdata to be valid.  This race can sometimes
be seen when one port of a multifunction device is being unbound from
the vfio-pci driver while another function is being released by the
user and attempting a bus reset.  The device in the remove path is
found as a dependent device for the bus reset of the release path
device, the driver is still set to vfio-pci, but the drvdata has
already been cleared, resulting in a null pointer dereference.

To resolve this, fix vfio_device_get_from_dev() to not take the
dev_get_drvdata() shortcut and instead traverse through the
iommu_group, vfio_group, vfio_device path to get a reference we
can trust.  Once we have that reference, we know the device isn't
in transition and we can test to make sure the driver is still what
we expect, so that we don't interfere with devices we don't own.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-09 10:08:57 -06:00
Will Deacon 8a0a01bff8 drivers/vfio: Allow type-1 IOMMU instantiation on top of an ARM SMMUv3
The ARM SMMUv3 driver is compatible with the notion of a type-1 IOMMU in
VFIO.

This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU_V3=y.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2015-05-29 11:12:40 +02:00
Gavin Shan 68cbbc3a9d drivers/vfio: Support EEH error injection
The patch adds one more EEH sub-command (VFIO_EEH_PE_INJECT_ERR)
to inject the specified EEH error, which is represented by
(struct vfio_eeh_pe_err), to the indicated PE for testing purpose.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-12 20:33:35 +10:00
Alex Williamson db7d4d7f40 vfio: Fix runaway interruptible timeout
Commit 13060b64b8 ("vfio: Add and use device request op for vfio
bus drivers") incorrectly makes use of an interruptible timeout.
When interrupted, the signal remains pending resulting in subsequent
timeouts occurring instantly.  This makes the loop spin at a much
higher rate than intended.

Instead of making this completely non-interruptible, we can change
this into a sort of interruptible-once behavior and use the "once"
to log debug information.  The driver API doesn't allow us to abort
and return an error code.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 13060b64b8
Cc: stable@vger.kernel.org # v4.0
2015-05-01 16:31:41 -06:00
Alex Williamson 5f55d2ae69 vfio-pci: Log device requests more verbosely
Log some clues indicating whether the user is receiving device
request interfaces or not listening.  This can help indicate why a
driver unbind is blocked or explain why QEMU automatically unplugged
a device from the VM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-05-01 14:00:53 -06:00
Alex Williamson 5a0ff17741 vfio-pci: Fix use after free
Reported by 0-day test infrastructure.

Fixes: ecaa1f6a01 ("vfio-pci: Add VGA arbiter client")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-08 08:11:51 -06:00
Alex Williamson 6eb7018705 vfio-pci: Move idle devices to D3hot power state
We can save some power by putting devices that are bound to vfio-pci
but not in use by the user in the D3hot power state.  Devices get
woken into D0 when opened by the user.  Resets return the device to
D0, so we need to re-apply the low power state after a bus reset.
It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it.

A new module parameter allows this feature to be disabled if there are
devices that misbehave as a result of this change.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:46 -06:00
Alex Williamson 561d72ddbb vfio-pci: Remove warning if try-reset fails
As indicated in the comment, this is not entirely uncommon and
causes user concern for no reason.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:44 -06:00
Alex Williamson 80c7e8cc2a vfio-pci: Allow PCI IDs to be specified as module options
This copies the same support from pci-stub for exactly the same
purpose, enabling a set of PCI IDs to be automatically added to the
driver's dynamic ID table at module load time.  The code here is
pretty simple and both vfio-pci and pci-stub are fairly unique in
being meta drivers, capable of attaching to any device, so there's no
attempt made to generalize the code into pci-core.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:43 -06:00
Alex Williamson ecaa1f6a01 vfio-pci: Add VGA arbiter client
If VFIO VGA access is disabled for the user, either by CONFIG option
or module parameter, we can often opt-out of VGA arbitration.  We can
do this when PCI bridge control of VGA routing is possible.  This
means that we must have a parent bridge and there must only be a
single VGA device below that bridge.  Fortunately this is the typical
case for discrete GPUs.

Doing this allows us to minimize the impact of additional GPUs, in
terms of VGA arbitration, when they are only used via vfio-pci for
non-VGA applications.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:41 -06:00
Alex Williamson 88c0dead9f vfio-pci: Add module option to disable VGA region access
Add a module option so that we don't require a CONFIG change and
kernel rebuild to disable VGA support.  Not only can VGA support be
troublesome in itself, but by disabling it we can reduce the impact
to host devices by doing a VGA arbitration opt-out.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-04-07 11:14:40 -06:00
Alex Williamson 71be3423a6 vfio: Split virqfd into a separate module for vfio bus drivers
An unintended consequence of commit 42ac9bd18d ("vfio: initialize
the virqfd workqueue in VFIO generic code") is that the vfio module
is renamed to vfio_core so that it can include both vfio and virqfd.
That's a user visible change that may break module loading scritps
and it imposes eventfd support as a dependency on the core vfio code,
which it's really not.  virqfd is intended to be provided as a service
to vfio bus drivers, so instead of wrapping it into vfio.ko, we can
make it a stand-alone module toggled by vfio bus drivers.  This has
the additional benefit of removing initialization and exit from the
core vfio code.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-17 08:33:38 -06:00
kbuild test robot 66fdc052d7 vfio: virqfd_lock can be static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-17 08:20:31 -06:00
Zhen Lei 2f51bf4be9 vfio: put off the allocation of "minor" in vfio_create_group
The next code fragment "list_for_each_entry" is not depend on "minor". With this
patch, the free of "minor" in "list_for_each_entry" can be reduced, and there is
no functional change.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:56 -06:00
Antonios Motakis a7fa7c77cf vfio/platform: implement IRQ masking/unmasking via an eventfd
With this patch the VFIO user will be able to set an eventfd that can be
used in order to mask and unmask IRQs of platform devices.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:55 -06:00
Antonios Motakis 42ac9bd18d vfio: initialize the virqfd workqueue in VFIO generic code
Now we have finally completely decoupled virqfd from VFIO_PCI. We can
initialize it from the VFIO generic code, in order to safely use it from
multiple independent VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:54 -06:00
Antonios Motakis 7e992d6927 vfio: move eventfd support code for VFIO_PCI to a separate file
The virqfd functionality that is used by VFIO_PCI to implement interrupt
masking and unmasking via an eventfd, is generic enough and can be reused
by another driver. Move it to a separate file in order to allow the code
to be shared.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:54 -06:00
Antonios Motakis 09bbcb8810 vfio: pass an opaque pointer on virqfd initialization
VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler
that implements masking/unmasking of IRQs via an eventfd. We can replace
it in the virqfd infrastructure with an opaque type so we can make use
of the mechanism from other VFIO bus drivers.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:53 -06:00
Antonios Motakis 9269c393e7 vfio: add local lock for virqfd instead of depending on VFIO PCI
The Virqfd code needs to keep accesses to any struct *virqfd safe, but
this comes into play only when creating or destroying eventfds, so sharing
the same spinlock with the VFIO bus driver is not necessary.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:52 -06:00
Antonios Motakis bb78e9eaab vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not really
PCI specific, since we plan to reuse the virqfd code with more VFIO drivers
in addition to VFIO_PCI.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
from "vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:52 -06:00
Antonios Motakis bdc5e1021b vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:51 -06:00
Antonios Motakis 06211b40ce vfio/platform: support for level sensitive interrupts
Level sensitive interrupts are exposed as maskable and automasked
interrupts and are masked and disabled automatically when they fire.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move masked interrupt initialization from "vfio/platform:
trigger an interrupt via eventfd"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:50 -06:00
Antonios Motakis 57f972e2b3 vfio/platform: trigger an interrupt via eventfd
This patch allows to set an eventfd for a platform device's interrupt,
and also to trigger the interrupt eventfd from userspace for testing.
Level sensitive interrupts are marked as maskable and are handled in
a later patch. Edge triggered interrupts are not advertised as maskable
and are implemented here using a simple and efficient IRQ handler.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: fix masked interrupt initialization]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:50 -06:00
Antonios Motakis 9a36321c8d vfio/platform: initial interrupts support code
This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which
most IRQ functionality is implemented in VFIO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:49 -06:00
Antonios Motakis 682704c41e vfio/platform: return IRQ info
Return information for the interrupts exposed by the device.
This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
and enables VFIO_DEVICE_GET_IRQ_INFO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:48 -06:00
Antonios Motakis fad4d5b1f0 vfio/platform: support MMAP of MMIO regions
Allow to memory map the MMIO regions of the device so userspace can
directly access them. PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:48 -06:00
Antonios Motakis 6e3f264560 vfio/platform: read and write support for the device fd
VFIO returns a file descriptor which we can use to manipulate the memory
regions of the device. Usually, the user will mmap memory regions that are
addressable on page boundaries, however for memory regions where this is
not the case we cannot provide mmap functionality due to security concerns.
For this reason we also allow to use read and write functions to the file
descriptor pointing to the memory regions.

We implement this functionality only for MMIO regions of platform devices;
PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:47 -06:00
Antonios Motakis e8909e67ca vfio/platform: return info for device memory mapped IO regions
This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call,
which allows the user to learn about the available MMIO resources of
a device.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:46 -06:00
Antonios Motakis 2e8567bbb5 vfio/platform: return info for bound device
A VFIO userspace driver will start by opening the VFIO device
that corresponds to an IOMMU group, and will use the ioctl interface
to get the basic device info, such as number of memory regions and
interrupts, and their properties. This patch enables the
VFIO_DEVICE_GET_INFO ioctl call.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added include in vfio_platform_common.c]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:46 -06:00
Antonios Motakis b13329adc2 vfio: amba: add the VFIO for AMBA devices module to Kconfig
Enable building the VFIO AMBA driver. VFIO_AMBA depends on VFIO_PLATFORM,
since it is sharing a portion of the code, and it is essentially implemented
as a platform device whose resources are discovered via AMBA specific APIs
in the kernel.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:45 -06:00
Antonios Motakis 36fe431f28 vfio: amba: VFIO support for AMBA devices
Add support for discovering AMBA devices with VFIO and handle them
similarly to Linux platform devices.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:44 -06:00
Antonios Motakis 5316153239 vfio: platform: add the VFIO PLATFORM module to Kconfig
Enable building the VFIO PLATFORM driver that allows to use Linux platform
devices with VFIO.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:44 -06:00
Antonios Motakis 9df85aaa43 vfio: platform: probe to devices on the platform bus
Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:43 -06:00
Antonios Motakis de49fc0d99 vfio/platform: initial skeleton of VFIO support for platform devices
This patch forms the common skeleton code for platform devices support
with VFIO. This will include the core functionality of VFIO_PLATFORM,
however binding to the device and discovering the device resources will
be done with the help of a separate file where any Linux platform bus
specific code will reside.

This will allow us to implement support for also discovering AMBA devices
and their resources, but still reuse a large part of the VFIO_PLATFORM
implementation.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added includes in vfio_platform_private.h]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-16 14:08:42 -06:00
Alexey Kardashevskiy ec76f40070 vfio-pci: Add missing break to enable VFIO_PCI_ERR_IRQ_INDEX
This adds a missing break statement to VFIO_DEVICE_SET_IRQS handler
without which vfio_pci_set_err_trigger() would never be called.

While we are here, add another "break" to VFIO_PCI_REQ_IRQ_INDEX case
so if we add more indexes later, we won't miss it.

Fixes: 6140a8f562 ("vfio-pci: Add device request interface")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-03-12 09:51:38 -06:00
Alex Williamson 6140a8f562 vfio-pci: Add device request interface
Userspace can opt to receive a device request notification,
indicating that the device should be released.  This is setup
the same way as the error IRQ and also supports eventfd signaling.
Future support may forcefully remove the device from the user if
the request is ignored.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:38:14 -07:00
Alex Williamson cac80d6e38 vfio-pci: Generalize setup of simple eventfds
We want another single vector IRQ index to support signaling of
the device request to userspace.  Generalize the error reporting
IRQ index to avoid code duplication.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:37:57 -07:00
Alex Williamson 13060b64b8 vfio: Add and use device request op for vfio bus drivers
When a request is made to unbind a device from a vfio bus driver,
we need to wait for the device to become unused, ie. for userspace
to release the device.  However, we have a long standing TODO in
the code to do something proactive to make that happen.  To enable
this, we add a request callback on the vfio bus driver struct,
which is intended to signal the user through the vfio device
interface to release the device.  Instead of passively waiting for
the device to become unused, we can now pester the user to give
it up.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-10 12:37:47 -07:00
Alex Williamson 4a68810dbb vfio: Tie IOMMU group reference to vfio group
Move the iommu_group reference from the device to the vfio_group.
This ensures that the iommu_group persists as long as the vfio_group
remains.  This can be important if all of the device from an
iommu_group are removed, but we still have an outstanding vfio_group
reference; we can still walk the empty list of devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 15:05:06 -07:00
Alex Williamson 60720a0fc6 vfio: Add device tracking during unbind
There's a small window between the vfio bus driver calling
vfio_del_group_dev() and the device being completely unbound where
the vfio group appears to be non-viable.  This creates a race for
users like QEMU/KVM where the kvm-vfio module tries to get an
external reference to the group in order to match and release an
existing reference, while the device is potentially being removed
from the vfio bus driver.  If the group is momentarily non-viable,
kvm-vfio may not be able to release the group reference until VM
shutdown, making the group unusable until that point.

Bridge the gap between device removal from the group and completion
of the driver unbind by tracking it in a list.  The device is added
to the list before the bus driver reference is released and removed
using the existing unbind notifier.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 15:05:06 -07:00
Alex Williamson c5e6688752 vfio/type1: Add conditional rescheduling
IOMMU operations can be expensive and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM will vfio assigned devices can result in soft
lockups and IOMMU tracing shows that we can easily spend 80% of our
time with need-resched set.  A sprinkling of conf_resched() calls
after map and unmap calls has a very tiny affect on performance
while resulting in traces with <1% of calls overflowing into needs-
resched.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 14:19:12 -07:00
Alex Williamson babbf17609 vfio/type1: Chunk contiguous reserved/invalid page mappings
We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping.  The only difference is that we don't
do locked memory accounting for these since they're not back by RAM.

In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 10:59:16 -07:00
Alex Williamson 6fe1010d6d vfio/type1: DMA unmap chunking
When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping.  This works great when the IOMMU
supports superpages *and* they're in use.  Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.

Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region.  For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available.  Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.

Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it.  We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-02-06 10:58:56 -07:00
Wei Yang 7c2e211f3c vfio-pci: Fix the check on pci device type in vfio_pci_probe()
Current vfio-pci just supports normal pci device, so vfio_pci_probe() will
return if the pci device is not a normal device. While current code makes a
mistake. PCI_HEADER_TYPE is the offset in configuration space of the device
type, but we use this value to mask the type value.

This patch fixs this by do the check directly on the pci_dev->hdr_type.

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org # v3.6+
2015-01-07 10:29:11 -07:00
Linus Torvalds cc669743a3 VFIO updates for v3.19-rc1
- s390 support (Frank Blaschka)
  - Enable iommu-type1 for ARM SMMU (Will Deacon)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUka+sAAoJECObm247sIsiiy0P/iqrQpv94Z7rUKRlV3K8YtJj
 8Oi5fLLnT2by9v5mS+KMElnQ5gLU/C5B/QGLMNrF2uQl8lguWSXJw37r7MkbIkpN
 RDLx1NhetLqbJ4CYLjyv/Jx3vl+Wr/2nNWRVIS5ajBmMjEgKVLvjYs4SaXELc3a8
 a3YzcGW10BrVFlCJgUYqYIFGS1BmKjf7fbD5YBocj8tPv6NAlCiNNYYr+0pzW8Lf
 GTi39JlZ2t06hDq33eiUkrySWNjrIBn4g4PfAl7HBAscsZKS1w18MD1qVw4UXXa2
 15+CBbsHU7ZLVo6G7vuZeJNCX9tdQ0WIZWQzHstQa914l86WYImTJ2tyH7Rn0ZcQ
 3Mu9fzef9JgjkI56ol2zDwuOs+qttOYaLWjhhHiW4jkIxdnljnesIFlvmM3XeDGz
 3Zowg09HzE3K+dt8265jVKkcNJbPLzspLvF27nPMudZNHBozcoPE0jKxU7QC3eMT
 Ij36+puQq+jccUic3Np6rxk5tzTHEat1a7w3IUwXCCUP5P5QW+kuuIjbd4hqQkHn
 VDRjnT6MWC3GguUCXR5VyO0zezpI20pTbWwE8u2qwnE349m0Eq/vxytj2lCLYLPR
 Jjtdduf1/Ppam7tATd3PwTu6KljY3dJiUUikyOc1J0KmkgkSMw+BtR6G7qytyW4q
 /fhClcsaNtxSheVh0N+b
 =1L2V
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - s390 support (Frank Blaschka)
 - Enable iommu-type1 for ARM SMMU (Will Deacon)

* tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio:
  drivers/vfio: allow type-1 IOMMU instantiation on top of an ARM SMMU
  vfio: make vfio run on s390
2014-12-17 10:44:22 -08:00
Jiang Liu 83a18912b0 PCI/MSI: Rename write_msi_msg() to pci_write_msi_msg()
Rename write_msi_msg() to pci_write_msi_msg() to mark it as PCI
specific.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-23 13:01:45 +01:00
Will Deacon 5e9f36c59a drivers/vfio: allow type-1 IOMMU instantiation on top of an ARM SMMU
The ARM SMMU driver is compatible with the notion of a type-1 IOMMU in
VFIO.

This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU=y.

Signed-off-by: Will Deacon <will.deacon@arm.com>
[aw: update for existing S390 patch]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-11-14 09:10:59 -07:00
Frank Blaschka 1d53a3a7d3 vfio: make vfio run on s390
add Kconfig switch to hide INTx
add Kconfig switch to let vfio announce PCI BARs are not mapable

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-11-07 09:52:22 -07:00
Linus Torvalds 23971bdfff IOMMU Updates for Linux v3.18
This pull-request includes:
 
 	* Change in the IOMMU-API to convert the former iommu_domain_capable
 	  function to just iommu_capable
 
 	* Various fixes in handling RMRR ranges for the VT-d driver (one fix
 	  requires a device driver core change which was acked
 	  by Greg KH)
 
 	* The AMD IOMMU driver now assigns and deassigns complete alias groups
 	  to fix issues with devices using the wrong PCI request-id
 
 	* MMU-401 support for the ARM SMMU driver
 
 	* Multi-master IOMMU group support for the ARM SMMU driver
 
 	* Various other small fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJUPNxYAAoJECvwRC2XARrjMwMP/RLSr+oA31rGVjLXcmcCHl7Q
 Uj7xpcnG19qB0aqNR1JeJuZNkK/tw44pE353MQPbz4N9UVUiogklGIVD1iJvFV53
 0qm84bvpDJIof4aP35B3H3Umft2USTn/lmsQg/RklQcNTW8DzNj63b8BTNR7k/GL
 G7bLg7F1BUCl0shZCCsFspOIulQPAJYN2OvHlfYBav/bfDvfouQ3lrV+loGrK44r
 F2Hmp+imXlIhUCjfbiWz6wKFxvPrxZx482vm2pXBCSnXEdW4/fz6nf9VHUK/Cfsq
 JAimY1CfiDo1aqH9/yVHUOw5SD/NYOXq6E5bFPg/WENbipbbae5cK2u6PX5MMBAn
 CG4BM8l9xicfGPqgn5YFSRY/6qC6K7NlxMnt9U8l18QIkDVDqEtUgJQISJuce7wx
 FWx6eSWaxpIe5yhq19/h2ELalUUyR/fPq+UXXjYDL1kLV/vcvC/lC3mbNAQU93zU
 WK0bG2tDg88JHavc25Ewa2aOn4BVM2BpwuLbYlgQReaEmsQRnEPgtmRNyLJHqbFE
 wwpCj8pBWdufsJWRyvpnXQ+CfA7oSz4e7hz1G+0/5uiDmagfvg16Ql5JtPmmuLUm
 Kc3dVIiG0s1ewohZIIJETGCqprQbCSqs8CCQqB6p2zDBWFKpNT7F38lm/KlehkCz
 JpAiI7Y2K9Jejp0VIPrt
 =OMOt
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU updates from Joerg Roedel:
 "This pull-request includes:

   - change in the IOMMU-API to convert the former iommu_domain_capable
     function to just iommu_capable

   - various fixes in handling RMRR ranges for the VT-d driver (one fix
     requires a device driver core change which was acked by Greg KH)

   - the AMD IOMMU driver now assigns and deassigns complete alias
     groups to fix issues with devices using the wrong PCI request-id

   - MMU-401 support for the ARM SMMU driver

   - multi-master IOMMU group support for the ARM SMMU driver

   - various other small fixes all over the place"

* tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits)
  iommu/vt-d: Work around broken RMRR firmware entries
  iommu/vt-d: Store bus information in RMRR PCI device path
  iommu/vt-d: Only remove domain when device is removed
  driver core: Add BUS_NOTIFY_REMOVED_DEVICE event
  iommu/amd: Fix devid mapping for ivrs_ioapic override
  iommu/irq_remapping: Fix the regression of hpet irq remapping
  iommu: Fix bus notifier breakage
  iommu/amd: Split init_iommu_group() from iommu_init_device()
  iommu: Rework iommu_group_get_for_pci_dev()
  iommu: Make of_device_id array const
  amd_iommu: do not dereference a NULL pointer address.
  iommu/omap: Remove omap_iommu unused owner field
  iommu: Remove iommu_domain_has_cap() API function
  IB/usnic: Convert to use new iommu_capable() API function
  vfio: Convert to use new iommu_capable() API function
  kvm: iommu: Convert to use new iommu_capable() API function
  iommu/tegra: Convert to iommu_capable() API function
  iommu/msm: Convert to iommu_capable() API function
  iommu/vt-d: Convert to iommu_capable() API function
  iommu/fsl: Convert to iommu_capable() API function
  ...
2014-10-15 07:23:49 +02:00
Linus Torvalds 27a9716bc8 VFIO updates for v3.18-rc1
- Nested IOMMU extension to type1 (Will Deacon)
  - Restore MSIx message before enabling (Gavin Shan)
  - Fix remove path locking (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUOOETAAoJECObm247sIsihDQP/jADEe9KFu4ymWu7rqi24w1L
 81hGNLXlfx2PPomluN3jENpyueo7vWdP5yZ8q/bi6oF6UbShL8Po01UKHOJzJJwW
 8GW86YcNsmPz/jl8Jcdbkex3dKvT1OzrDjFjCiKTJBHxE9nEdtWlRV8mO1pwd00t
 YFiXF8xFbkpHExMiQNU36rq/fzZCTOu4ZpCK9kDT7Sy+lsKAnGoXuM1IZK+7DGJo
 jcsMF32DVDmji6riy3uHHPc0qprP24QNVy6FfOmLEUvuOEIUOxMAYM9je9mmsHeS
 CeR/NHexr4RgYQE33jL1w8A1saT0rbu7DSKSa7OQebnY2Zte+oncLtqFZR2/Wylh
 jBU5r7P3PdxM6ykqEeC/3ytx7iFX6c7jc0SU4I5m8bFexmUQXqOko28gGIt0OL3n
 R8CmNF/MDs3gqYprhW6MvSJI1diY1+pX7pX0e7k7lDAoZ1QOjPNSGv+YOfF3H1YB
 AggIVxIKXW0T0bQ/hKcQiDKkxQ88vi1hld2LknbiBW9nMNLjNkxl2RZSGunFvWWN
 LzOYkBgR6rrTbhTvsWApsfYguYtGkgAGGJZSR1oev0BJnx4UHOfL1bykJRyUHdUd
 KDSBEni5TY65087IKD93nkyRhassszOa9XHmRDwQLxQeJCKRZi6bQRSzFZVheXIO
 O3XINOo2wNF1bIrfD/vR
 =s2+/
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - Nested IOMMU extension to type1 (Will Deacon)
 - Restore MSIx message before enabling (Gavin Shan)
 - Fix remove path locking (Alex Williamson)

* tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio:
  vfio-pci: Fix remove path locking
  drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL
  vfio/pci: Restore MSIx message prior to enabling
  PCI: Export MSI message relevant functions
  vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type
  iommu: introduce domain attribute for nesting IOMMUs
2014-10-11 06:49:24 -04:00
Alex Williamson 93899a679f vfio-pci: Fix remove path locking
Locking both the remove() and release() path results in a deadlock
that should have been obvious.  To fix this we can get and hold the
vfio_device reference as we evaluate whether to do a bus/slot reset.
This will automatically block any remove() calls, allowing us to
remove the explict lock.  Fixes 61d792562b.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org	[3.17]
2014-09-29 17:18:39 -06:00
Gavin Shan 0f905ce2b5 drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL
The function should have been exported with EXPORT_SYMBOL_GPL()
as part of commit 92d18a6851 ("drivers/vfio: Fix EEH build error").

Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:31:51 -06:00
Gavin Shan b8f02af096 vfio/pci: Restore MSIx message prior to enabling
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.

Fix the problem by restoring the host cached MSI message prior to
enabling each vector.

Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:16:24 -06:00
Will Deacon f5c9ecebaf vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type
VFIO allows devices to be safely handed off to userspace by putting
them behind an IOMMU configured to ensure DMA and interrupt isolation.
This enables userspace KVM clients, such as kvmtool and qemu, to further
map the device into a virtual machine.

With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU
translation services to the guest operating system, which are nested
with the existing translation installed by VFIO. However, enabling this
feature means that the IOMMU driver must be informed that the VFIO domain
is being created for the purposes of nested translation.

This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO
type-1 driver. The new IOMMU type acts identically to the
VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING
attribute on its IOMMU domains.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 10:06:19 -06:00
Chen, Gong 846fc70986 PCI/AER: Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND
In PCIe r1.0, sec 5.10.2, bit 0 of the Uncorrectable Error Status, Mask,
and Severity Registers was for "Training Error." In PCIe r1.1, sec 7.10.2,
bit 0 was redefined to be "Undefined."

Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND to reflect this change.

No functional change.

[bhelgaas: changelog]
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-09-25 09:42:40 -06:00
Joerg Roedel eb165f0584 vfio: Convert to use new iommu_capable() API function
Cc: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2014-09-25 15:47:45 +02:00
Alexey Kardashevskiy 9b936c960f drivers/vfio: Enable VFIO if EEH is not supported
The existing vfio_pci_open() fails upon error returned from
vfio_spapr_pci_eeh_open(), which breaks POWER7's P5IOC2 PHB
support which this patch brings back.

The patch fixes the issue by dropping the return value of
vfio_spapr_pci_eeh_open().

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:39:16 -06:00
Alexey Kardashevskiy 89a2edd62f drivers/vfio: Allow EEH to be built as module
This adds necessary declarations to the SPAPR VFIO EEH module,
otherwise multiple dynamic linker errors reported:

vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0)
vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0)
vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0)
vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0)
vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0)
vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0)

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:37:51 -06:00
Gavin Shan 92d18a6851 drivers/vfio: Fix EEH build error
The VFIO related components could be built as dynamic modules.
Unfortunately, CONFIG_EEH can't be configured to "m". The patch
fixes the build errors when configuring VFIO related components
as dynamic modules as follows:

  CC [M]  drivers/vfio/vfio_iommu_spapr_tce.o
In file included from drivers/vfio/vfio.c:33:0:
include/linux/vfio.h:101:43: warning: ‘struct pci_dev’ declared \
inside parameter list [enabled by default]
   :
  WRAP    arch/powerpc/boot/zImage.pseries
  WRAP    arch/powerpc/boot/zImage.maple
  WRAP    arch/powerpc/boot/zImage.pmac
  WRAP    arch/powerpc/boot/zImage.epapr
  MODPOST 1818 modules
ERROR: ".vfio_spapr_iommu_eeh_ioctl" [drivers/vfio/vfio_iommu_spapr_tce.ko]\
undefined!
ERROR: ".vfio_spapr_pci_eeh_open" [drivers/vfio/pci/vfio-pci.ko] undefined!
ERROR: ".vfio_spapr_pci_eeh_release" [drivers/vfio/pci/vfio-pci.ko] undefined!

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 10:36:20 -06:00
Alex Williamson bc4fba7712 vfio-pci: Attempt bus/slot reset on release
Each time a device is released, mark whether a local reset was
successful or whether a bus/slot reset is needed.  If a reset is
needed and all of the affected devices are bound to vfio-pci and
unused, allow the reset.  This is most useful when the userspace
driver is killed and releases all the devices in an unclean state,
such as when a QEMU VM quits.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:07 -06:00
Alex Williamson 61d792562b vfio-pci: Use mutex around open, release, and remove
Serializing open/release allows us to fix a refcnt error if we fail
to enable the device and lets us prevent devices from being unbound
or opened, giving us an opportunity to do bus resets on release.  No
restriction added to serialize binding devices to vfio-pci while the
mutex is held though.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:04 -06:00
Alex Williamson 9c22e660ce vfio-pci: Release devices with BusMaster disabled
Our current open/release path looks like this:

vfio_pci_open
  vfio_pci_enable
    pci_enable_device
    pci_save_state
    pci_store_saved_state

vfio_pci_release
  vfio_pci_disable
    pci_disable_device
    pci_restore_state

pci_enable_device() doesn't modify PCI_COMMAND_MASTER, so if a device
comes to us with it enabled, it persists through the open and gets
stored as part of the device saved state.  We then restore that saved
state when released, which can allow the device to attempt to continue
to do DMA.  When the group is disconnected from the domain, this will
get caught by the IOMMU, but if there are other devices in the group,
the device may continue running and interfere with the user.  Even in
the former case, IOMMUs don't necessarily behave well and a stream of
blocked DMA can result in unpleasant behavior on the host.

Explicitly disable Bus Master as we're enabling the device and
slightly re-work release to make sure that pci_disable_device() is
the last thing that touches the device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 11:12:02 -06:00
Gavin Shan 1b69be5e8a drivers/vfio: EEH support for VFIO PCI device
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-05 15:28:48 +10:00
Linus Torvalds a68a7509d3 A handful of VFIO bug fixes for v3.16
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTkiFBAAoJECObm247sIsiymsQALDs03j1lm1ZUx3peI8JuEGp
 +4WrflS4eKBX2r+XOSqHQNlg5ceKpj1zsEMKXkLFBCwTEZXRz9tK7bu6hsRuE8XR
 j67JkBm4alsJQrJIQl0+dYJMFPVGWBNyoBc4kV2YbernBnFPc/5oOmP9wAJD80OI
 HHNQxOdfce+YZnkgltc/GkOWdG/YvuQbkyt2YmGtTePapzboj6Tf0yvkqRDaL0Bf
 07+vYi6ZOSzx/Niw/ak09hll4OhYfgiOQ/t6LETmLi7PEGR88TYO/kqdzGmlnx6V
 XsDZ+WORWjtORGnyJI4p7iVFiP7mz4mmwUptfJcA0AX98csEdYHt6H4mlVkN7Gmd
 rSvarKQ7On8MRbymQIVw5w3V78X/HiWUsUi3Fef5xytQyuD4VxylS0HDtPM/AyeN
 3zEJKOI+CIkFLZ6fReL8LKZuaRdQTbJvjxMVvmTr2XrBxZTlk91CsqT3JFXjJ6CV
 2QEv2eVmsCkPtScWNZzaeQsgH75Hc8vznZxA+/BFDoeR2iq31BPE9iuYDpaVZS58
 mnAEll4tPMIqcqQvxwoB8B8kyzX636trwW+585jr3gKRS8V6qHaWkXz06N5kzz7s
 dhRUy3rKe1SeutVo0Vs067Ym60yUlvuTeWTppuOnfRpgIpbCLeMwuiA22Wq3hBIF
 loXH9eAZOuwzBg1hqVWJ
 =aR4I
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio into next

Pull VFIO updates from Alex Williamson:
 "A handful of VFIO bug fixes for v3.16"

* tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio:
  drivers/vfio/pci: Fix wrong MSI interrupt count
  drivers/vfio: Rework offsetofend()
  vfio/iommu_type1: Avoid overflow
  vfio/pci: Fix unchecked return value
  vfio/pci: Fix sizing of DPA and THP express capabilities
2014-06-07 20:12:15 -07:00
Gavin Shan fd49c81f08 drivers/vfio/pci: Fix wrong MSI interrupt count
According PCI local bus specification, the register of Message
Control for MSI (offset: 2, length: 2) has bit#0 to enable or
disable MSI logic and it shouldn't be part contributing to the
calculation of MSI interrupt count. The patch fixes the issue.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:54 -06:00
Alex Williamson c8dbca165b vfio/iommu_type1: Avoid overflow
Coverity reports use of a tained scalar used as a loop boundary.
For the most part, any values passed from userspace for a DMA mapping
size, IOVA, or virtual address are valid, with some alignment
constraints.  The size is ultimately bound by how many pages the user
is able to lock, IOVA is tested by the IOMMU driver when doing a map,
and the virtual address needs to pass get_user_pages.  The only
problem I can find is that we do expect the __u64 user values to fit
within our variables, which might not happen on 32bit platforms.  Add
a test for this and return error on overflow.  Also propagate use of
the type-correct local variables throughout the function.

The above also points to the 'end' variable, which can be zero if
we're operating at the very top of the address space.  We try to
account for this, but our loop botches it.  Rework the loop to use
the remaining size as our loop condition rather than the IOVA vs end.

Detected by Coverity: CID 714659

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:54 -06:00
Alex Williamson eb5685f0de vfio/pci: Fix unchecked return value
There's nothing we can do different if pci_load_and_free_saved_state()
fails, other than maybe print some log message, but the actual re-load
of the state is an unnecessary step here since we've only just saved
it.  We can cleanup a coverity warning and eliminate the unnecessary
step by freeing the state ourselves.

Detected by Coverity: CID 753101

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 11:35:53 -06:00
Alex Williamson afa63252b2 vfio/pci: Fix sizing of DPA and THP express capabilities
When sizing the TPH capability we store the register containing the
table size into the 'dword' variable, but then use the uninitialized
'byte' variable to analyze the size.  The table size is also actually
reported as an N-1 value, so correct sizing to account for this.

The round_up() for both TPH and DPA is unnecessary, remove it.

Detected by Coverity: CID 714665 & 715156

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 10:50:31 -06:00
Jean Delvare 8283b4919e driver core: dev_set_drvdata can no longer fail
So there is no point in checking its return value, which will soon
disappear.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-27 13:40:51 -07:00
Linus Torvalds d0cb5f71c5 VFIO updates for v3.15 include:
- Allow the vfio-type1 IOMMU to support multiple domains within a container
 - Plumb path to query whether all domains are cache-coherent
 - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
 - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
 
 The first patch also makes the vfio-type1 IOMMU driver completely independent
 of the bus_type of the devices it's handling, which enables it to be used for
 both vfio-pci and a future vfio-platform (and hopefully combinations involving
 both simultaneously).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTPbikAAoJECObm247sIsiG9cP/jnurW84YuAHmybzy3R4nMaa
 8BWcQST+TyQ78GWvubFDcRt+vHmJUI4iFWPBIm1twBSVrMy6F1GcT/spmSne1Dfb
 OvcfGEy59tGO+BeklHg5Qq2Hj1UzfeV9uQoy+PrpTF/sYzsyL6g5+O3Zv39SNvr0
 zCwnO2JAKXIpQlkK3wVha6V13X07Z/+d2b4JSsvmON2cnlPzOUYNrgBvoSeV2IQe
 3kMAU9qfAfQlpNDhRRZL/HshgWazCg4XZUp9UdNjUkOxrUp4vXHrVTUJnUGBDytk
 8V9UGRKF3mik9PqpZJk4jLV5urgUVpnUR5747uqs0KF+9GWxClXvh3gp1XX8Zn7f
 NCfDSMn/wrDr6fQBglKUadITaDhF49KV+J0cX083q46BbYxhAfgxv1I9UOvLreG6
 SGAezlTB0mefa1BmSwJdfKwDBsuDOhbF0TsHC6zT7yFc8pqe3hl/vQzUyu3HtwuM
 yPycQiQQmvOaymaiiBRwyLocwJbDNKPigDomv8NZ8Mcd97Fu9YsF4ydaxMJmmeKf
 TffYS4z4tUElYiU2vv1ipB8T+ReCmdtj2cIvyfwTL9jycY9ouO4bJ56dOGWd3WNl
 m7AVokbrrJQ03itUpFMuYjHAzTNUlXVvXlGr9n6L4VTR0RTL1zwJVrMVDsXdhxm4
 XgDbVB5PrxinDAC1vJvI
 =mnzO
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 "VFIO updates for v3.15 include:

   - Allow the vfio-type1 IOMMU to support multiple domains within a
     container
   - Plumb path to query whether all domains are cache-coherent
   - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
   - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)

  The first patch also makes the vfio-type1 IOMMU driver completely
  independent of the bus_type of the devices it's handling, which
  enables it to be used for both vfio-pci and a future vfio-platform
  (and hopefully combinations involving both simultaneously)"

* tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: always select ANON_INODES
  kvm/vfio: Support for DMA coherent IOMMUs
  vfio: Add external user check extension interface
  vfio/type1: Add extension to test DMA cache coherence of IOMMU
  vfio/iommu_type1: Multi-IOMMU domain support
2014-04-03 14:05:02 -07:00
Linus Torvalds 4b1779c2cf PCI changes for the v3.15 merge window:
Enumeration
     - Increment max correctly in pci_scan_bridge() (Andreas Noever)
     - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
     - Assign CardBus bus number only during the second pass (Andreas Noever)
     - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
     - Make sure bus number resources stay within their parents bounds (Andreas Noever)
     - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
     - Check for child busses which use more bus numbers than allocated (Andreas Noever)
     - Don't scan random busses in pci_scan_bridge() (Andreas Noever)
     - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
     - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
     - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
     - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
     - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
 
   NUMA
     - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
     - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
     - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
     - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
     - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
     - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
     - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
     - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
 
   Resource management
     - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
     - Add resource_contains() (Bjorn Helgaas)
     - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
     - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
     - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
     - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
     - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
     - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
     - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
     - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
     - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
     - s390: Use generic pci_enable_resources() (Bjorn Helgaas)
     - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
     - Set type in __request_region() (Bjorn Helgaas)
     - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
     - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
     - Log IDE resource quirk in dmesg (Bjorn Helgaas)
     - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
 
   PCI device hotplug
     - Make check_link_active() non-static (Rajat Jain)
     - Use link change notifications for hot-plug and removal (Rajat Jain)
     - Enable link state change notifications (Rajat Jain)
     - Don't disable the link permanently during removal (Rajat Jain)
     - Don't check adapter or latch status while disabling (Rajat Jain)
     - Disable link notification across slot reset (Rajat Jain)
     - Ensure very fast hotplug events are also processed (Rajat Jain)
     - Add hotplug_lock to serialize hotplug events (Rajat Jain)
     - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
     - Don't turn slot off when hot-added device already exists (Yijing Wang)
 
   MSI
     - Keep pci_enable_msi() documentation (Alexander Gordeev)
     - ahci: Fix broken single MSI fallback (Alexander Gordeev)
     - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
     - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
     - Fix leak of msi_attrs (Greg Kroah-Hartman)
     - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
 
   Virtualization
     - Device-specific ACS support (Alex Williamson)
 
   Freescale i.MX6
     - Wait for retraining (Marek Vasut)
 
   Marvell MVEBU
     - Use Device ID and revision from underlying endpoint (Andrew Lunn)
     - Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
     - Call request_resource() on the apertures (Jason Gunthorpe)
     - Fix potential issue in range parsing (Jean-Jacques Hiblot)
 
   Renesas R-Car
     - Check platform_get_irq() return code (Ben Dooks)
     - Add error interrupt handling (Ben Dooks)
     - Fix bridge logic configuration accesses (Ben Dooks)
     - Register each instance independently (Magnus Damm)
     - Break out window size handling (Magnus Damm)
     - Make the Kconfig dependencies more generic (Magnus Damm)
 
   Synopsys DesignWare
     - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
 
   Miscellaneous
     - Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
     - Enable INTx if BIOS left them disabled (Bjorn Helgaas)
     - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
     - Clean up par-arch object file list (Liviu Dudau)
     - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
     - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
     - Fix pci_bus_b() build failure (Paul Gortmaker)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJTOdAZAAoJEFmIoMA60/r8VYUQALRrReyMBk3pjRt/fKIX4Kwi
 ydSo/YJeeKTN8K93fLw8bb8bdPItJScJFTfEa4Q2SpZezR/ecGXLowisy0BBaPHK
 qtOyB8EqjkLS17GfyecIe9Nd2SIAI2De/0bchK3kDtIX1YlZB/k/tD3eCPMHDnnl
 m8c5kAHKPQYd8g01I+S8nrtGHk/A33grfYpJXPZbcqyhE0lWU3SI8KDAGbcKzNHE
 23Do0yNyd4nHIdixWlhETcNvzHn35Q/O38JJwW9Mf1aI9gusYuml6GFefCgu/iov
 lxqp3CEW7iPZgQEgNbrQ0HzWn/durL2Trd6S/Yh6f2xbm1LGYKWh3LZUFLd3AQDd
 INEpUgKsyb//nF3dtiyGnZlp0QykoqFyLo2AEDrb+ILTd4up5DeRY/m1UpjAXR5p
 QicBmrDksHrSivPmMZwLx1DFQYKjQbdx5lOqy9hQM/Jmsr+N3/l7QBrbQWXks3JZ
 NNAyn4RZHQB7UDQS/MmVPArs+JK5qaEDQD57QuOTlqgP19VY9C9E/l/aEqefjdFo
 XOAm7CwGpB/iBAkIbE6ROEDiJArigRVHEfxLYeE/jtGOdRDCD1deWk+g3S8DWD7m
 ZxWSgIVB00PMAmomczdg59YVFBhocgwPUa8/cw6yqzx2QKP4mWXIFZ/Sjau5I3tn
 WWoxXlUirZfTJc29XnVy
 =3mNS
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Enumeration
   - Increment max correctly in pci_scan_bridge() (Andreas Noever)
   - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
   - Assign CardBus bus number only during the second pass (Andreas Noever)
   - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
   - Make sure bus number resources stay within their parents bounds (Andreas Noever)
   - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
   - Check for child busses which use more bus numbers than allocated (Andreas Noever)
   - Don't scan random busses in pci_scan_bridge() (Andreas Noever)
   - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
   - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
   - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
   - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
   - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)

  NUMA
   - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
   - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
   - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
   - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
   - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
   - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
   - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
   - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)

  Resource management
   - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
   - Add resource_contains() (Bjorn Helgaas)
   - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
   - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
   - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
   - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
   - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
   - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
   - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
   - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
   - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
   - s390: Use generic pci_enable_resources() (Bjorn Helgaas)
   - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
   - Set type in __request_region() (Bjorn Helgaas)
   - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
   - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
   - Log IDE resource quirk in dmesg (Bjorn Helgaas)
   - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)

  PCI device hotplug
   - Make check_link_active() non-static (Rajat Jain)
   - Use link change notifications for hot-plug and removal (Rajat Jain)
   - Enable link state change notifications (Rajat Jain)
   - Don't disable the link permanently during removal (Rajat Jain)
   - Don't check adapter or latch status while disabling (Rajat Jain)
   - Disable link notification across slot reset (Rajat Jain)
   - Ensure very fast hotplug events are also processed (Rajat Jain)
   - Add hotplug_lock to serialize hotplug events (Rajat Jain)
   - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
   - Don't turn slot off when hot-added device already exists (Yijing Wang)

  MSI
   - Keep pci_enable_msi() documentation (Alexander Gordeev)
   - ahci: Fix broken single MSI fallback (Alexander Gordeev)
   - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
   - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
   - Fix leak of msi_attrs (Greg Kroah-Hartman)
   - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)

  Virtualization
   - Device-specific ACS support (Alex Williamson)

  Freescale i.MX6
   - Wait for retraining (Marek Vasut)

  Marvell MVEBU
   - Use Device ID and revision from underlying endpoint (Andrew Lunn)
   - Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
   - Call request_resource() on the apertures (Jason Gunthorpe)
   - Fix potential issue in range parsing (Jean-Jacques Hiblot)

  Renesas R-Car
   - Check platform_get_irq() return code (Ben Dooks)
   - Add error interrupt handling (Ben Dooks)
   - Fix bridge logic configuration accesses (Ben Dooks)
   - Register each instance independently (Magnus Damm)
   - Break out window size handling (Magnus Damm)
   - Make the Kconfig dependencies more generic (Magnus Damm)

  Synopsys DesignWare
   - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)

  Miscellaneous
   - Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
   - Enable INTx if BIOS left them disabled (Bjorn Helgaas)
   - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
   - Clean up par-arch object file list (Liviu Dudau)
   - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
   - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
   - Fix pci_bus_b() build failure (Paul Gortmaker)"

* tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits)
  Revert "[PATCH] Insert GART region into resource map"
  PCI: Log IDE resource quirk in dmesg
  PCI: Change pci_bus_alloc_resource() type_mask to unsigned long
  PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
  resources: Set type in __request_region()
  PCI: Don't check resource_size() in pci_bus_alloc_resource()
  s390/PCI: Use generic pci_enable_resources()
  tile PCI RC: Use default pcibios_enable_device()
  sparc/PCI: Use default pcibios_enable_device() (Leon only)
  sh/PCI: Use default pcibios_enable_device()
  microblaze/PCI: Use default pcibios_enable_device()
  alpha/PCI: Use default pcibios_enable_device()
  PCI: Add "weak" generic pcibios_enable_device() implementation
  PCI: Don't enable decoding if BAR hasn't been assigned an address
  PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled
  PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit
  PCI: Don't try to claim IORESOURCE_UNSET resources
  PCI: Check IORESOURCE_UNSET before updating BAR
  PCI: Don't clear IORESOURCE_UNSET when updating BAR
  PCI: Mark resources as IORESOURCE_UNSET if we can't assign them
  ...

Conflicts:
	arch/x86/include/asm/topology.h
	drivers/ata/ahci.c
2014-04-01 15:14:04 -07:00
Arnd Bergmann 4379d2ae15 vfio: always select ANON_INODES
The vfio code cannot be built when CONFIG_ANON_INODES is
disabled, so this enforces the symbol to be enabled through
Kconfig.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-03-27 11:58:58 -06:00
David Rientjes 668f9abbd4 mm: close PageTail race
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).

This results in a very rare NULL pointer dereference on the
aforementioned page_count(page).  Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation.  This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.  The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.

This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.

Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 07:55:47 -08:00
Alex Williamson 88d7ab8949 vfio: Add external user check extension interface
This lets us check extensions, particularly VFIO_DMA_CC_IOMMU using
the external user interface, allowing KVM to probe IOMMU coherency.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 11:38:39 -07:00
Alex Williamson aa42931827 vfio/type1: Add extension to test DMA cache coherence of IOMMU
Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
be able to test whether coherency is currently enforced.  Add an
extension for this.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 11:38:37 -07:00
Alex Williamson 1ef3e2bc04 vfio/iommu_type1: Multi-IOMMU domain support
We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime.  For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control.  If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables.  To add to the complexity, we can't know
the properties of a domain until a device is attached.

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case.  How do we know that a page pinned in one
container is the same page as a different container and avoid double
billing the user for the page.

The solution is therefore to support multiple IOMMU domains per
container.  In the majority of cases, only one domain will be required
since hardware is typically consistent within a system.  However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.

To do this, our DMA tracking needs to change.  We currently try to
coalesce user mappings into as few tracking entries as possible.  The
problem then becomes that we lose granularity of user mappings.  We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping.  This coalescing code is therefore removed, allowing
only unmaps covering complete maps.  The change in accounting is
fairly small here, a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable if this coalescing was ever needed.

We also move IOMMU domain creation to the point where a group is
attached to the container.  An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user.  When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container.  And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
2014-02-26 11:38:36 -07:00
Alexander Gordeev 94cccde648 vfio: Use pci_enable_msi_range() and pci_enable_msix_range()
pci_enable_msix() and pci_enable_msi_block() have been deprecated; use
pci_enable_msix_range() and pci_enable_msi_range() instead.

[bhelgaas: changelog]
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-14 14:23:40 -07:00
Linus Torvalds 1b17366d69 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "So here's my next branch for powerpc.  A bit late as I was on vacation
  last week.  It's mostly the same stuff that was in next already, I
  just added two patches today which are the wiring up of lockref for
  powerpc, which for some reason fell through the cracks last time and
  is trivial.

  The highlights are, in addition to a bunch of bug fixes:

   - Reworked Machine Check handling on kernels running without a
     hypervisor (or acting as a hypervisor).  Provides hooks to handle
     some errors in real mode such as TLB errors, handle SLB errors,
     etc...

   - Support for retrieving memory error information from the service
     processor on IBM servers running without a hypervisor and routing
     them to the memory poison infrastructure.

   - _PAGE_NUMA support on server processors

   - 32-bit BookE relocatable kernel support

   - FSL e6500 hardware tablewalk support

   - A bunch of new/revived board support

   - FSL e6500 deeper idle states and altivec powerdown support

  You'll notice a generic mm change here, it has been acked by the
  relevant authorities and is a pre-req for our _PAGE_NUMA support"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
  powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
  powerpc: Add support for the optimised lockref implementation
  powerpc/powernv: Call OPAL sync before kexec'ing
  powerpc/eeh: Escalate error on non-existing PE
  powerpc/eeh: Handle multiple EEH errors
  powerpc: Fix transactional FP/VMX/VSX unavailable handlers
  powerpc: Don't corrupt transactional state when using FP/VMX in kernel
  powerpc: Reclaim two unused thread_info flag bits
  powerpc: Fix races with irq_work
  Move precessing of MCE queued event out from syscall exit path.
  pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
  powerpc: Make add_system_ram_resources() __init
  powerpc: add SATA_MV to ppc64_defconfig
  powerpc/powernv: Increase candidate fw image size
  powerpc: Add debug checks to catch invalid cpu-to-node mappings
  powerpc: Fix the setup of CPU-to-Node mappings during CPU online
  powerpc/iommu: Don't detach device without IOMMU group
  powerpc/eeh: Hotplug improvement
  powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
  powerpc/eeh: Add restore_config operation
  ...
2014-01-27 21:11:26 -08:00
Linus Torvalds 2d08cd0ef8 - Convert to misc driver to support module auto loading
- Remove unnecessary and dangerous use of device_lock
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS4T1aAAoJECObm247sIsiLMkQAKn11qSLk880dJO+cfW2VpqL
 vm3WreShOxNsKTEQfcbrnJtS+kYGkiyo8+Vve377QDIg+xeXpNZa0PbAEfh4HyuJ
 NH+vn0FYDf3rzF4Z9snm2HOHWCAsqLCE0guyMwwhUYH+fks4+YTJWCm4YWZOXbIS
 aI++G+5LBxGzMRbotoovqUhQy474fpQN/GFIlIZHP2bGtdeXiRuH0qH3BxVKafOF
 50RtAgJsLm1Nb+S20Umsz+llL3sop1wcPLKOlgCXqd+545CJAoq/qqEPjUtOrr+S
 qXvXhb0XM7zo73H2WzkVmS09HZWX4sMoHBYp8D6ToOIFF01LOVsXBKduPCPpTD7E
 zohhjgag9/RQ99l2B153IIIv0Bsg7b4YXBQ6qkWFoJolPL94LTq1PXk0f9mNhyPo
 mSqatcAeI5qtjqu9ncSd4YYTdBgp7SJcgwji8fI44tvzzpz8iheVUrpwcU1UumAK
 TpXLBoTLXllK1op3u9xgFrF4ISBMl7lZeZp3c1/1YRga7Ch6SdlB0tcLPfuSBRF0
 1Qb5jQrWz4qt/oZSkapcFALXQDgwoK8am9I0a5YXdhy9gSYm+o/TLwG5pTEnF4In
 xxuibmmJAmNcTJ9q6rUKjLLU/TqRKin+vyqW6is41IredgvNX8XDS3YokA5Fgv01
 mu1s7odFi08xjPRuYvmq
 =t9+R
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio

Pull vfio update from Alex Williamson:
 - convert to misc driver to support module auto loading
 - remove unnecessary and dangerous use of device_lock

* tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio:
  vfio-pci: Don't use device_lock around AER interrupt setup
  vfio: Convert control interface to misc driver
  misc: Reserve minor for VFIO
2014-01-24 17:42:31 -08:00
Alex Williamson 890ed578df vfio-pci: Use pci "try" reset interface
PCI resets will attempt to take the device_lock for any device to be
reset.  This is a problem if that lock is already held, for instance
in the device remove path.  It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock.  Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces.  This prevents the user
from being able to induce a deadlock by triggering a reset.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-01-15 10:43:17 -07:00
Alex Williamson 3be3a074cf vfio-pci: Don't use device_lock around AER interrupt setup
device_lock is much too prone to lockups.  For instance if we have a
pending .remove then device_lock is already held.  If userspace
attempts to modify AER signaling after that point, a deadlock occurs.
eventfd setup/teardown is already protected in vfio with the igate
mutex.  AER is not a high performance interrupt, so we can also use
the same mutex to protect signaling versus setup races.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-01-14 16:12:55 -07:00
Alistair Popple e589a4404f powerpc/iommu: Update constant names to reflect their hardcoded page size
The powerpc iommu uses a hardcoded page size of 4K. This patch changes
the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A
future patch will use the existing names to support dynamic page
sizes.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-30 14:17:06 +11:00
Alex Williamson d10999016f vfio: Convert control interface to misc driver
This change allows us to support module auto loading using devname
support in userspace tools.  With this, /dev/vfio/vfio will always
be present and opening it will cause the vfio module to load.  This
should avoid needing to configure the system to statically load
vfio in order to get libvirt to correctly detect support for it.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-12-19 10:17:13 -07:00
Alex Williamson 274127a1fd PCI: Rename PCI_VC_PORT_REG1/2 to PCI_VC_PORT_CAP1/2
These are set of two capability registers, it's pretty much given that
they're registers, so reflect their purpose in the name.

Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2013-12-17 17:49:39 -07:00
Antonios Motakis d93b3ac0ed VFIO: vfio_iommu_type1: fix bug caused by break in nested loop
In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking
that pages are not already mapped. Since the check is being done in a
for loop nested within the main loop, breaking out of it does not create
the intended behavior. If the underlying IOMMU driver returns a non-NULL
value, this will be ignored and mapping the DMA range will be attempted
anyway, leading to unpredictable behavior.

This interracts badly with the ARM SMMU driver issue fixed in the patch
that was submitted with the title:
"[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys"
Both fixes are required in order to use the vfio_iommu_type1 driver
with an ARM SMMU.

This patch refactors the function slightly, in order to also make this
kind of bug less likely.

Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-10-11 10:40:46 -06:00
Alex Williamson 8b27ee60bf vfio-pci: PCI hot reset interface
The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function.  This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset.  FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design.  We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.

This device specific extension to VFIO provides the user with this
ability.  Two new ioctls are introduced:
 - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
 - VFIO_DEVICE_PCI_HOT_RESET

The first provides the user with information about the extent of
devices affected by a hot reset.  This is essentially a list of
devices and the IOMMU groups they belong to.  The user may then
initiate a hot reset by calling the second ioctl.  We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset.  Each group
must have IOMMU protection established for the ioctl to succeed.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-09-04 11:28:04 -06:00
Alex Williamson 17638db1b8 vfio-pci: Test for extended config space
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities.  Both specs define that the first capability
header is all zero if there are no extended capabilities.  Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-09-04 10:58:52 -06:00
Alex Williamson 20e7745784 vfio-pci: Use fdget() rather than eventfd_fget()
eventfd_fget() tests to see whether the file is an eventfd file, which
we then immediately pass to eventfd_ctx_fileget(), which again tests
whether the file is an eventfd file.  Simplify slightly by using
fdget() so that we only test that we're looking at an eventfd once.
fget() could also be used, but fdget() makes use of fget_light() for
another slight optimization.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-28 09:49:55 -06:00
Alex Williamson 5d042fbdbb vfio: Add O_CLOEXEC flag to vfio device fd
Add the default O_CLOEXEC flag for device file descriptors.  This is
generally considered a safer option as it allows the user a race free
option to decide whether file descriptors are inherited across exec,
with the default avoiding file descriptor leaks.

Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22 10:33:41 -06:00
Yann Droneaud a5d550703d vfio: use get_unused_fd_flags(0) instead of get_unused_fd()
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), functions anon_inode_getfd()
or get_unused_fd_flags() should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.

In a further patch, get_unused_fd() will be removed so that
new code start using anon_inode_getfd() or get_unused_fd_flags()
with correct flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem basis,
and, if possible, set to O_CLOEXEC.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22 10:20:05 -06:00
Alexey Kardashevskiy 6cdd978213 vfio: add external user support
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.

However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on a host to avoid passing map/unmap requests to the user space which
would made things pretty slow.

The protocol includes:

1. do normal VFIO init operation:
	- opening a new container;
	- attaching group(s) to it;
	- setting an IOMMU driver for a container.
When IOMMU is set for a container, all groups in it are
considered ready to use by an external user.

2. User space passes a group fd to an external user.
The external user calls vfio_group_get_external_user()
to verify that:
	- the group is initialized;
	- IOMMU is set for it.
If both checks passed, vfio_group_get_external_user()
increments the container user counter to prevent
the VFIO group from disposal before KVM exits.

3. The external user calls vfio_external_user_iommu_id()
to know an IOMMU ID. PPC64 KVM uses it to link logical bus
number (LIOBN) with IOMMU ID.

4. When the external KVM finishes, it calls
vfio_group_put_external_user() to release the VFIO group.
This call decrements the container user counter.
Everything gets released.

The "vfio: Limit group opens" patch is also required for the consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-05 10:52:36 -06:00
Alex Williamson d24cdbfd28 vfio-pci: Avoid deadlock on remove
If an attempt is made to unbind a device from vfio-pci while that
device is in use, the request is blocked until the device becomes
unused.  Unfortunately, that unbind path still grabs the device_lock,
which certain things like __pci_reset_function() also want to take.
This means we need to try to acquire the locks ourselves and use the
pre-locked version, __pci_reset_function_locked().

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:41 -06:00
Alex Williamson c64019302b vfio: Ignore sprurious notifies
Remove debugging WARN_ON if we get a spurious notify for a group that
no longer exists.  No reports of anyone hitting this, but it would
likely be a race and not a bug if they did.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:40 -06:00
Alex Williamson de9c7602ca vfio: Don't overreact to DEL_DEVICE
BUS_NOTIFY_DEL_DEVICE triggers IOMMU drivers to remove devices from
their iommu group, but there's really nothing we can do about it at
this point.  If the device is in use, then the vfio sub-driver will
block the device_del from completing until it's released.  If the
device is not in use or not owned by a vfio sub-driver, then we
really don't care that it's being removed.

The current code can be triggered just by unloading an sr-iov driver
(ex. igb) while the VFs are attached to vfio-pci because it makes an
incorrect assumption about the ordering of driver remove callbacks
vs the DEL_DEVICE notification.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24 16:36:00 -06:00
Linus Torvalds 15a49b9a90 vfio Updates for v3.11
Largely hugepage support for vfio/type1 iommu and surrounding cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR2uNvAAoJECObm247sIsiJRYQAJK15MfXgJq2PtBABNvFUAOG
 nqUvLgBgM5Ow1NI0Rzh9jkNohNqCvXDFGaWXXnsaX83hIpi59GFK31W2E3SiFCj3
 xISA9SUnm7Kjt9LAF6HTNz805zBkshIOk4MCx6HlezVWSRlWwT3rZzI4dI2fMvl8
 iPRk1Ion3QSQui99HWfXv/rtezAIzgZqsFqPC6DjWRfN7LcdEtKtcQwnrSb5GGY9
 3TIRY9IRYTSfJ2yjSz5f5258JxoDG5sR8dTMkgG2Gm92iGvGcPGpzQWPzVc4t+TO
 PdTqtv9ftEyAJKsYTFjPIod8XbzJBa1FSPadVAIfwF0JCDcsSFjoWGp+RzMQQSF8
 MK3VsnQ/pqJfs2nJHDQbWbKu0qWYPntvOCdojZ4679ceDTd0t515npfYeDQuX8yU
 fAA5rB46mDXjyxikTP574NdnkcGjbAj7EOCp7s+WTsVPGQQ3mId/3fQw0Wg7bE6v
 jaJqdRj70SNTRHs8DFLQhvSZgpef4RzepE4sRBZqzY4vWd4riNcAC3Got+F2rQy3
 X4hcHHU/5LGLoGMxOJQmuBfKVM8RAgikq6w2RfttVMLeKCknKtJ29OnotKilvILh
 W8nAOGxRnkmONFfHakNJtLl5tQJ4FQXc2cG8OeIIhHgheJjUxL72/zv8bBxOo7rY
 jUBjtZ5riQXc/ck4FEGI
 =9+Jh
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio

Pull vfio updates from Alex Williamson:
 "Largely hugepage support for vfio/type1 iommu and surrounding cleanups
  and fixes"

* tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Fix leak on error path
  vfio: Limit group opens
  vfio/type1: Fix missed frees and zero sized removes
  vfio: fix documentation
  vfio: Provide module option to disable vfio_iommu_type1 hugepage support
  vfio: hugepage support for vfio_iommu_type1
  vfio: Convert type1 iommu to use rbtree
2013-07-10 14:50:08 -07:00
Linus Torvalds 65b97fb730 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Ben Herrenschmidt:
 "This is the powerpc changes for the 3.11 merge window.  In addition to
  the usual bug fixes and small updates, the main highlights are:

   - Support for transparent huge pages by Aneesh Kumar for 64-bit
     server processors.  This allows the use of 16M pages as transparent
     huge pages on kernels compiled with a 64K base page size.

   - Base VFIO support for KVM on power by Alexey Kardashevskiy

   - Wiring up of our nvram to the pstore infrastructure, including
     putting compressed oopses in there by Aruna Balakrishnaiah

   - Move, rework and improve our "EEH" (basically PCI error handling
     and recovery) infrastructure.  It is no longer specific to pseries
     but is now usable by the new "powernv" platform as well (no
     hypervisor) by Gavin Shan.

   - I fixed some bugs in our math-emu instruction decoding and made it
     usable to emulate some optional FP instructions on processors with
     hard FP that lack them (such as fsqrt on Freescale embedded
     processors).

   - Support for Power8 "Event Based Branch" facility by Michael
     Ellerman.  This facility allows what is basically "userspace
     interrupts" for performance monitor events.

   - A bunch of Transactional Memory vs.  Signals bug fixes and HW
     breakpoint/watchpoint fixes by Michael Neuling.

  And more ...  I appologize in advance if I've failed to highlight
  something that somebody deemed worth it."

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
  pstore: Add hsize argument in write_buf call of pstore_ftrace_call
  powerpc/fsl: add MPIC timer wakeup support
  powerpc/mpic: create mpic subsystem object
  powerpc/mpic: add global timer support
  powerpc/mpic: add irq_set_wake support
  powerpc/85xx: enable coreint for all the 64bit boards
  powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
  powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
  powerpc/mpic: Add get_version API both for internal and external use
  powerpc: Handle both new style and old style reserve maps
  powerpc/hw_brk: Fix off by one error when validating DAWR region end
  powerpc/pseries: Support compression of oops text via pstore
  powerpc/pseries: Re-organise the oops compression code
  pstore: Pass header size in the pstore write callback
  powerpc/powernv: Fix iommu initialization again
  powerpc/pseries: Inform the hypervisor we are using EBB regs
  powerpc/perf: Add power8 EBB support
  powerpc/perf: Core EBB support for 64-bit book3s
  powerpc/perf: Drop MMCRA from thread_struct
  powerpc/perf: Don't enable if we have zero events
  ...
2013-07-04 10:29:23 -07:00
Alex Williamson 8d38ef1948 vfio/type1: Fix leak on error path
We also don't handle unpinning zero pages as an error on other exits
so we can fix that inconsistency by rolling in the next conditional
return.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-01 08:28:58 -06:00
Al Viro a47df1518e vfio: remap_pfn_range() sets all those flags...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:41 +04:00
Alex Williamson 6d6768c61b vfio: Limit group opens
vfio_group_fops_open attempts to limit concurrent sessions by
disallowing opens once group->container is set.  This really doesn't
do what we want and allow for inconsistent behavior, for instance a
group can be opened twice, then a container set giving the user two
file descriptors to the group.  But then it won't allow more to be
opened.  There's not much reason to have the group opened multiple
times since most access is through devices or the container, so
complete what the original code intended and only allow a single
instance.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-25 16:06:54 -06:00
Alex Williamson f5bfdbf252 vfio/type1: Fix missed frees and zero sized removes
With hugepage support we can only properly aligned and sized ranges.
We only guarantee that we can unmap the same ranges mapped and not
arbitrary sub-ranges.  This means we might not free anything or might
free more than requested.  The vfio unmap interface started storing
the unmapped size to return to userspace to handle this.  This patch
fixes a few places where we don't properly handle those cases, moves
a memory allocation to a place where failure is an option and checks
our loops to make sure we don't get into an infinite loop trying to
remove an overlap.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-25 16:01:44 -06:00
Alex Williamson 5c6c2b21ec vfio: Provide module option to disable vfio_iommu_type1 hugepage support
Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support.  This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs.  If enabled
dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-21 09:38:11 -06:00
Alex Williamson 166fd7d94a vfio: hugepage support for vfio_iommu_type1
We currently send all mappings to the iommu in PAGE_SIZE chunks,
which prevents the iommu from enabling support for larger page sizes.
We still need to pin pages, which means we step through them in
PAGE_SIZE chunks, but we can batch up contiguous physical memory
chunks to allow the iommu the opportunity to use larger pages.  The
approach here is a bit different that the one currently used for
legacy KVM device assignment.  Rather than looking at the vma page
size and using that as the maximum size to pass to the iommu, we
instead simply look at whether the next page is physically
contiguous.  This means we might ask the iommu to map a 4MB region,
while legacy KVM might limit itself to a maximum of 2MB.

Splitting our mapping path also allows us to be smarter about locked
memory because we can more easily unwind if the user attempts to
exceed the limit.  Therefore, rather than assuming that a mapping
will result in locked memory, we test each page as it is pinned to
determine whether it locks RAM vs an mmap'd MMIO region.  This should
result in better locking granularity and less locked page fudge
factors in userspace.

The unmap path uses the same algorithm as legacy KVM.  We don't want
to track the pfn for each mapping ourselves, but we need the pfn in
order to unpin pages.  We therefore ask the iommu for the iova to
physical address translation, ask it to unpin a page, and see how many
pages were actually unpinned.  iommus supporting large pages will
often return something bigger than a page here, which we know will be
physically contiguous and we can unpin a batch of pfns.  iommus not
supporting large mappings won't see an improvement in batching here as
they only unmap a page at a time.

With this change, we also make a clarification to the API for mapping
and unmapping DMA.  We can only guarantee unmaps at the same
granularity as used for the original mapping.  In other words,
unmapping a subregion of a previous mapping is not guaranteed and may
result in a larger or smaller unmapping than requested.  The size
field in the unmapping structure is updated to reflect this.
Previously this was unmodified on mapping, always returning the the
requested unmap size.  This is now updated to return the actual unmap
size on success, allowing userspace to appropriately track mappings.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-21 09:38:02 -06:00
Alex Williamson cd9b22685e vfio: Convert type1 iommu to use rbtree
We need to keep track of all the DMA mappings of an iommu container so
that it can be automatically unmapped when the user releases the file
descriptor.  We currently do this using a simple list, where we merge
entries with contiguous iovas and virtual addresses.  Using a tree for
this is a bit more efficient and allows us to use common code instead
of inventing our own.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-21 09:37:50 -06:00
Alexey Kardashevskiy 5b25199eff powerpc/vfio: Enable on pSeries platform
The enables VFIO on the pSeries platform, enabling user space
programs to access PCI devices directly.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 16:55:15 +10:00
Alexey Kardashevskiy 5ffd229c02 powerpc/vfio: Implement IOMMU driver for VFIO
VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform dependent part includes IOMMU initialization
and handling.  This implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWER
guest).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20 16:55:14 +10:00
Alexey Kardashevskiy 9a6aa279d3 vfio: fix crash on rmmod
devtmpfs_delete_node() calls devnode() callback with mode==NULL but
vfio still tries to write there.

The patch fixes this.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-05 08:54:16 -06:00
Linus Torvalds 0b2e3b6bb4 vfio updates for v3.10
Changes include extension to support PCI AER notification to userspace, byte granularity of PCI config space and access to unarchitected PCI config space, better protection around IOMMU driver accesses, default file mode fix, and a few misc cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRgox9AAoJECObm247sIsizJwP/23KprR6BuRt168Obo+xC/lp
 kj1E4Hd4XHx6T/5XRexa4GqhmyLqgoqbS19uj/K6ebY9rouVKo0V6OYNFDQiFR/F
 yEGjYIBKV9/eBZMQI4RbZpZKGDXTeFrXyjsac1m51JrR3st8zsvZlNPE/TjzhSfz
 jMnsCC99ZwIFjmptw/yWwgInswy1n9H9iICS14Xn1x7v71iyOE32reTG6M9HsPHe
 Xm5F0K5iLQX8qoISHAomrTnFmCLxp5Y2qhh7nZWmC2gCbPBEnGdqx4prwzJayAaC
 DAN8osaYldoKQuwI4wFYMICHxxCEFk8xU58FeKCmeQJZgomefVAoRKf86gAEsSLR
 XHptBQOktWVKNFhzZWyOuAX3iua/hmWcOa05I6inycV1x2ZAGlNhvJnj1iKIphLk
 +neBFAD9wDcAJZdXkfudq0jJ9jJTSAWBv8d+hozNfkjTVhF+wRnhZ7V6dZfb9+W6
 kPGbgwqKApLoOabQbxbZ5Ftr6S7344prB/HAMywGa+xZfoxoYPxBHCi0rSTvUTWy
 oWLtzrKRbjBDc0qF4eEK9RZlmVt2ZmCUnUB3kbWED4mrJAHu4LlFghnAloePrIZ3
 kXlMARWbyw/E+1WIMO4Ito5+1s3zVsJJgzVsAQDJkTQw2aYDhns73/Y25Dj7x7QS
 BKebrXnh2xVCN6OIu+eX
 =F0Cc
 -----END PGP SIGNATURE-----

Merge tag 'vfio-for-v3.10' of git://github.com/awilliam/linux-vfio

Pull vfio updates from Alex Williamson:
 "Changes include extension to support PCI AER notification to
  userspace, byte granularity of PCI config space and access to
  unarchitected PCI config space, better protection around IOMMU driver
  accesses, default file mode fix, and a few misc cleanups."

* tag 'vfio-for-v3.10' of git://github.com/awilliam/linux-vfio:
  vfio: Set container device mode
  vfio: Use down_reads to protect iommu disconnects
  vfio: Convert container->group_lock to rwsem
  PCI/VFIO: use pcie_flags_reg instead of access PCI-E Capabilities Register
  vfio-pci: Enable raw access to unassigned config space
  vfio-pci: Use byte granularity in config map
  vfio: make local function vfio_pci_intx_unmask_handler() static
  VFIO-AER: Vfio-pci driver changes for supporting AER
  VFIO: Wrapper for getting reference to vfio_device
2013-05-02 14:02:32 -07:00
Alex Williamson 664e9386bd vfio: Set container device mode
Minor 0 is the VFIO container device (/dev/vfio/vfio).  On it's own
the container does not provide a user with any privileged access.  It
only supports API version check and extension check ioctls.  Only by
attaching a VFIO group to the container does it gain any access.  Set
the mode of the container to allow access.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-04-30 15:42:28 -06:00
Linus Torvalds 96a3e8af5a PCI changes for the v3.10 merge window:
PCI device hotplug
     - Remove ACPI PCI subdrivers (Jiang Liu, Myron Stowe)
     - Make acpiphp builtin only, not modular (Jiang Liu)
     - Add acpiphp mutual exclusion (Jiang Liu)
 
   Power management
     - Skip "PME enabled/disabled" messages when not supported (Rafael Wysocki)
     - Fix fallback to PCI_D0 (Rafael Wysocki)
 
   Miscellaneous
     - Factor quirk_io_region (Yinghai Lu)
     - Cache MSI capability offsets & cleanup (Gavin Shan, Bjorn Helgaas)
     - Clean up EISA resource initialization and logging (Bjorn Helgaas)
     - Fix prototype warnings (Andy Shevchenko, Bjorn Helgaas)
     - MIPS: Initialize of_node before scanning bus (Gabor Juhos)
     - Fix pcibios_get_phb_of_node() declaration "weak" annotation (Gabor Juhos)
     - Add MSI INTX_DISABLE quirks for AR8161/AR8162/etc (Xiong Huang)
     - Fix aer_inject return values (Prarit Bhargava)
     - Remove PME/ACPI dependency (Andrew Murray)
     - Use shared PCI_BUS_NUM() and PCI_DEVID() (Shuah Khan)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRfhSWAAoJEFmIoMA60/r8GrYQAIHDsyZIuJSf6g+8Td1h+PIC
 YD3wQhbyrDqQDuKU4+9cz+JsbHmnozUGA4UmlwmOGBxEa/Uauspb6yX1P1+x9Ok1
 WD7Ar3BlA5OuYI/1L1mgCiA428MTujwoR4fPnC0+KFy8xk1tBpmhzzeOFohbKyFF
 hMBO/Xt9tCzPATJ1LhjIH4xAykfDkbnPNHNcUKRoAkRo0CO0lS8gcTk0shXXSNng
 p9kQ6c4cYZvlRIJTwlawWV09nr7mDsBYa3JClqXYZufUWfEwvIuhisJxCJ57sWi9
 t+Ev8dm7VM6Cr5dV+ORArlboBFrq4f/W5U9j9GPFrRplwf+WbNT6tNGSpSDq8XhU
 Q7JjNgPWVdWXe1vIsMwaO49zi45/bNehuCSFLZiyPZwedMk764tys+iYw+tMRtv1
 tBR7lwESSXfagmvWyQAuQOTy6Rj26BPd2T8e2lMsvsuQO9mCyTK6Ey3YyKuqKQK/
 l5Gns4vv4eaCjGXqqDGiydUjSes+r/v1bu43XiRnwPQJUKb5kr5SjN5/zSMBuUgm
 TLT/bnv8qvdFxCpVQJFv4k/uzULARMdbvLtTy8osB14vNHX9jPn+xORjLaZNiO6O
 7fFispMU8Om56hNkD6C451r3icRjjGlD7OA8KOlbZ8f876sLzGV9i6P9gwCoRdEB
 wclDPsN7kAzw/V2sEE60
 =bj8i
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes for the v3.10 merge window:

  PCI device hotplug
   - Remove ACPI PCI subdrivers (Jiang Liu, Myron Stowe)
   - Make acpiphp builtin only, not modular (Jiang Liu)
   - Add acpiphp mutual exclusion (Jiang Liu)

  Power management
   - Skip "PME enabled/disabled" messages when not supported (Rafael
     Wysocki)
   - Fix fallback to PCI_D0 (Rafael Wysocki)

  Miscellaneous
   - Factor quirk_io_region (Yinghai Lu)
   - Cache MSI capability offsets & cleanup (Gavin Shan, Bjorn Helgaas)
   - Clean up EISA resource initialization and logging (Bjorn Helgaas)
   - Fix prototype warnings (Andy Shevchenko, Bjorn Helgaas)
   - MIPS: Initialize of_node before scanning bus (Gabor Juhos)
   - Fix pcibios_get_phb_of_node() declaration "weak" annotation (Gabor
     Juhos)
   - Add MSI INTX_DISABLE quirks for AR8161/AR8162/etc (Xiong Huang)
   - Fix aer_inject return values (Prarit Bhargava)
   - Remove PME/ACPI dependency (Andrew Murray)
   - Use shared PCI_BUS_NUM() and PCI_DEVID() (Shuah Khan)"

* tag 'pci-v3.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (63 commits)
  vfio-pci: Use cached MSI/MSI-X capabilities
  vfio-pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
  PCI: Remove "extern" from function declarations
  PCI: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
  PCI: Drop msi_mask_reg() and remove drivers/pci/msi.h
  PCI: Use msix_table_size() directly, drop multi_msix_capable()
  PCI: Drop msix_table_offset_reg() and msix_pba_offset_reg() macros
  PCI: Drop is_64bit_address() and is_mask_bit_support() macros
  PCI: Drop msi_data_reg() macro
  PCI: Drop msi_lower_address_reg() and msi_upper_address_reg() macros
  PCI: Drop msi_control_reg() macro and use PCI_MSI_FLAGS directly
  PCI: Use cached MSI/MSI-X offsets from dev, not from msi_desc
  PCI: Clean up MSI/MSI-X capability #defines
  PCI: Use cached MSI-X cap while enabling MSI-X
  PCI: Use cached MSI cap while enabling MSI interrupts
  PCI: Remove MSI/MSI-X cap check in pci_msi_check_device()
  PCI: Cache MSI/MSI-X capability offsets in struct pci_dev
  PCI: Use u8, not int, for PM capability offset
  [SCSI] megaraid_sas: Use correct #define for MSI-X capability
  PCI: Remove "extern" from function declarations
  ...
2013-04-29 09:30:25 -07:00
Alex Williamson 0b43c08233 vfio: Use down_reads to protect iommu disconnects
If a group or device is released or a container is unset from a group
it can race against file ops on the container.  Protect these with
down_reads to allow concurrent users.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-29 08:41:36 -06:00
Alex Williamson 9587f44aa6 vfio: Convert container->group_lock to rwsem
All current users are writers, maintaining current mutual exclusion.
This lets us add read users next.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-04-25 16:12:38 -06:00
Bjorn Helgaas a9047f24df vfio-pci: Use cached MSI/MSI-X capabilities
We now cache the MSI/MSI-X capability offsets in the struct pci_dev,
so no need to find the capabilities again.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
2013-04-24 11:35:56 -06:00
Bjorn Helgaas 508d1aa602 vfio-pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
PCI_MSIX_FLAGS_BIRMASK is mis-named because the BIR mask is in the
Table Offset register, not the flags ("Message Control" per spec)
register.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
2013-04-24 11:35:56 -06:00
Yijing Wang aa2cba51a0 PCI/VFIO: use pcie_flags_reg instead of access PCI-E Capabilities Register
Currently, we use pcie_flags_reg to cache PCI-E Capabilities Register,
because PCI-E Capabilities Register bits are almost read-only. This patch
use pcie_caps_reg() instead of another access PCI-E Capabilities Register.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-04-15 08:45:10 -06:00
Alex Williamson a7d1ea1c11 vfio-pci: Enable raw access to unassigned config space
Devices like be2net hide registers between the gaps in capabilities
and architected regions of PCI config space.  Our choices to support
such devices is to either build an ever growing and unmanageable white
list or rely on hardware isolation to protect us.  These registers are
really no different than MMIO or I/O port space registers, which we
don't attempt to regulate, so treat PCI config space in the same way.

Reported-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2013-04-01 09:04:12 -06:00
Alex Williamson 180b138107 vfio-pci: Use byte granularity in config map
The config map previously used a byte per dword to map regions of
config space to capabilities.  Modulo a bug where we round the length
of capabilities down instead of up, this theoretically works well and
saves space so long as devices don't try to hide registers in the gaps
between capabilities.  Unfortunately they do exactly that so we need
byte granularity on our config space map.  Increase the allocation of
the config map and split accesses at capability region boundaries.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2013-04-01 09:03:44 -06:00
Alex Williamson 904c680c7b vfio-pci: Fix possible integer overflow
The VFIO_DEVICE_SET_IRQS ioctl takes a start and count parameter, both
of which are unsigned.  We attempt to bounds check these, but fail to
account for the case where start is a very large number, allowing
start + count to wrap back into the valid range.  Bounds check both
start and start + count.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-03-26 11:33:16 -06:00
Wei Yongjun 0bced2f728 vfio: make local function vfio_pci_intx_unmask_handler() static
vfio_pci_intx_unmask_handler() was not declared. It should be static.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-03-25 11:15:44 -06:00
Arnd Bergmann 25e9789ddd vfio: include <linux/slab.h> for kmalloc
The vfio drivers call kmalloc or kzalloc, but do not
include <linux/slab.h>, which causes build errors on
ARM.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org
2013-03-15 12:58:20 -06:00
Vijay Mohan Pandarathil dad9f8972e VFIO-AER: Vfio-pci driver changes for supporting AER
- New VFIO_SET_IRQ ioctl option to pass the eventfd that is signaled when
  an error occurs in the vfio_pci_device

- Register pci_error_handler for the vfio_pci driver

- When the device encounters an error, the error handler registered by
  the vfio_pci driver gets invoked by the AER infrastructure

- In the error handler, signal the eventfd registered for the device.

- This results in the qemu eventfd handler getting invoked and
  appropriate action taken for the guest.

Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-03-11 09:31:22 -06:00
Vijay Mohan Pandarathil 44f507163d VFIO: Wrapper for getting reference to vfio_device
- Added vfio_device_get_from_dev() as wrapper to get
  reference to vfio_device from struct device.

- Added vfio_device_data() as a wrapper to get device_data from
  vfio_device.

Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-03-11 09:28:44 -06:00
Tejun Heo a1c36b166b vfio: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:19 -08:00