Commit graph

464 commits

Author SHA1 Message Date
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds 4d5b57e05a Updates for 4.10 kernel merge window
- Shared mlx5 updates with net stack (will drop out on merge if Dave's
   tree has already been merged)
 - Driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe
 - Debug cleanups
 - New connection rejection helpers
 - SRP updates
 - Various misc fixes
 - New paravirt driver from vmware
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYUbAPAAoJELgmozMOVy/dMXcP/iuG5MNzfN8Ny1JftyBQGWg3
 cqoQ2OLj9CsXjwVB+5EqbcZHRZY852lKONaLoDKkIOx4YAXO2YuIKOp944vN7EQx
 96wfqzT1F5jzAcy5mYZXgLaStGFDAwejKMqeHd0LfJj3OEtemGnVPWYzyqSQmSKo
 dzJraS1Z9GIRppzU5WaRpB9PtRBkqIqGJ5vZ0EKLGhed5hYY5r0iMJB0GfriMRDO
 lJ4UUVfpsAoLPnqDBFH6IMn2V2UeAw9IR5zNa1mrM1RBfvt/uYTxrw1w3p9WoaNs
 GRodhk4DCeAfeyqzVPNBLyXZ4Zq4FzGe3UWM4qysJ1RR4oFNw9Cuw0Fqk8mrfznr
 7hv5TpGIckRZiKf8l6e+qLirF0qGtXJg29j2vPVQI9i5nSj95g1agA81PnLQlLLb
 flWyxeMj81my7lfMHN1xcV6pqPEKMCOysZmfcvVfJd2XxpjuVD7ekl/YXWp8o8kU
 YPdQMqPD626XsD8VpPdMszb9FPmx0JD0HEv+Y1rIFX8JegEI+c3H2X0dqC27T/Ou
 FEPWOy025EgHm0Fh/7eIzkG6tjZ4JHoCugJAcxNZGj2XW4eB6r5vY8UwJ8iQRv+n
 PVYHiy0UoIRePh0mrdOSSphGZMi/GO/DsqKwCtAMEK43WqZQju6wR7QSIGkh66mp
 4uSHJqpf3YEYylxGMhk3
 =QeGy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is the complete update for the rdma stack for this release cycle.

  Most of it is typical driver and core updates, but there is the
  entirely new VMWare pvrdma driver. You may have noticed that there
  were changes in DaveM's pull request to the bnxt Ethernet driver to
  support a RoCE RDMA driver. The bnxt_re driver was tentatively set to
  be pulled in this release cycle, but it simply wasn't ready in time
  and was dropped (a few review comments still to address, and some
  multi-arch build issues like prefetch() not working across all
  arches).

  Summary:

   - shared mlx5 updates with net stack (will drop out on merge if
     Dave's tree has already been merged)

   - driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe

   - debug cleanups

   - new connection rejection helpers

   - SRP updates

   - various misc fixes

   - new paravirt driver from vmware"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits)
  IB: Add vmw_pvrdma driver
  IB/mlx4: fix improper return value
  IB/ocrdma: fix bad initialization
  infiniband: nes: return value of skb_linearize should be handled
  MAINTAINERS: Update Intel RDMA RNIC driver maintainers
  MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers
  IB/core: fix unmap_sg argument
  qede: fix general protection fault may occur on probe
  IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc
  mlx5, calc_sq_size(): Make a debug message more informative
  mlx5: Remove a set-but-not-used variable
  mlx5: Use { } instead of { 0 } to init struct
  IB/srp: Make writing the add_target sysfs attr interruptible
  IB/srp: Make mapping failures easier to debug
  IB/srp: Make login failures easier to debug
  IB/srp: Introduce a local variable in srp_add_one()
  IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build
  IB/multicast: Check ib_find_pkey() return value
  IPoIB: Avoid reading an uninitialized member variable
  IB/mad: Fix an array index check
  ...
2016-12-15 12:03:32 -08:00
Doug Ledford 9032ad78bb Merge branches 'misc', 'qedr', 'reject-helpers', 'rxe' and 'srp' into merge-test 2016-12-14 14:44:47 -05:00
Doug Ledford 86ef0beaa0 Merge branch 'mlx' into merge-test 2016-12-14 14:44:25 -05:00
Bart Van Assche 11b642b84e IPoIB: Avoid reading an uninitialized member variable
This patch avoids that Coverity reports the following:

    Using uninitialized value port_attr.state when calling printk

Fixes: commit 94232d9ce8 ("IPoIB: Start multicast join process only on active ports")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 13:27:34 -05:00
Leon Romanovsky 74226649f4 IB/ipoib: Remove and fix debug prints after allocation failure
The prints after [k|v][m|z|c]alloc() functions are not needed,
because in case of failure, allocator will print their internal
error prints anyway.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-03 13:12:52 -05:00
Kamal Heib 0b59970e7d IB/IPoIB: Remove can't use GFP_NOIO warning
Remove the warning print of "can't use of GFP_NOIO" to avoid prints in
each QP creation when devices aren't supporting IB_QP_CREATE_USE_GFP_NOIO.

This print become more annoying when the IPoIB interface is configured
to work in connected mode.

Fixes: 09b93088d7 ('IB: Add a QP creation flag to use GFP_NOIO allocations')
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-16 20:04:48 -05:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Jarod Wilson b3e3893e12 net: use core MTU range checking in misc drivers
firewire-net:
- set min/max_mtu
- remove fwnet_change_mtu

nes:
- set max_mtu
- clean up nes_netdev_change_mtu

xpnet:
- set min/max_mtu
- remove xpnet_dev_change_mtu

hippi:
- set min/max_mtu
- remove hippi_change_mtu

batman-adv:
- set max_mtu
- remove batadv_interface_change_mtu
- initialization is a little async, not 100% certain that max_mtu is set
  in the optimal place, don't have hardware to test with

rionet:
- set min/max_mtu
- remove rionet_change_mtu

slip:
- set min/max_mtu
- streamline sl_change_mtu

um/net_kern:
- remove pointless ndo_change_mtu

hsi/clients/ssi_protocol:
- use core MTU range checking
- remove now redundant ssip_pn_set_mtu

ipoib:
- set a default max MTU value
- Note: ipoib's actual max MTU can vary, depending on if the device is in
  connected mode or not, so we'll just set the max_mtu value to the max
  possible, and let the ndo_change_mtu function continue to validate any new
  MTU change requests with checks for CM or not. Note that ipoib has no
  min_mtu set, and thus, the network core's mtu > 0 check is the only lower
  bounds here.

mptlan:
- use net core MTU range checking
- remove now redundant mpt_lan_change_mtu

fddi:
- min_mtu = 21, max_mtu = 4470
- remove now redundant fddi_change_mtu (including export)

fjes:
- min_mtu = 8192, max_mtu = 65536
- The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
  get past the core net MTU range checks so fjes_change_mtu can validate a
  new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)

hsr:
- min_mtu = 0 (calls ether_setup, max_mtu is 1500)

f_phonet:
- min_mtu = 6, max_mtu = 65541

u_ether:
- min_mtu = 14, max_mtu = 15412

phonet/pep-gprs:
- min_mtu = 576, max_mtu = 65530
- remove redundant gprs_set_mtu

CC: netdev@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
CC: Faisal Latif <faisal.latif@intel.com>
CC: linux-rdma@vger.kernel.org
CC: Cliff Whickman <cpw@sgi.com>
CC: Robin Holt <robinmholt@gmail.com>
CC: Jes Sorensen <jes@trained-monkey.org>
CC: Marek Lindner <mareklindner@neomailbox.ch>
CC: Simon Wunderlich <sw@simonwunderlich.de>
CC: Antonio Quartulli <a@unstable.cc>
CC: Sathya Prakash <sathya.prakash@broadcom.com>
CC: Chaitra P B <chaitra.basappa@broadcom.com>
CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
CC: MPT-FusionLinux.pdl@broadcom.com
CC: Sebastian Reichel <sre@kernel.org>
CC: Felipe Balbi <balbi@kernel.org>
CC: Arvid Brodin <arvid.brodin@alten.se>
CC: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:10 -04:00
David Ahern e0e79c8e74 IB/ipoib: Flip to new dev walk API
Convert ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a code conversion; no functional change is intended.

v2
- removed typecast of data

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-18 11:44:59 -04:00
Paolo Abeni fc791b6335 IB/ipoib: move back IB LL address into the hard header
After the commit 9207f9d45b ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroy the IPoIB address information cached there,
causing a severe performance regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control lock into the IPoIB hard header, as done before
the commit 936d7de3d7 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issue, on packet reception, the IPoIB driver
stash into the skb a dummy pseudo header, so that the received
packets have actually a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.

After this commit, IPoIB performances are back to pre-regression
value.

v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size

Fixes: 9207f9d45b ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14 10:54:53 -04:00
Linus Torvalds b9044ac829 Merge of primary rdma-core code for 4.9
- Updates to mlx5
 - Updates to mlx4 (two conflicts, both minor and easily resolved)
 - Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
   resolution is to keep the code in cxgb4_main.c as it is in Linus'
   tree as attach_uld was refactored and moved into cxgb4_uld.c)
 - Improvements to uAPI (moved vendor specific API elements to uAPI area)
 - Add hns-roce driver and hns and hns-roce ACPI reset support
 - Conversion of all rdma code away from deprecated
   create_singlethread_workqueue
 - Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
   staging)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
 JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
 tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
 ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
 OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
 aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
 BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
 0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
 Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
 UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
 Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
 JSrbr3CGyoqyYj6qaEO5
 =LC+S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull main rdma updates from Doug Ledford:
 "This is the main pull request for the rdma stack this release.  The
  code has been through 0day and I had it tagged for linux-next testing
  for a couple days.

  Summary:

   - updates to mlx5

   - updates to mlx4 (two conflicts, both minor and easily resolved)

   - updates to iw_cxgb4 (one conflict, not so obvious to resolve,
     proper resolution is to keep the code in cxgb4_main.c as it is in
     Linus' tree as attach_uld was refactored and moved into
     cxgb4_uld.c)

   - improvements to uAPI (moved vendor specific API elements to uAPI
     area)

   - add hns-roce driver and hns and hns-roce ACPI reset support

   - conversion of all rdma code away from deprecated
     create_singlethread_workqueue

   - security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
     staging)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
  staging/lustre: Disable InfiniBand support
  iw_cxgb4: add fast-path for small REG_MR operations
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  IB/core: correctly handle rdma_rw_init_mrs() failure
  IB/srp: Fix infinite loop when FMR sg[0].offset != 0
  IB/srp: Remove an unused argument
  IB/core: Improve ib_map_mr_sg() documentation
  IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
  IB/mthca: Move user vendor structures
  IB/nes: Move user vendor structures
  IB/ocrdma: Move user vendor structures
  IB/mlx4: Move user vendor structures
  IB/cxgb4: Move user vendor structures
  IB/cxgb3: Move user vendor structures
  IB/mlx5: Move and decouple user vendor structures
  IB/{core,hw}: Add constant for node_desc
  ipoib: Make ipoib_warn ratelimited
  IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
  IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
  IB/ipoib: Remove deprecated create_singlethread_workqueue
  ...
2016-10-09 17:04:33 -07:00
kernel@kyup.com 32f7451d1c ipoib: Make ipoib_warn ratelimited
In certain cases it's possible to be flooded by warning messages. To
cope with such situations make the ipoib_warn macro be ratelimited.
To prevent accidental limiting of legitimate, bursty messages make
the limit fairly liberal by allowing up to 100 messages in 10 seconds.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-07 16:54:33 -04:00
Bhaktipriya Shridhar b4541f6f88 IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.

The workqueue "wq" queues mulitple work items viz &priv->restart_task,
&priv->cm.rx_reap_task, &priv->cm.skb_task, &priv->neigh_reap_task,
&priv->ah_reap_task, &priv->mcast_task and &priv->carrier_on_task.
The work items require strict execution ordering.
Hence, an ordered dedicated workqueue has been used.

WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-07 16:54:32 -04:00
Bhaktipriya Shridhar 855cda6828 IB/ipoib: Remove deprecated create_singlethread_workqueue
alloc_ordered_workqueue() replaces deprecated
create_singlethread_workqueue().

The workqueue "ipoib_workqueue" that is used for all flush operations
for the device.

WQ_MEM_RECLAIM has been set since the flush operations may need to
complete in order for other network functions to continue, and
the memory reclaim operation might need the network functioning in
order to make progress.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-07 16:54:32 -04:00
Christoph Hellwig ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00
Alex Vesker 344bacca8c IB/ipoib: Don't allow MC joins during light MC flush
This fix solves a race between light flush and on the fly joins.
Light flush doesn't set the device to down and unset IPOIB_OPER_UP
flag, this means that if while flushing we have a MC join in progress
and the QP was attached to BC MGID we can have a mismatches when
re-attaching a QP to the BC MGID.

The light flush would set the broadcast group to NULL causing an on
the fly join to rejoin and reattach to the BC MCG as well as adding
the BC MGID to the multicast list. The flush process would later on
remove the BC MGID and detach it from the QP. On the next flush
the BC MGID is present in the multicast list but not found when trying
to detach it because of the previous double attach and single detach.

[18332.714265] ------------[ cut here ]------------
[18332.717775] WARNING: CPU: 6 PID: 3767 at drivers/infiniband/core/verbs.c:280 ib_dealloc_pd+0xff/0x120 [ib_core]
...
[18332.775198] Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
[18332.779411]  0000000000000000 ffff8800b50dfbb0 ffffffff813fed47 0000000000000000
[18332.784960]  0000000000000000 ffff8800b50dfbf0 ffffffff8109add1 0000011832f58300
[18332.790547]  ffff880226a596c0 ffff880032482000 ffff880032482830 ffff880226a59280
[18332.796199] Call Trace:
[18332.798015]  [<ffffffff813fed47>] dump_stack+0x63/0x8c
[18332.801831]  [<ffffffff8109add1>] __warn+0xd1/0xf0
[18332.805403]  [<ffffffff8109aebd>] warn_slowpath_null+0x1d/0x20
[18332.809706]  [<ffffffffa025d90f>] ib_dealloc_pd+0xff/0x120 [ib_core]
[18332.814384]  [<ffffffffa04f3d7c>] ipoib_transport_dev_cleanup+0xfc/0x1d0 [ib_ipoib]
[18332.820031]  [<ffffffffa04ed648>] ipoib_ib_dev_cleanup+0x98/0x110 [ib_ipoib]
[18332.825220]  [<ffffffffa04e62c8>] ipoib_dev_cleanup+0x2d8/0x550 [ib_ipoib]
[18332.830290]  [<ffffffffa04e656f>] ipoib_uninit+0x2f/0x40 [ib_ipoib]
[18332.834911]  [<ffffffff81772a8a>] rollback_registered_many+0x1aa/0x2c0
[18332.839741]  [<ffffffff81772bd1>] rollback_registered+0x31/0x40
[18332.844091]  [<ffffffff81773b18>] unregister_netdevice_queue+0x48/0x80
[18332.848880]  [<ffffffffa04f489b>] ipoib_vlan_delete+0x1fb/0x290 [ib_ipoib]
[18332.853848]  [<ffffffffa04df1cd>] delete_child+0x7d/0xf0 [ib_ipoib]
[18332.858474]  [<ffffffff81520c08>] dev_attr_store+0x18/0x30
[18332.862510]  [<ffffffff8127fe4a>] sysfs_kf_write+0x3a/0x50
[18332.866349]  [<ffffffff8127f4e0>] kernfs_fop_write+0x120/0x170
[18332.870471]  [<ffffffff81207198>] __vfs_write+0x28/0xe0
[18332.874152]  [<ffffffff810e09bf>] ? percpu_down_read+0x1f/0x50
[18332.878274]  [<ffffffff81208062>] vfs_write+0xa2/0x1a0
[18332.881896]  [<ffffffff812093a6>] SyS_write+0x46/0xa0
[18332.885632]  [<ffffffff810039b7>] do_syscall_64+0x57/0xb0
[18332.889709]  [<ffffffff81883321>] entry_SYSCALL64_slow_path+0x25/0x25
[18332.894727] ---[ end trace 09ebbe31f831ef17 ]---

Fixes: ee1e2c82c2 ("IPoIB: Refresh paths instead of flushing them on SM change events")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-16 14:14:08 -04:00
Erez Shitrit 546481c281 IB/ipoib: Fix memory corruption in ipoib cm mode connect flow
When a new CM connection is being requested, ipoib driver copies data
from the path pointer in the CM/tx object, the path object might be
invalid at the point and memory corruption will happened later when now
the CM driver will try using that data.

The next scenario demonstrates it:
	neigh_add_path --> ipoib_cm_create_tx -->
	queue_work (pointer to path is in the cm/tx struct)
	#while the work is still in the queue,
	#the port goes down and causes the ipoib_flush_paths:
	ipoib_flush_paths --> path_free --> kfree(path)
	#at this point the work scheduled starts.
	ipoib_cm_tx_start --> copy from the (invalid)path pointer:
	(memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
	 -> memory corruption.

To fix that the driver now starts the CM/tx connection only if that
specific path exists in the general paths database.
This check is protected with the relevant locks, and uses the gid from
the neigh member in the CM/tx object which is valid according to the ref
count that was taken by the CM/tx.

Fixes: 839fcaba35 ('IPoIB: Connected mode experimental support')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-02 14:07:38 -04:00
Doug Ledford 7f1d25b47d Merge branches 'misc' and 'rxe' into k.o/for-4.8-1 2016-08-04 11:13:47 -04:00
Yuval Shaia 5faba54695 IB/ipoib: Report SG feature regardless of HW UD CSUM capability
Decouple SG support from HW ability to do UD checksum.
This coupling is for historical reasons and removed with 'commit
ec5f061564 ("net: Kill link between CSUM and SG features.")'

During driver load it is assumed that device does not supports SG. The
final decision is taken after creating UD QP based on device capability.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-08-03 21:03:32 -04:00
Ira Weiny 1a8632121a IB/ipoib: Use new device FW version string
Using this allows for devices to specify the format of their
firmware version rather than forcing a format.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-23 12:08:34 -04:00
Erez Shitrit 61c78eea95 IB/IPoIB: Don't update neigh validity for unresolved entries
ipoib_neigh_get unconditionally updates the "alive" variable member on
any packet send.  This prevents the neighbor garbage collection from
cleaning out a dead neighbor entry if we are still queueing packets
for it.  If the queue for this neighbor is full, then don't update the
alive timestamp.  That way the neighbor can time out even if packets
are still being queued as long as none of them are being sent.

Fixes: b63b70d877 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-07 10:49:48 -04:00
Mark Bloch 9b29953bf8 IB/IPoIB: Disable bottom half when dealing with device address
Align locking usage when touching device address with rest
of the kernel. Lock the bottom half when doing so using
netif_addr_lock_bh.

This also solves the following case as reported by lockdep:
	CPU0                    CPU1
	----                    ----
lock(_xmit_INFINIBAND);
				local_irq_disable();
				lock(&(&mc->mca_lock)->rlock);
				lock(_xmit_INFINIBAND);
<Interrupt>
lock(&(&mc->mca_lock)->rlock);

*** DEADLOCK ***

Fixes: 492a7e67ff ("IB/IPoIB: Allow setting the device address")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-07 09:50:54 -04:00
Erez Shitrit 198b12f770 IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions
In ipoib_remove_one the driver holds the rtnl_lock and tries to do some
operation like dev_change_flags or unregister_netdev, while sysfs
callback like ipoib_vlan_delete holds sysfs mutex and tries to hold the
rtnl_lock via rtnl_trylock() and restart_syscall() if the lock is not
free, meanwhile ipoib_remove_one tries to get the sysfs lock in order to
free its sysfs directory, and we will get  a->b, b->a deadlock.

    Trace like the following:

        schedule+0x37/0x80
        schedule_preempt_disabled+0xe/0x10
        __mutex_lock_slowpath+0xb5/0x120
        mutex_lock+0x23/0x40
        rtnl_lock+0x15/0x20
        netdev_run_todo+0x17c/0x320
        rtnl_unlock+0xe/0x10
        ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib]
        delete_child+0x54/0x80 [ib_ipoib]
        dev_attr_store+0x18/0x30
        sysfs_kf_write+0x37/0x40
        mutex_lock+0x16/0x40
        SyS_write+0x55/0xc0
        entry_SYSCALL_64_fastpath+0x16/0x75
    And
        schedule+0x37/0x80
        __kernfs_remove+0x1a8/0x260
        ? wake_atomic_t_function+0x60/0x60
        kernfs_remove+0x25/0x40
        sysfs_remove_dir+0x50/0x80
        kobject_del+0x18/0x50
        device_del+0x19f/0x260
        netdev_unregister_kobject+0x6a/0x80
        rollback_registered_many+0x1fd/0x340
        rollback_registered+0x3c/0x70
        unregister_netdevice_queue+0x55/0xc0
        unregister_netdev+0x20/0x30
        ipoib_remove_one+0x114/0x1b0 [ib_ipoib]
        ib_unregister_client+0x4a/0x170 [ib_core]
        ? find_module_all+0x71/0xa0
        ipoib_cleanup_module+0x10/0x94 [ib_ipoib]
        SyS_delete_module+0x1b5/0x210
        entry_SYSCALL_64_fastpath+0x16/0x75

The fix is by checking the flag IPOIB_FLAG_INTF_ON_DESTROY in order to
get out from the sysfs function.

Fixes: 862096a8bb ("IB/ipoib: Add more rtnl_link_ops callbacks")
Fixes: 9baa0b0364 ("IB/ipoib: Add rtnl_link_ops support")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-06-07 09:50:53 -04:00
Linus Torvalds 1cbe06c3cf Round two of 4.7 merge window patches
- Dynamic counter infrastructure in the IB drivers
   This is a sysfs based code to allow free form access to the hardware
   counters RDMA devices might support so drivers don't need to code
   this up repeatedly themselves
 - SendOnlyFullMember multicast support
 - IB router support
 - A couple misc fixes
 - The big item on the list: hfi1 driver updates, plus moving the hfi1
   driver out of staging
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXRyupAAoJELgmozMOVy/deHAQAJp87yn8hntHXh81Kjrvezch
 McJ0v7FTsfsLVe/Y+YAbaAREjanhnUDzZ36WUqS9rHadOl+Ia0fD+M53aTk9hDKS
 R9/oWQDsUb8k7imdhNrSS72Q2kbwr++8RqBoEEFcl4fuDCVsrRhUaf+perJTEnoK
 I9APDQCNU3FCTPFRllK5al6gi5UGEW9vO9JQZb+VnTVW9+LjlYobzcZfttAw8oRN
 rtQLP3xGb7ZwV0ngH0dt2TthN2ih97TavnISojTdJ5I+WHFxLV0YbHSK/E2JLdoi
 hVknhrflPIKr+mRVv97yCCz92kGAdTKG3sT9vNUVUQ0Y8kEdrW/LrnrmxsBEYzkl
 GcQeqg7//ffOlQ0dGZZjlkPg/8yMPo9CBsfPDcOvXeenK0lkao9qg2yrIz+CBC/9
 YlpUKfL0JLRal9O6y3Mh++ZQ9R1kvQN8/CgHLUlRQdWLfxf7VPA4rIcaWT7iGfvg
 z6jIJKjsK+hqKhY9oKh3YfNmXh65IQ5OF3+tBzHKwNiGXNSsMYAVYP3vc/HYTJvz
 4EpsenndKb5l0AOvDch8rMmw7YZPf2NOdi90TW6Bq1VgTIDRxh0WzR2QTnWBVRSf
 XAcC3DLXTHqC/UUpHrdXgfvU/0nQXaVUvPWIAM/t3hy91GLscByDw4973+tISs/B
 7AHSptbVQaBUpTjWHfKf
 =5+cf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull more rdma updates from Doug Ledford:
 "This is the second group of code for the 4.7 merge window.  It looks
  large, but only in one sense.  I'll get to that in a minute.  The list
  of changes here breaks down as follows:

   - Dynamic counter infrastructure in the IB drivers

     This is a sysfs based code to allow free form access to the
     hardware counters RDMA devices might support so drivers don't need
     to code this up repeatedly themselves

   - SendOnlyFullMember multicast support

   - IB router support

   - A couple misc fixes

   - The big item on the list: hfi1 driver updates, plus moving the hfi1
     driver out of staging

  There was a group of 15 patches in the hfi1 list that I thought I had
  in the first pull request but they weren't.  So that added to the
  length of the hfi1 section here.

  As far as these go, everything but the hfi1 is pretty straight
  forward.

  The hfi1 is, if you recall, the driver that Al had complaints about
  how it used the write/writev interfaces in an overloaded fashion.  The
  write portion of their interface behaved like the write handler in the
  IB stack proper and did bi-directional communications.  The writev
  interface, on the other hand, only accepts SDMA request structures.
  The completions for those structures are sent back via an entirely
  different event mechanism.

  With the security patch, we put security checks on the write
  interface, however, we also knew they would be going away soon.  Now,
  we've converted the write handler in the hfi1 driver to use ioctls
  from the IB reserved magic area for its bidirectional communications.
  With that change, Intel has addressed all of the items originally on
  their TODO when they went into staging (as well as many items added to
  the list later).

  As such, I moved them out, and since they were the last item in the
  staging/rdma directory, and I don't have immediate plans to use the
  staging area again, I removed the staging/rdma area.

  Because of the move out of staging, as well as a series of 5 patches
  in the hfi1 driver that removed code people thought should be done in
  a different way and was optional to begin with (a snoop debug
  interface, an eeprom driver for an eeprom connected directory to their
  hfi1 chip and not via an i2c bus, and a few other things like that),
  the line count, especially the removal count, is high"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (56 commits)
  staging/rdma: Remove the entire rdma subdirectory of staging
  IB/core: Make device counter infrastructure dynamic
  IB/hfi1: Fix pio map initialization
  IB/hfi1: Correct 8051 link parameter settings
  IB/hfi1: Update pkey table properly after link down or FM start
  IB/rdamvt: Fix rdmavt s_ack_queue sizing
  IB/rdmavt: Max atomic value should be a u8
  IB/hfi1: Fix hard lockup due to not using save/restore spin lock
  IB/hfi1: Add tracing support for send with invalidate opcode
  IB/hfi1, qib: Add ieth to the packet header definitions
  IB/hfi1: Move driver out of staging
  IB/hfi1: Do not free hfi1 cdev parent structure early
  IB/hfi1: Add trace message in user IOCTL handling
  IB/hfi1: Remove write(), use ioctl() for user cmds
  IB/hfi1: Add ioctl() interface for user commands
  IB/hfi1: Remove unused user command
  IB/hfi1: Remove snoop/diag interface
  IB/hfi1: Remove EPROM functionality from data device
  IB/hfi1: Remove UI char device
  IB/hfi1: Remove multiple device cdev
  ...
2016-05-28 11:04:16 -07:00
Mark Bloch 492a7e67ff IB/IPoIB: Allow setting the device address
In IB networks, and specifically in IPoIB/rdmacm traffic, the device
address of an IPoIB interface is used as a means to exchange information
between nodes needed for communication.

Currently an IPoIB interface will always be created with a device
address based on its node GUID without a way to change that.

This change adds the ability to set the device address of an IPoIB
interface by value. We use the set mac address ndo to do that.

The flow should be broken down to two:
1) The GID value is already in the GID table,
   in this case the interface will be able to set carrier up.

2) The GID value is not yet in the GID table,
   in this case the interface won't try to join the multicast group
   and will wait (listen on GID_CHANGE event) until the GID is inserted.

In order to track those changes, we add a new flag:
* IPOIB_FLAG_DEV_ADDR_SET.

When set, it means the dev_addr is a based on a value in the gid
table. this bit will be cleared upon a dev_addr change triggered
by the user and set after validation.

Per IB spec the port GUID can't change if the module is loaded.
port GUID is the basis for GID at index 0 which is the basis for
the default device address of a ipoib interface.

The issue is that there are devices that don't follow the spec,
they change the port GUID while HCA is powered on, so in order
not to break userspace applications. We need to check if the
user wanted to control the device address and we assume that
if he sets the device address back to be based on GID index 0,
he no longer wishs to control it.

In order to track this, we add an additional flag:
* IPOIB_FLAG_DEV_ADDR_CTRL

When setting the device address, there is no validation of the upper
twelve bytes of the device address (flags, qpn, subnet prefix) as those
bytes are not under the control of the user.

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-25 15:39:03 -04:00
Erez Shitrit 3b56113016 IB/ipoib: Support SendOnlyFullMember MCG for SendOnly join
Check (via an SA query) if the SM supports the new option for SendOnly
multicast joins.
If the SM supports that option it will use the new join state to create
such multicast group.
If SendOnlyFullMember is supported, we wouldn't use faked FullMember state
join for SendOnly MCG, use the correct state if supported.

This check is performed at every invocation of mcast_restart task, to be
sure that the driver stays in sync with the current state of the SM.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-25 15:39:03 -04:00
Linus Torvalds 76b584d312 Primary 4.7 merge window changes
- Updates to the new Intel X722 iWARP driver
 - Updates to the hfi1 driver
 - Fixes for the iw_cxgb4 driver
 - Misc core fixes
 - Generic RDMA READ/WRITE API addition
 - SRP updates
 - Misc ipoib updates
 - Minor mlx5 updates
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPIAnAAoJELgmozMOVy/dEYUP/0A6NH5ptzUrwPuLrWoz8h+e
 KBfJE7H09mfKBx0Rq8YmnU+pz4lk8vrMLLaqpbGN57mwO0a1lK9bgc3E6KUhQPhc
 dpGEX/NG1+aILomD7M4l1yAKkG17kxFLD75cLCeaxhO76jBRWsunukqk5mT/u0EG
 fUYZs1fRb9t2LDtTNhPfSFR1+dgP5S17xLhpl9ttn87hTmIuiGWR6ig2nTC7azZD
 G0d7RVjohfY2sDD28YgQiUEJ+q+1ymp3XTaCZhPCVl9VCRPweEdtLKcbNWZIvClx
 ewuTCgADXg8tAL/6zu6bEKqZlC17UrmVJee3csKQLT09PJkSEICeFR/ld6ONUKAF
 nDhi3ySa75Xxg9VDcCnuRkXKK+/zi7oDelZuh9mvMG0JJqPK9rTZDD29j2kBf7C0
 bdx4R5cI4KJWQ/GlCyi/nLiuYkmAiCugzcGnRho4ub+EJ0yX1w6n8KVYr37kFsFu
 q6MCnEfArEgDpbq1wo0+9MWtqBYrnOI/XtG81Zd+6X2MW975qU85wUdUSjg6OOb1
 v1osyAmFDy9A0Y80yY+l1HHrSVIvI0IAWZDfxsbCLQY8O03ZNcvxE2RsrzWd5CKL
 iZsX24tjV0WR9+lORHLfAKB3DL9CcfHv/tHo7q+5iAHmIuWZGrEN22ELkwS/4X7x
 d/V0XDjzs6lgQeTJ7R4B
 =e+Zv
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "Primary 4.7 merge window changes

   - Updates to the new Intel X722 iWARP driver
   - Updates to the hfi1 driver
   - Fixes for the iw_cxgb4 driver
   - Misc core fixes
   - Generic RDMA READ/WRITE API addition
   - SRP updates
   - Misc ipoib updates
   - Minor mlx5 updates"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
  IB/mlx5: Fire the CQ completion handler from tasklet
  net/mlx5_core: Use tasklet for user-space CQ completion events
  IB/core: Do not require CAP_NET_ADMIN for packet sniffing
  IB/mlx4: Fix unaligned access in send_reply_to_slave
  IB/mlx5: Report Scatter FCS device capability when supported
  IB/mlx5: Add Scatter FCS support for Raw Packet QP
  IB/core: Add Scatter FCS create flag
  IB/core: Add Raw Scatter FCS device capability
  IB/core: Add extended device capability flags
  i40iw: pass hw_stats by reference rather than by value
  i40iw: Remove unnecessary synchronize_irq() before free_irq()
  i40iw: constify i40iw_vf_cqp_ops structure
  IB/mlx5: Add UARs write-combining and non-cached mapping
  IB/mlx5: Allow mapping the free running counter on PROT_EXEC
  IB/mlx4: Use list_for_each_entry_safe
  IB/SA: Use correct free function
  IB/core: Fix a potential array overrun in CMA and SA agent
  IB/core: Remove unnecessary check in ibnl_rcv_msg
  IB/IWPM: Fix a potential skb leak
  RDMA/nes: replace custom print_hex_dump()
  ...
2016-05-20 14:35:07 -07:00
Hans Westgaard Ry e3614bc9dc IB/ipoib: Add readout of statistics using ethtool
IPoIB collects statistics of traffic including number of packets
sent/received, number of bytes transferred, and certain errors. This
patch makes these statistics available to be queried by ethtool.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-13 19:39:43 -04:00
Sebastian Andrzej Siewior 01690e9c70 infiniband/ulp/ipoib: remove pkey_mutex
The last user of pkey_mutex was removed in db84f88037 ("IB/ipoib: Use
P_Key change event instead of P_Key polling mechanism") but the lock
remained.
This patch removes it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-05-13 19:39:43 -04:00
Florian Westphal 860e9538a9 treewide: replace dev->trans_start update with helper
Replace all trans_start updates with netif_trans_update helper.
change was done via spatch:

struct net_device *d;
@@
- d->trans_start = jiffies
+ netif_trans_update(d)

Compile tested only.

Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: linux-xtensa@linux-xtensa.org
Cc: linux1394-devel@lists.sourceforge.net
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: MPT-FusionLinux.pdl@broadcom.com
Cc: linux-scsi@vger.kernel.org
Cc: linux-can@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-omap@vger.kernel.org
Cc: linux-hams@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:49 -04:00
Florian Westphal 4d0e965732 drivers: replace dev->trans_start accesses with dev_trans_start
a trans_start struct member exists twice:
- in struct net_device (legacy)
- in struct netdev_queue

Instead of open-coding dev->trans_start usage to obtain the current
trans_start value, use dev_trans_start() instead.

This is not exactly the same, as dev_trans_start also considers
the trans_start values of the netdev queues owned by the device
and provides the most recent one.

For legacy devices this doesn't matter as dev_trans_start can cope
with netdev trans_start values of 0 (they are ignored).

This is a prerequisite to eventual removal of dev->trans_start.

Cc: linux-rdma@vger.kernel.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:47 -04:00
Eli Cohen 68996a6e76 IB/ipoib: Allow mcast packets from other VFs
With SRIOV enabled, two VFs on the same HCA which have the same port LID
and may have the same QP number. To enable receiving multicasts from
such VFs, further qualify the check: ignore the receive only if, in
addition, the packet source gid equals the receiving VF's source gid.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 17:13:14 -04:00
Eli Cohen 9c3c5f8e1f IB/ipoib: Add ndo operations for configuring VFs
Add ndo operations to the network driver that enables configuring the
following operations:

ipoib_set_vf_link_state - configure the VF link policy
ipoib_get_vf_config - get link state configuration
ipoib_set_vf_guid - set a VF port or node GUID
ipoib_get_vf_stats - get statistics of a VF

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 17:13:14 -04:00
Leon Romanovsky fb532d6a79 IB/{core, ulp} Support above 32 possible device capability flags
The old bitwise device_cap_flags variable was limited to u32 which
has all bits already defined. In order to overcome it, we converted
device_cap_flags variable to be u64 type.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-21 16:32:59 -04:00
Hans Westgaard Ry 78a50a5e60 IB/ipoib: Add handling for sending of skb with many frags
IPoIB converts skb-fragments to sge adding 1 extra sge when SG is enabled.
Current codepath assumes that the max number of sge a device support
is at least MAX_SKB_FRAGS+1, there is no interaction with upper layers
to limit number of fragments in an skb if a device suports fewer
sges. The assumptions also lead to requesting a fixed number of sge
when IPoIB creates queue-pairs with SG enabled.

A fallback/slowpath is implemented using skb_linearize to
handle cases where the conversion would result in more sges than supported.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-03-03 09:49:44 -05:00
Alex Estrin 08bc327629 IB/ipoib: fix for rare multicast join race condition
A narrow window for race condition still exist between
multicast join thread and *dev_flush workers.
A kernel crash caused by prolong erratic link state changes
was observed (most likely a faulty cabling):

[167275.656270] BUG: unable to handle kernel NULL pointer dereference at
0000000000000020
[167275.665973] IP: [<ffffffffa05f8f2e>] ipoib_mcast_join+0xae/0x1d0 [ib_ipoib]
[167275.674443] PGD 0
[167275.677373] Oops: 0000 [#1] SMP
...
[167275.977530] Call Trace:
[167275.982225]  [<ffffffffa05f92f0>] ? ipoib_mcast_free+0x200/0x200 [ib_ipoib]
[167275.992024]  [<ffffffffa05fa1b7>] ipoib_mcast_join_task+0x2a7/0x490
[ib_ipoib]
[167276.002149]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[167276.010754]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[167276.019088]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[167276.027737]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
Here was a hit spot:
ipoib_mcast_join() {
..............
      rec.qkey      = priv->broadcast->mcmember.qkey;
                                       ^^^^^^^
.....
 }
Proposed patch should prevent multicast join task to continue
if link state change is detected.

Signed-off-by: Alex Estrin <alex.estrin@intel.com>

Changes from v4:
- as suggested by Doug Ledford, optimized spinlock usage,
i.e. ipoib_mcast_join() is called with lock held.
Changes from v3:
- sync with priv->lock before flag check.
Chages from v2:
- Move check for OPER_UP flag state to mcast_join() to
ensure no event worker is in progress.
- minor style fixes.
Changes from v1:
- No need to lock again if error detected.
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-02-12 14:53:22 -05:00
Carol L Soto bb6a777369 IB/IPoIB: Do not set skb truesize since using one linearskb
We are seeing this warning: at net/core/skbuff.c:4174
and before commit a44878d100 ("IB/ipoib: Use one linear skb in RX flow")
skb truesize was not being set when ipoib was using just one skb.
Removing this line avoids the warning when running tcp tests like iperf.

Fixes: a44878d100 ("IB/ipoib: Use one linear skb in RX flow")
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-02-04 07:08:12 -05:00
Erez Shitrit 50be28de6f IB/IPoIB: Fix kernel panic on multicast flow
ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the
parameter mcast->dev. That mcast is a temporary (used as an iterator)
variable that may be uninitialized.
There is no need to send the variable dev to the function, as each mcast
has its dev as a member in the mcast struct.

This causes the next panic:
RIP: 0010: ipoib_mcast_leave+0x6d/0xf0 [ib_ipoib]
RSP: 0018: EFLAGS: 00010246
RAX: f0201 RBX: 24e00 RCX: 00000
....
....
Stack:
Call Trace:
	ipoib_mcast_remove_list+0x3a/0x70 [ib_ipoib]
	ipoib_mcast_restart_task+0x3bb/0x520 [ib_ipoib]
	process_one_work+0x164/0x470
	worker_thread+0x11d/0x420
	...

Fixes: 5a0e81f6f4 ('IB/IPoIB: factor out common multicast list removal code')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reported-by: Doron Tsur <doront@mellanox.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19 12:59:54 -05:00
Christoph Lameter 432c55fff4 IB/IPoIB: Move multicast specific code out of ipoib_main.c
Code cleanup to move multicast specific code that checks for
a sendonly join to ipoib_multicast.c. This allows the removal
of the export of __ipoib_mcast_find().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-23 14:28:52 -05:00
Christoph Lameter 5a0e81f6f4 IB/IPoIB: factor out common multicast list removal code
Code cleanup to remove multicast specific code from ipoib_main.c

The removal of a list of multicast groups occurs in three places.
Create a new function ipoib_mcast_remove_list(). Use this new
function in ipoib_main.c too.
That in turn allows the dropping of two functions that were
exported from ipoib_multicast.c for expiration of mc groups.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-23 14:28:00 -05:00
Doug Ledford 882f3b3b91 Merge branches '4.5/Or-cleanup' and '4.5/rdma-cq' into k.o/for-4.5
Signed-off-by: Doug Ledford <dledford@redhat.com>

Conflicts:
	drivers/infiniband/ulp/iser/iser_verbs.c
2015-12-22 17:03:15 -05:00
Or Gerlitz 4a061b287b IB/ulps: Avoid calling ib_query_device
Instead, use the cached copy of the attributes present on the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22 14:39:00 -05:00
Christoph Hellwig 14d3a3b249 IB: add a proper completion queue abstraction
This adds an abstraction that allows ULPs to simply pass a completion
object and completion callback with each submitted WR and let the RDMA
core handle the nitty gritty details of how to handle completion
interrupts and poll the CQ.

In detail there is a new ib_cqe structure which just contains the
completion callback, and which can be used to get at the containing
object using container_of.  It is pointed to by the WR and WC as an
alternative to the wr_id field, similar to how many ULPs already use
the field to store a pointer using casts.

A driver using the new completion callbacks allocates it's CQs using
the new ib_create_cq API, which in addition to the number of CQEs and
the completion vectors also takes a mode on how we poll for CQEs.
Three modes are available: direct for drivers that never take CQ
interrupts and just poll for them, softirq to poll from softirq context
using the to be renamed blk-iopoll infrastructure which takes care of
rearming and budgeting, or a workqueue for consumer who want to be
called from user context.

Thanks a lot to Sagi Grimberg who helped reviewing the API, wrote
the current version of the workqueue code because my two previous
attempts sucked too much and converted the iSER initiator to the new
API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2015-12-11 14:10:43 -08:00
Doug Ledford 63e8790d39 Merge branch 'wr-cleanup' into k.o/for-4.4 2015-10-28 22:23:34 -04:00
Matan Barak 55ee3ab2e4 IB/core: Add netdev and gid attributes paramteres to cache
Adding an ability to query the IB cache by a netdev and get the
attributes of a GID. These parameters are necessary in order to
successfully resolve the required GID (when the netdevice is known)
and get the Ethernet L2 attributes from a GID.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-21 23:48:17 -04:00
Christoph Lameter 0b5c9279e5 IB/ipoib: For sendonly join free the multicast group on leave
When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak
that causes ib_dealloc_pd to fail and print a WARN_ON message
and backtrace.

Fixes: bd99b2e05c (IB/ipoib: Expire sendonly multicast joins)
Signed-off-by: Christoph Lameter <cl@linux.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-13 16:43:59 -04:00
Christoph Hellwig e622f2f4ad IB: split struct ib_send_wr
This patch split up struct ib_send_wr so that all non-trivial verbs
use their own structure which embedds struct ib_send_wr.  This dramaticly
shrinks the size of a WR for most common operations:

sizeof(struct ib_send_wr) (old):	96

sizeof(struct ib_send_wr):		48
sizeof(struct ib_rdma_wr):		64
sizeof(struct ib_atomic_wr):		96
sizeof(struct ib_ud_wr):		88
sizeof(struct ib_fast_reg_wr):		88
sizeof(struct ib_bind_mw_wr):		96
sizeof(struct ib_sig_handover_wr):	80

And with Sagi's pending MR rework the fast registration WR will also be
down to a reasonable size:

sizeof(struct ib_fastreg_wr):		64

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
Tested-by: Haggai Eran <haggaie@mellanox.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
2015-10-08 11:09:10 +01:00
Doug Ledford 2866196f29 IB/ipoib: increase the max mcast backlog queue
When performing sendonly joins, we queue the packets that trigger
the join until the join completes.  This may take on the order of
hundreds of milliseconds.  It is easy to have many more than three
packets come in during that time.  Expand the maximum queue depth
in order to try and prevent dropped packets during the time it
takes to join the multicast group.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 22:30:24 -04:00
Doug Ledford c3852ab0e6 IB/ipoib: Make sendonly multicast joins create the mcast group
Since IPoIB should, as much as possible, emulate how multicast
sends work on Ethernet for regular TCP/IP apps, there should be
no requirement to subscribe to a multicast group before your
sends are properly sent.  However, due to the difference in how
multicast is handled on InfiniBand, we must join the appropriate
multicast group before we can send to it.  Previously we tried
not to trigger the auto-create feature of the subnet manager when
doing this because we didn't have tracking of these sendonly
groups and the auto-creation might never get undone.  The previous
patch added timing to these sendonly joins and allows us to
leave them after a reasonable idle expiration time.  So supply
all of the information needed to auto-create group.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 14:46:58 -04:00