Commit graph

1403 commits

Author SHA1 Message Date
Qing Huang 1383cb8103 mlx4_core: allocate ICM memory in page size chunks
When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Acked-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25 10:22:53 -04:00
Jack Morgenstein d546b67cda net/mlx4: Fix irq-unsafe spinlock usage
spin_lock/unlock was used instead of spin_un/lock_irq
in a procedure used in process space, on a spinlock
which can be grabbed in an interrupt.

This caused the stack trace below to be displayed (on kernel
4.17.0-rc1 compiled with Lock Debugging enabled):

[  154.661474] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  154.668909] 4.17.0-rc1-rdma_rc_mlx+ #3 Tainted: G          I
[  154.675856] -----------------------------------------------------
[  154.682706] modprobe/10159 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[  154.690254] 00000000f3b0e495 (&(&qp_table->lock)->rlock){+.+.}, at: mlx4_qp_remove+0x20/0x50 [mlx4_core]
[  154.700927]
and this task is already holding:
[  154.707461] 0000000094373b5d (&(&cq->lock)->rlock/1){....}, at: destroy_qp_common+0x111/0x560 [mlx4_ib]
[  154.718028] which would create a new lock dependency:
[  154.723705]  (&(&cq->lock)->rlock/1){....} -> (&(&qp_table->lock)->rlock){+.+.}
[  154.731922]
but this new dependency connects a SOFTIRQ-irq-safe lock:
[  154.740798]  (&(&cq->lock)->rlock){..-.}
[  154.740800]
... which became SOFTIRQ-irq-safe at:
[  154.752163]   _raw_spin_lock_irqsave+0x3e/0x50
[  154.757163]   mlx4_ib_poll_cq+0x36/0x900 [mlx4_ib]
[  154.762554]   ipoib_tx_poll+0x4a/0xf0 [ib_ipoib]
...
to a SOFTIRQ-irq-unsafe lock:
[  154.815603]  (&(&qp_table->lock)->rlock){+.+.}
[  154.815604]
... which became SOFTIRQ-irq-unsafe at:
[  154.827718] ...
[  154.827720]   _raw_spin_lock+0x35/0x50
[  154.833912]   mlx4_qp_lookup+0x1e/0x50 [mlx4_core]
[  154.839302]   mlx4_flow_attach+0x3f/0x3d0 [mlx4_core]

Since mlx4_qp_lookup() is called only in process space, we can
simply replace the spin_un/lock calls with spin_un/lock_irq calls.

Fixes: 6dc06c08be ("net/mlx4: Fix the check in attaching steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 15:48:58 -04:00
Colin Ian King 4f7f56b6b1 net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
Trivial fix to spelling mistake in mlx4_dbg debug message and also
change the phrasing of the message so that is is more readable

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 14:58:10 -04:00
David S. Miller 6f6e434aa2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.

TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.

The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.

Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 16:01:54 -04:00
Tarick Bedeir 57f6f99fda net/mlx4_core: Fix error handling in mlx4_init_port_info.
Avoid exiting the function with a lingering sysfs file (if the first
call to device_create_file() fails while the second succeeds), and avoid
calling devlink_port_unregister() twice.

In other words, either mlx4_init_port_info() succeeds and returns zero, or
it fails, returns non-zero, and requires no cleanup.

Fixes: 096335b3f9 ("mlx4_core: Allow dynamic MTU configuration for IB ports")
Signed-off-by: Tarick Bedeir <tarick@google.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-14 16:29:08 -04:00
David S. Miller b2d6cee117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The bpf syscall and selftests conflicts were trivial
overlapping changes.

The r8169 change involved moving the added mdelay from 'net' into a
different function.

A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts.  I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 20:53:22 -04:00
Christophe JAILLET a577d868b7 net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.

So, doing some explicit kfree in the error handling path would lead to
some double kfree.

Simplify code to avoid such a case.

Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 17:46:45 -04:00
Moshe Shemesh 6ad4e91c6d net/mlx4_en: Verify coalescing parameters are in range
Add check of coalescing parameters received through ethtool are within
range of values supported by the HW.
Driver gets the coalescing rx/tx-usecs and rx/tx-frames as set by the
users through ethtool. The ethtool support up to 32 bit value for each.
However, mlx4 modify cq limits the coalescing time parameter and
coalescing frames parameters to 16 bits.
Return out of range error if user tries to set these parameters to
higher values.
Change type of sample-interval and adaptive_rx_coal parameters in mlx4
driver to u32 as the ethtool holds them as u32 and these parameters are
not limited due to mlx4 HW.

Fixes: c27a02cd94 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 16:09:05 -04:00
Tariq Toukan e57328384b net/mlx4_core: Use msi_x module param to limit num of MSI-X irqs
Extend the boolean interpretation of msi_x module parameter
to numerical, as follows:

0   - Don't use MSI-X.
1   - Use MSI-X, driver decides the num of MSI-X irqs.
>=2 - Use MSI-X, limit number of MSI-X irqs to msi_x.
      In SRIOV, this limits the number of MSI-X irqs per VF.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 16:08:25 -04:00
Yishai Hadas 86a3e5d02c net/mlx4_core: Add PCI calls for suspend/resume
Implement suspend/resume callbacks in struct pci_driver.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 16:08:25 -04:00
Eran Ben Elisha e5c9a70545 net/mlx4_core: Report driver version to FW
If supported, write a driver version string to FW as part of the
INIT_HCA command.

Example of driver version: "Linux,mlx4_core,4.0-0"

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 16:08:25 -04:00
Eric Dumazet 2d943adf86 net/mlx4_en: optimizes get_fixed_ipv6_csum()
While trying to support CHECKSUM_COMPLETE for IPV6 fragments,
I had to experiments various hacks in get_fixed_ipv6_csum().
I must admit I could not find how to implement this :/

However, get_fixed_ipv6_csum() does a lot of redundant operations,
calling csum_partial() twice.

First csum_partial() computes the checksum of saddr and daddr,
put in @csum_pseudo_hdr. Undone later in the second csum_partial()
computed on whole ipv6 header.

Then nexthdr is added once, added a second time, then substracted.

payload_len is added once, then substracted.

Really all this can be reduced to two add_csum(), to add back 6 bytes
that were removed by mlx4 when providing hw_checksum in RX descriptor.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 11:59:19 -04:00
David S. Miller a7b15ab887 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes in selftests Makefile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 09:58:56 -04:00
Colin Ian King 26ff75857e net/mlx4: fix spelling mistake: "failedi" -> "failed"
trivial fix to spelling mistake in mlx4_warn message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 14:17:29 -04:00
Alexander Duyck b86629ebd5 mlx4: Don't bother using skb_tx_hash in mlx4_en_select_queue
The code in the fallback path has supported XDP in conjunction with the Tx
traffic classification for TCs for over a year now. So instead of just
calling skb_tx_hash for every packet we are better off using the fallback
since that will record the Tx queue to the socket and then that can be used
instead of having to recompute the hash every time.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 22:01:32 -04:00
Nikita V. Shirokov e5e0a59b7c bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for mlx4 driver we will just calculate packet's length unconditionally
(the same way as it's already being done in mlx5)

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Linus Torvalds 3c0d551e02 pci-v4.17-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlrHeY8UHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxhLRAAndV/0NDyWZU0eZNM6twri2SEFnF7
 E4ar+YthxDxxJG4TLJbIA12jc5NgHZy4WuttDa6Jb99KreBXIHJFlNi/V/tme6zf
 +yXUuxWae7wJzBiaay57VqLGSc80gt/LTgjLa1siwQqjTbO3wSXR6JJXNaE9FtQ4
 /jL61t8bD1Peb5cWTpt9p0hrnKI0/pHwASdReyFS4F/HDKdvpof7BxE/OU3HSxxA
 XKC2v6RjY4S93vkzvApDXQ+vhKquVRK7/ojyTXQUO/GIzcARprO7H4k62N4ar0x/
 qbXLkR8IMkwA8ecsNmcL92ftb/cXoHfd+wdK8WpijqzF4kW4SdteVWbIhUzI0gbr
 0gjDYIzjplvH3pZGv/qvx+8sFtAP95OdPjuAAW2qJ9TCVfmiS8naNFCvcxg87RhD
 gjyQD3If1X7F8wy309lhq7VNyRexTHgIMgTXHyFvuZMzn/Qe1huL2XCwDcEAg/OX
 AvU2iuSE5tWAh7gIUMF/aWi3uoeJUyyoru5ZR//gqdFfx9YxpSimO1UDXnpPi8SR
 Iz/jzHJc0aWGYdQ9l6HiSbJF3P/QQcWYs9igt0A7BRGB05SPdWCh7sSO70FJa8ME
 f4WID5/qEiaH26kiSRX4cUqpc8Amk8bT0DXw2OT57qy3JM0ZdV5ENQX11pSpr9hv
 uLEf0DU7AEmdvzQ=
 =T++R
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:

 - move pci_uevent_ers() out of pci.h (Michael Ellerman)

 - skip ASPM common clock warning if BIOS already configured it (Sinan
   Kaya)

 - fix ASPM Coverity warning about threshold_ns (Gustavo A. R. Silva)

 - remove last user of pci_get_bus_and_slot() and the function itself
   (Sinan Kaya)

 - add decoding for 16 GT/s link speed (Jay Fang)

 - add interfaces to get max link speed and width (Tal Gilboa)

 - add pcie_bandwidth_capable() to compute max supported link bandwidth
   (Tal Gilboa)

 - add pcie_bandwidth_available() to compute bandwidth available to
   device (Tal Gilboa)

 - add pcie_print_link_status() to log link speed and whether it's
   limited (Tal Gilboa)

 - use PCI core interfaces to report when device performance may be
   limited by its slot instead of doing it in each driver (Tal Gilboa)

 - fix possible cpqphp NULL pointer dereference (Shawn Lin)

 - rescan more of the hierarchy on ACPI hotplug to fix Thunderbolt/xHCI
   hotplug (Mika Westerberg)

 - add support for PCI I/O port space that's neither directly accessible
   via CPU in/out instructions nor directly mapped into CPU physical
   memory space. This is fairly intrusive and includes minor changes to
   interfaces used for I/O space on most platforms (Zhichang Yuan, John
   Garry)

 - add support for HiSilicon Hip06/Hip07 LPC I/O space (Zhichang Yuan,
   John Garry)

 - use PCI_EXP_DEVCTL2_COMP_TIMEOUT in rapidio/tsi721 (Bjorn Helgaas)

 - remove possible NULL pointer dereference in of_pci_bus_find_domain_nr()
   (Shawn Lin)

 - report quirk timings with dev_info (Bjorn Helgaas)

 - report quirks that take longer than 10ms (Bjorn Helgaas)

 - add and use Altera Vendor ID (Johannes Thumshirn)

 - tidy Makefiles and comments (Bjorn Helgaas)

 - don't set up INTx if MSI or MSI-X is enabled to align cris, frv,
   ia64, and mn10300 with x86 (Bjorn Helgaas)

 - move pcieport_if.h to drivers/pci/pcie/ to encapsulate it (Frederick
   Lawler)

 - merge pcieport_if.h into portdrv.h (Bjorn Helgaas)

 - move workaround for BIOS PME issue from portdrv to PCI core (Bjorn
   Helgaas)

 - completely disable portdrv with "pcie_ports=compat" (Bjorn Helgaas)

 - remove portdrv link order dependency (Bjorn Helgaas)

 - remove support for unused VC portdrv service (Bjorn Helgaas)

 - simplify portdrv feature permission checking (Bjorn Helgaas)

 - remove "pcie_hp=nomsi" parameter (use "pci=nomsi" instead) (Bjorn
   Helgaas)

 - remove unnecessary "pcie_ports=auto" parameter (Bjorn Helgaas)

 - use cached AER capability offset (Frederick Lawler)

 - don't enable DPC if BIOS hasn't granted AER control (Mika Westerberg)

 - rename pcie-dpc.c to dpc.c (Bjorn Helgaas)

 - use generic pci_mmap_resource_range() instead of powerpc and xtensa
   arch-specific versions (David Woodhouse)

 - support arbitrary PCI host bridge offsets on sparc (Yinghai Lu)

 - remove System and Video ROM reservations on sparc (Bjorn Helgaas)

 - probe for device reset support during enumeration instead of runtime
   (Bjorn Helgaas)

 - add ACS quirk for Ampere (née APM) root ports (Feng Kan)

 - add function 1 DMA alias quirk for Marvell 88SE9220 (Thomas
   Vincent-Cross)

 - protect device restore with device lock (Sinan Kaya)

 - handle failure of FLR gracefully (Sinan Kaya)

 - handle CRS (config retry status) after device resets (Sinan Kaya)

 - skip various config reads for SR-IOV VFs as an optimization
   (KarimAllah Ahmed)

 - consolidate VPD code in vpd.c (Bjorn Helgaas)

 - add Tegra dependency on PCI_MSI_IRQ_DOMAIN (Arnd Bergmann)

 - add DT support for R-Car r8a7743 (Biju Das)

 - fix a PCI_EJECT vs PCI_BUS_RELATIONS race condition in Hyper-V host
   bridge driver that causes a general protection fault (Dexuan Cui)

 - fix Hyper-V host bridge hang in MSI setup on 1-vCPU VMs with SR-IOV
   (Dexuan Cui)

 - fix Hyper-V host bridge hang when ejecting a VF before setting up MSI
   (Dexuan Cui)

 - make several structures static (Fengguang Wu)

 - increase number of MSI IRQs supported by Synopsys DesignWare bridges
   from 32 to 256 (Gustavo Pimentel)

 - implemented multiplexed IRQ domain API and remove obsolete MSI IRQ
   API from DesignWare drivers (Gustavo Pimentel)

 - add Tegra power management support (Manikanta Maddireddy)

 - add Tegra loadable module support (Manikanta Maddireddy)

 - handle 64-bit BARs correctly in endpoint support (Niklas Cassel)

 - support optional regulator for HiSilicon STB (Shawn Guo)

 - use regulator bulk API for Qualcomm apq8064 (Srinivas Kandagatla)

 - support power supplies for Qualcomm msm8996 (Srinivas Kandagatla)

* tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (123 commits)
  MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver
  HISI LPC: Add ACPI support
  ACPI / scan: Do not enumerate Indirect IO host children
  ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use
  HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings
  of: Add missing I/O range exception for indirect-IO devices
  PCI: Apply the new generic I/O management on PCI IO hosts
  PCI: Add fwnode handler as input param of pci_register_io_range()
  PCI: Remove __weak tag from pci_register_io_range()
  MAINTAINERS: Add missing /drivers/pci/cadence directory entry
  fm10k: Report PCIe link properties with pcie_print_link_status()
  net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth
  net/mlx5: Report PCIe link properties with pcie_print_link_status()
  net/mlx4_core: Report PCIe link properties with pcie_print_link_status()
  PCI: Add pcie_print_link_status() to log link speed and whether it's limited
  PCI: Add pcie_bandwidth_available() to compute bandwidth available to device
  misc: pci_endpoint_test: Handle 64-bit BARs properly
  PCI: designware-ep: Make dw_pcie_ep_reset_bar() handle 64-bit BARs properly
  PCI: endpoint: Make sure that BAR_5 does not have 64-bit flag set when clearing
  PCI: endpoint: Make epc->ops->clear_bar()/pci_epc_clear_bar() take struct *epf_bar
  ...
2018-04-06 18:31:06 -07:00
Linus Torvalds 19fd08b85b Merge candidates for 4.17 merge window
- Fix RDMA uapi headers to actually compile in userspace and be more
   complete
 
 - Three shared with netdev pull requests from Mellanox:
 
    * 7 patches, mostly to net with 1 IB related one at the back). This
      series addresses an IRQ performance issue (patch 1), cleanups related to
      the fix for the IRQ performance problem (patches 2-6), and then extends
      the fragmented completion queue support that already exists in the net
      side of the driver to the ib side of the driver (patch 7).
 
    * Mostly IB, with 5 patches to net that are needed to support the remaining
      10 patches to the IB subsystem. This series extends the current
      'representor' framework when the mlx5 driver is in switchdev mode from
      being a netdev only construct to being a netdev/IB dev construct. The IB
      dev is limited to raw Eth queue pairs only, but by having an IB dev of
      this type attached to the representor for a switchdev port, it enables
      DPDK to work on the switchdev device.
 
    * All net related, but needed as infrastructure for the rdma driver
 
 - Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
 
 - SRP performance updates
 
 - IB uverbs write path cleanup patch series from Leon
 
 - Add RDMA_CM support to ib_srpt. This is disabled by default.  Users need to
   set the port for ib_srpt to listen on in configfs in order for it to be
   enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
 
 - TSO and Scatter FCS support in mlx4
 
 - Refactor of modify_qp routine to resolve problems seen while working on new
   code that is forthcoming
 
 - More refactoring and updates of RDMA CM for containers support from Parav
 
 - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
   user API features
 
 - Infrastructure updates for the new IOCTL interface, based on increased usage
 
 - ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
   kernel as was originally intended. See the commit messages for
   extensive details
 
 - Syzkaller bugs and code cleanups motivated by them
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJax5Z0AAoJEDht9xV+IJsacCwQAJBIgmLCvVp5fBu2kJcXMMVI
 y3l2YNzAUJvDDKv1r5yTC9ugBXEkDtgzi/W/C2/5es2yUG/QeT/zzQ3YPrtsnN68
 5FkiXQ35Tt7+PBHMr0cacGRmF4M3Td3MeW0X5aJaBKhqlNKwA+aF18pjGWBmpVYx
 URYCwLb5BZBKVh4+1Leebsk4i0/7jSauAqE5M+9notuAUfBCoY1/Eve3DipEIBBp
 EyrEnMDIdujYRsg4KHlxFKKJ1EFGItknLQbNL1+SEa0Oe0SnEl5Bd53Yxfz7ekNP
 oOWQe5csTcs3Yr4Ob0TC+69CzI71zKbz6qPDILTwXmsPFZJ9ipJs4S8D6F7ra8tb
 D5aT1EdRzh/vAORPC9T3DQ3VsHdvhwpUMG7knnKrVT9X/g7E+gSji1BqaQaTr/xs
 i40GepHT7lM/TWEuee/6LRpqdhuOhud7vfaRFwn2JGRX9suqTcvwhkBkPUDGV5XX
 5RkHcWOb/7KvmpG7S1gaRGK5kO208LgmAZi7REaJFoZB74FqSneMR6NHIH07ha41
 Zou7rnxV68CT2bgu27m+72EsprgmBkVDeEzXgKxVI/+PZ1oadUFpgcZ3pRLOPWVx
 rEqjHu65rlA/YPog4iXQaMfSwt/oRD3cVJS/n8EdJKXi4Qt2RDDGdyOmt74w4prM
 QuLEdvJIFmwrND1KDoqn
 =Ku8g
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Doug and I are at a conference next week so if another PR is sent I
  expect it to only be bug fixes. Parav noted yesterday that there are
  some fringe case behavior changes in his work that he would like to
  fix, and I see that Intel has a number of rc looking patches for HFI1
  they posted yesterday.

  Parav is again the biggest contributor by patch count with his ongoing
  work to enable container support in the RDMA stack, followed by Leon
  doing syzkaller inspired cleanups, though most of the actual fixing
  went to RC.

  There is one uncomfortable series here fixing the user ABI to actually
  work as intended in 32 bit mode. There are lots of notes in the commit
  messages, but the basic summary is we don't think there is an actual
  32 bit kernel user of drivers/infiniband for several good reasons.

  However we are seeing people want to use a 32 bit user space with 64
  bit kernel, which didn't completely work today. So in fixing it we
  required a 32 bit rxe user to upgrade their userspace. rxe users are
  still already quite rare and we think a 32 bit one is non-existing.

   - Fix RDMA uapi headers to actually compile in userspace and be more
     complete

   - Three shared with netdev pull requests from Mellanox:

      * 7 patches, mostly to net with 1 IB related one at the back).
        This series addresses an IRQ performance issue (patch 1),
        cleanups related to the fix for the IRQ performance problem
        (patches 2-6), and then extends the fragmented completion queue
        support that already exists in the net side of the driver to the
        ib side of the driver (patch 7).

      * Mostly IB, with 5 patches to net that are needed to support the
        remaining 10 patches to the IB subsystem. This series extends
        the current 'representor' framework when the mlx5 driver is in
        switchdev mode from being a netdev only construct to being a
        netdev/IB dev construct. The IB dev is limited to raw Eth queue
        pairs only, but by having an IB dev of this type attached to the
        representor for a switchdev port, it enables DPDK to work on the
        switchdev device.

      * All net related, but needed as infrastructure for the rdma
        driver

   - Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers

   - SRP performance updates

   - IB uverbs write path cleanup patch series from Leon

   - Add RDMA_CM support to ib_srpt. This is disabled by default. Users
     need to set the port for ib_srpt to listen on in configfs in order
     for it to be enabled
     (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)

   - TSO and Scatter FCS support in mlx4

   - Refactor of modify_qp routine to resolve problems seen while
     working on new code that is forthcoming

   - More refactoring and updates of RDMA CM for containers support from
     Parav

   - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
     memory' user API features

   - Infrastructure updates for the new IOCTL interface, based on
     increased usage

   - ABI compatibility bug fixes to fully support 32 bit userspace on 64
     bit kernel as was originally intended. See the commit messages for
     extensive details

   - Syzkaller bugs and code cleanups motivated by them"

* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
  IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
  IB/mlx5: Device memory mr registration support
  net/mlx5: Mkey creation command adjustments
  IB/mlx5: Device memory support in mlx5_ib
  net/mlx5: Query device memory capabilities
  IB/uverbs: Add device memory registration ioctl support
  IB/uverbs: Add alloc/free dm uverbs ioctl support
  IB/uverbs: Add device memory capabilities reporting
  IB/uverbs: Expose device memory capabilities to user
  RDMA/qedr: Fix wmb usage in qedr
  IB/rxe: Removed GID add/del dummy routines
  RDMA/qedr: Zero stack memory before copying to user space
  IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
  IB/mlx5: Add information for querying IPsec capabilities
  IB/mlx5: Add IPsec support for egress and ingress
  {net,IB}/mlx5: Add ipsec helper
  IB/mlx5: Add modify_flow_action_esp verb
  IB/mlx5: Add implementation for create and destroy action_xfrm
  IB/uverbs: Introduce ESP steering match filter
  IB/uverbs: Add modify ESP flow_action
  ...
2018-04-06 17:35:43 -07:00
Tal Gilboa 190b509c8d net/mlx4_core: Report PCIe link properties with pcie_print_link_status()
Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-04-03 08:58:31 -05:00
David S. Miller c0b458a946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:

1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
   MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE

2) In 'net-next' params->log_rq_size is renamed to be
   params->log_rq_mtu_frames.

3) In 'net-next' params->hard_mtu is added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 19:49:34 -04:00
Eric Dumazet ce5a453c44 net/mlx4_en: CHECKSUM_COMPLETE support for fragments
Refine the RX check summing handling to propagate the
hardware provided checksum so that we do not have to
compute it later in software.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 13:56:37 -04:00
Moshe Shemesh 461d5f1b59 net/mlx4_core: Fix memory leak while delete slave's resources
mlx4_delete_all_resources_for_slave in resource tracker should free all
memory allocated for a slave.
While releasing memory of fs_rule, it misses releasing memory of
fs_rule->mirr_mbox.

Fixes: 78efed2751 ('net/mlx4_core: Support mirroring VF DMFS rules on both ports')
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 12:02:30 -04:00
Eran Ben Elisha 6e8814ceb7 net/mlx4_en: Fix mixed PFC and Global pause user control requests
Global pause and PFC configuration should be mutually exclusive (i.e. only
one of them at most can be set). However, once PFC was turned off,
driver automatically turned Global pause on. This is a bug.

Fix the driver behaviour to turn off PFC/Global once the user turned the
other on.

This also fixed a weird behaviour that at a current time, the profile
had both PFC and global pause configuration turned on, which is
Hardware-wise impossible and caused returning false positive indication
to query tools.

In addition, fix error code when setting global pause or PFC to change
metadata only upon successful change.

Also, removed useless debug print.

Fixes: af7d518526 ("net/mlx4_en: Add DCB PFC support through CEE netlink commands")
Fixes: c27a02cd94 ("mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 12:02:30 -04:00
Joe Perches d3757ba4c1 ethernet: Use octal not symbolic permissions
Prefer the direct use of octal for permissions.

Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.

Miscellanea:

o Whitespace neatening around these conversions.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 12:07:49 -04:00
Jason Gunthorpe 48962f5c6f RDMA/mlx4: Move flag constants to uapi header
MLX4_USER_DEV_CAP_LARGE_CQE (via mlx4_ib_alloc_ucontext_resp.dev_caps)
and MLX4_IB_QUERY_DEV_RESP_MASK_CORE_CLOCK_OFFSET (via
mlx4_uverbs_ex_query_device_resp.comp_mask) are copied directly to
userspace and form part of the uAPI.

Move them to the uapi header where they belong.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-15 15:58:03 -06:00
Eric Dumazet d8c13f2271 net/mlx4_en: try to use high order pages for RX rings
RX rings can fit most of the time in a contiguous piece of memory,
so lets use kvzalloc_node/kvfree instead of vzalloc_node/vfree

Note that kvzalloc_node() automatically falls back to another node,
there is no need to do the fallback ourselves.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:35:23 -05:00
Tariq Toukan a970d8dba5 net/mlx4_en: RX csum, pre-define enabled protocols for IP status masking
Pre-define a mask for IP status of a completion, that tests the
MLX4_CQE_STATUS_IPV6 only in case CONFIG_IPV6 is enabled.
Use it for IP status testing upon completion, instead of separating
the datapath into two flows.
This takes common code structures (such as closing parenthesis)
back to their original place, and makes code more readable.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:53:26 -05:00
Tariq Toukan 1cb8b1216c net/mlx4_en: Combine checks of end-cases in RX completion function
Combine two end-cases in the same if statement with a single return value.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:53:26 -05:00
Eran Ben Elisha 4f32e1c4a9 net/mlx4_en: Remove unnecessary warn print in reset config
In mlx4_en_reset_config, there was a redundant warn print that was left
from previous versions of this function. No warn is needed anymore.

This warn can be confusing when RX-FCS is changed:
Turn OFF RX-FCS:
  mlx4_en: eth1: Changing device configuration rx filter(0) rx vlan(1)
Turn ON RX-FCS:
  mlx4_en: eth1: Changing device configuration rx filter(0) rx vlan(1)

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:53:26 -05:00
Eran Ben Elisha f26d0d2543 net/mlx4_en: Add physical RX/TX bytes/packets counters
Add physical RX/TX packets/bytes counters into ethtool output to monitor
all traffic that was received and transmitted on the port. These
counters are available only for none Virtual Function.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:53:26 -05:00
Linus Torvalds b2fe5fa686 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

 2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

 3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

 4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

 5) Add eBPF based queue selection to tun, from Jason Wang.

 6) Lockless qdisc support, from John Fastabend.

 7) SCTP stream interleave support, from Xin Long.

 8) Smoother TCP receive autotuning, from Eric Dumazet.

 9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

17) Add resource abstration to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
  tls: Add support for encryption using async offload accelerator
  ip6mr: fix stale iterator
  net/sched: kconfig: Remove blank help texts
  openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
  tcp_nv: fix potential integer overflow in tcpnv_acked
  r8169: fix RTL8168EP take too long to complete driver initialization.
  qmi_wwan: Add support for Quectel EP06
  rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
  ipmr: Fix ptrdiff_t print formatting
  ibmvnic: Wait for device response when changing MAC
  qlcnic: fix deadlock bug
  tcp: release sk_frag.page in tcp_disconnect
  ipv4: Get the address of interface correctly.
  net_sched: gen_estimator: fix lockdep splat
  net: macb: Handle HRESP error
  net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
  ipv6: addrconf: break critical section in addrconf_verify_rtnl()
  ipv6: change route cache aging logic
  i40e/i40evf: Update DESC_NEEDED value to reflect larger value
  bnxt_en: cleanup DIM work on device shutdown
  ...
2018-01-31 14:31:10 -08:00
Jason Gunthorpe e7996a9a77 Linux 4.15
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJabj6pAAoJEHm+PkMAQRiGs8cIAJQFkCWnbz86e3vG4DuWhyA8
 CMGHCQdUOxxFGa/ixhIiuetbC0x+JVHAjV2FwVYbAQfaZB3pfw2iR1ncQxpAP1AI
 oLU9vBEqTmwKMPc9CM5rRfnLFWpGcGwUNzgPdxD5yYqGDtcM8K840mF6NdkYe5AN
 xU8rv1wlcFPF4A5pvHCH0pvVmK4VxlVFk/2H67TFdxBs4PyJOnSBnf+bcGWgsKO6
 hC8XIVtcKCH2GfFxt5d0Vgc5QXJEpX1zn2mtCa1MwYRjN2plgYfD84ha0xE7J0B0
 oqV/wnjKXDsmrgVpncr3txd4+zKJFNkdNRE4eLAIupHo2XHTG4HvDJ5dBY2NhGU=
 =sOml
 -----END PGP SIGNATURE-----

Merge tag v4.15 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

To resolve conflicts in:
 drivers/infiniband/hw/mlx5/main.c
 drivers/infiniband/hw/mlx5/qp.c

From patches merged into the -rc cycle. The conflict resolution matches
what linux-next has been carrying.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-30 09:30:00 -07:00
Jack Morgenstein 852f692759 IB/mlx4: Fix incorrectly releasing steerable UD QPs when have only ETH ports
Allocating steerable UD QPs depends on having at least one IB port,
while releasing those QPs does not.

As a result, when there are only ETH ports, the IB (RoCE) driver
requests releasing a qp range whose base qp is zero, with
qp count zero.

When SR-IOV is enabled, and the VF driver is running on a VM over
a hypervisor which treats such qp release calls as errors
(rather than NOPs), we see lines in the VM message log like:

 mlx4_core 0002:00:02.0: Failed to release qp range base:0 cnt:0

Fix this by adding a check for a zero count in mlx4_release_qp_range()
(which thus treats releasing 0 qps as a nop), and eliminating the
check for device managed flow steering when releasing steerable UD QPs.
(Freeing ib_uc_qpns_bitmap unconditionally is also OK, since it
remains NULL when steerable UD QPs are not allocated).

Cc: <stable@vger.kernel.org>
Fixes: 4196670be7 ("IB/mlx4: Don't allocate range of steerable UD QPs for Ethernet-only device")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Eugenia Emantayev 7589fd5c8c net/mlx4_en: Align behavior of set ring size flow via ethtool
In current implementation, any requested RX/TX ring size value
that is less than minimum is silently casted to nearest valid value.
Update this behavior to align with mlx5 behavior by printing warning
in dmesg and remaining the size unchanged.
Kernel is responsible for verifying against the maximum.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:54:49 -05:00
Jesper Dangaard Brouer ae75415de1 mlx4: setup xdp_rxq_info
Driver hook points for xdp_rxq_info:
 * reg  : mlx4_en_create_rx_ring
 * unreg: mlx4_en_destroy_rx_ring

Tested on actual hardware.

Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05 15:21:21 -08:00
Moni Shoua a42b63c1ac net/mlx4_en: Change default QoS settings
Change the default mapping between TC and TCG as follows:

Prio     |             TC/TCG
         |      from             to
         |    (set by FW)      (set by SW)
---------+-----------------------------------
0        |      0/0              0/7
1        |      1/0              0/6
2        |      2/0              0/5
3        |      3/0              0/4
4        |      4/0              0/3
5        |      5/0              0/2
6        |      6/0              0/1
7        |      7/0              0/0

These new settings cause that a pause frame for any prio stops
traffic for all prios.

Fixes: 564c274c3d ("net/mlx4_en: DCB QoS support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:24:05 -05:00
Tariq Toukan fd4a3e2828 net/mlx4_core: Cleanup FMR unmapping flow
Remove redundant and not essential operations in fmr unmap/free.
According to device spec, in FMR unmap it is sufficient to set
ownership bit to SW. This allows remapping afterwards.

Fixes: 8ad11fb6b0 ("IB/mlx4: Implement FMRs")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:24:05 -05:00
Tariq Toukan dc484851ed net/mlx4_en: RX csum, reorder branches
Use early goto commands, and save else branches.
This uses less indentations and brackets, making the code
more readable.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:24:05 -05:00
Tariq Toukan 345ef18c24 net/mlx4_en: RX csum, remove redundant branches and checks
Do not check IPv6 bit in cqe status if CONFIG_IPV6 is not enabled.
Function check_csum() is reached only with IPv4 or IPv6 set (if enabled),
if IPv6 is not set (or is not enabled) it is redundant to test the
IPv4 bit.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-28 12:24:05 -05:00
Eran Ben Elisha 5a1647c391 net/mlx4_en: Fill all counters under one call of stats lock
Before this patch, the stats_lock was acquired twice. In between the
locks Driver sent command to gather some more statistics (per priority
and counter statistics). If the stats lock was acquired by get
statistics NDO in between we would have report out of sync counters.

Fix this by collecting all stats from Firmware in advance and then
fill the Software structs under one lock.

Fixes: 0b131561a7 ("net/mlx4_en: Add Flow control statistics display via ethtool")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:38:37 -05:00
Eran Ben Elisha 0bb9fc4f54 net/mlx4_core: Fix wrong calculation of free counters
The field res_free indicates the total number of counters which are
available for allocation (reserved and unreserved). Fixed a bug where
the reserved counters were subtracted from res_free before any
allocation was performed.

Before this fix, free counters which were not reserved could not be
allocated.

Fixes: 9de92c60be ("net/mlx4_core: Adjust counter grant policy in the resource tracker")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:38:36 -05:00
Eugenia Emantayev 78034f5fdd net/mlx4_en: Fix selftest for small MTUs
Set the minimal MTU threshold for running loopback selftest.
MTU should be big enough to include packet payload, NET_IP_ALIGN,
Ethernet headers and preamble length.

Fixes: e7c1c2c462 ("mlx4_en: Added self diagnostics test implementation")
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 16:38:36 -05:00
Linus Torvalds 7c225c69f8 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2 updates

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
  memory hotplug: fix comments when adding section
  mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
  mm: simplify nodemask printing
  mm,oom_reaper: remove pointless kthread_run() error check
  mm/page_ext.c: check if page_ext is not prepared
  writeback: remove unused function parameter
  mm: do not rely on preempt_count in print_vma_addr
  mm, sparse: do not swamp log with huge vmemmap allocation failures
  mm/hmm: remove redundant variable align_end
  mm/list_lru.c: mark expected switch fall-through
  mm/shmem.c: mark expected switch fall-through
  mm/page_alloc.c: broken deferred calculation
  mm: don't warn about allocations which stall for too long
  fs: fuse: account fuse_inode slab memory as reclaimable
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  mm: mlock: remove lru_add_drain_all()
  mm, sysctl: make NUMA stats configurable
  shmem: convert shmem_init_inodecache() to void
  Unify migrate_pages and move_pages access checks
  mm, pagevec: rename pagevec drained field
  ...
2017-11-15 19:42:40 -08:00
Mel Gorman 453f85d43f mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Linus Torvalds ad0835a930 Updates for 4.15 kernel merge window
- Add iWARP support to qedr driver
 - Lots of misc fixes across subsystem
 - Multiple update series to hns roce driver
 - Multiple update series to hfi1 driver
 - Updates to vnic driver
 - Add kref to wait struct in cxgb4 driver
 - Updates to i40iw driver
 - Mellanox shared pull request
 - timer_setup changes
 - massive cleanup series from Bart Van Assche
 - Two series of SRP/SRPT changes from Bart Van Assche
 - Core updates from Mellanox
 - i40iw updates
 - IPoIB updates
 - mlx5 updates
 - mlx4 updates
 - hns updates
 - bnxt_re fixes
 - PCI write padding support
 - Sparse/Smatch/warning cleanups/fixes
 - CQ moderation support
 - SRQ support in vmw_pvrdma
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaDF9JAAoJELgmozMOVy/dDXUP/i92g+G4OJ+4hHMh4KCjQMHT
 eMr/w9l1C033HrtsU1afPhqHOsKSxwCuJSiTgN4uXIm67/2kPK5Vlx+ir7mbOLwB
 3ukVK6Q/aFdigWCUhIaJSlDpjbd2sEj7JwKtM3rucvMWJlBJ4mAbcVQVfU96CCsv
 V9mO7dpR3QtYWDId9DukfnAfPUPFa3SMZnD7tdl6mKNRg/MjWGYLAL4nJoBfex5f
 b4o+MTrbuFWXYsfDru1m9BpHgyul20ldfcnbe8C/sVOQmOgkX7ngD5Sdi1FLeRJP
 GF/DnAqInC9N7cAxZHx4kH9x6mLMmEdfnwQ9VTVqGUHBsj3H4hQTVIAFfHUhWUbG
 TP5ZHgZG2CewZ0rf092cWlDZwp6n0BalnbQJr+QN4MzPmYbofs3AccSKUwrle+e+
 E6yYf4XxJdt7wRr4F1QKygtUEXSnNkNYUDQ4ZFbpJS/D4Sq80R1ZV/WZ7PJxm1D/
 EIKoi7NU9cbPMIlbCzn8kzgfjS7Pe4p0WW/Xxc/IYmACzpwNPkZuFGSND79ksIpF
 jhHqwZsOWFuXISjvcR4loc8wW6a5w5vjOiX0lLVz0NSdXSzVqav/2at7ZLDx/PT+
 Lh9YVL51akA3hiD+3X6iOhfOUu6kskjT9HijE5T8rJnf0V+C6AtIRpwrQ7ONmjJm
 3JMrjjLxtCIvpUyzCvDW
 =A1oL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is a fairly plain pull request. Lots of driver updates across the
  stack, a huge number of static analysis cleanups including a close to
  50 patch series from Bart Van Assche, and a number of new features
  inside the stack such as general CQ moderation support.

  Nothing really stands out, but there might be a few conflicts as you
  take things in. In particular, the cleanups touched some of the same
  lines as the new timer_setup changes.

  Everything in this pull request has been through 0day and at least two
  days of linux-next (since Stephen doesn't necessarily flag new
  errors/warnings until day2). A few more items (about 30 patches) from
  Intel and Mellanox showed up on the list on Tuesday. I've excluded
  those from this pull request, and I'm sure some of them qualify as
  fixes suitable to send any time, but I still have to review them
  fully. If they contain mostly fixes and little or no new development,
  then I will probably send them through by the end of the week just to
  get them out of the way.

  There was a break in my acceptance of patches which coincides with the
  computer problems I had, and then when I got things mostly back under
  control I had a backlog of patches to process, which I did mostly last
  Friday and Monday. So there is a larger number of patches processed in
  that timeframe than I was striving for.

  Summary:
   - Add iWARP support to qedr driver
   - Lots of misc fixes across subsystem
   - Multiple update series to hns roce driver
   - Multiple update series to hfi1 driver
   - Updates to vnic driver
   - Add kref to wait struct in cxgb4 driver
   - Updates to i40iw driver
   - Mellanox shared pull request
   - timer_setup changes
   - massive cleanup series from Bart Van Assche
   - Two series of SRP/SRPT changes from Bart Van Assche
   - Core updates from Mellanox
   - i40iw updates
   - IPoIB updates
   - mlx5 updates
   - mlx4 updates
   - hns updates
   - bnxt_re fixes
   - PCI write padding support
   - Sparse/Smatch/warning cleanups/fixes
   - CQ moderation support
   - SRQ support in vmw_pvrdma"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (296 commits)
  RDMA/core: Rename kernel modify_cq to better describe its usage
  IB/mlx5: Add CQ moderation capability to query_device
  IB/mlx4: Add CQ moderation capability to query_device
  IB/uverbs: Add CQ moderation capability to query_device
  IB/mlx5: Exposing modify CQ callback to uverbs layer
  IB/mlx4: Exposing modify CQ callback to uverbs layer
  IB/uverbs: Allow CQ moderation with modify CQ
  iw_cxgb4: atomically flush the qp
  iw_cxgb4: only call the cq comp_handler when the cq is armed
  iw_cxgb4: Fix possible circular dependency locking warning
  RDMA/bnxt_re: report vlan_id and sl in qp1 recv completion
  IB/core: Only maintain real QPs in the security lists
  IB/ocrdma_hw: remove unnecessary code in ocrdma_mbx_dealloc_lkey
  RDMA/core: Make function rdma_copy_addr return void
  RDMA/vmw_pvrdma: Add shared receive queue support
  RDMA/core: avoid uninitialized variable warning in create_udata
  RDMA/bnxt_re: synchronize poll_cq and req_notify_cq verbs
  RDMA/bnxt_re: Flush CQ notification Work Queue before destroying QP
  RDMA/bnxt_re: Set QP state in case of response completion errors
  RDMA/bnxt_re: Add memory barriers when processing CQ/EQ entries
  ...
2017-11-15 14:54:53 -08:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Slava Shwartsman a1b8714593 net/mlx4: Use Kconfig flag to remove support of old gen2 Mellanox devices
Since Mellanox focus is on newer adapters, we would like to have the
ability to disable the support for old gen2 adapters.

This can be done by turning off the MLX4_CORE_GEN2 Kconfig flag.
We keep it turned on by default.

Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13 10:27:51 +09:00
Nogah Frankel 575ed7d39e net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO
Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new
convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Ingo Molnar 8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Jakub Kicinski f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
David S. Miller 2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Mark Rutland 6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
Elena Reshetova 17ac99b2b8 drivers, net, mlx4: convert mlx4_srq.refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_srq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:22:38 +01:00
Elena Reshetova 0068895ff8 drivers, net, mlx4: convert mlx4_qp.refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_qp.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:22:38 +01:00
Elena Reshetova ff61b5e3f0 drivers, net, mlx4: convert mlx4_cq.refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable mlx4_cq.refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:22:38 +01:00
Tariq Toukan f025fd6061 net/mlx4_en: XDP_TX, assign constant values of TX descs on ring creaion
In XDP_TX, some fields in tx_info and tx_desc are constants across
all entries of the different XDP_TX rings.
Assign values to these fields on ring creation time, rather than in
data-path.

Patchset performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
------------------------------
Before    | After     | Gain |
13.7 Mpps | 14.0 Mpps | %2.2 |
------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:21:23 -07:00
Tariq Toukan f6f0aa9741 net/mlx4_en: Obsolete call to generic write_desc in XDP xmit flow
Function mlx4_en_tx_write_desc() is not optimized to use of XDP xmit.
Use the relevant parts inline instead.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:21:23 -07:00
Tariq Toukan 5dad61b838 net/mlx4_en: Replace netdev parameter with priv in XDP xmit function
The struct net_device parameter was passed only to extract
struct mlx4_en_priv out of it.
Here we pass the priv parameter directly.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-11 20:21:23 -07:00
Inbar Karmy 80a8dc75ee net/mlx4_en: Increase number of default RX rings
Remove limitation of netif_get_num_default_rss_queues()
from logic of RX rings default number.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 13:11:22 -07:00
Inbar Karmy b8d394367a net/mlx4_en: Limit the number of RX rings
Limit the number of RX rings by the number of cores
in the system.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 13:11:22 -07:00
Inbar Karmy 7e1dc5e926 net/mlx4_en: Limit the number of TX rings
Limit the number of TX rings per UP by the number of cores
in the system.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 13:11:22 -07:00
Tariq Toukan 7ba5e7bd64 net/mlx4_en: Use __force to fix a sparse warning in TX datapath
In TX data-path, we intentionally do not byte-swap, as documented
in code and in the cited commit log.
This fixes sparse warning:
en_tx.c:720:23: warning: incorrect type in argument 1 (different base types)
en_tx.c:720:23:    expected unsigned int [unsigned] [usertype] <noident>
en_tx.c:720:23:    got restricted __be32 [usertype] doorbell_qpn

Fixes: 492f5add4b ("net/mlx4_en: Doorbell is byteswapped in Little Endian archs")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:33:05 -07:00
Tariq Toukan b71322d9db net/mlx4_core: Fix cast warning in fw.c
Fix the following SPARSE warning, in MLX4_GET() macro:
drivers/net/ethernet/mellanox/mlx4/fw.c:233:9: warning: cast to restricted __be64

Fixes: 17d5ceb6e4 ("net/mlx4_core: Fix unaligned accesses")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:33:05 -07:00
Tariq Toukan bb428a5c4d net/mlx4: Fix endianness issue in qp context params
Should take care of the endianness before assigning to params2 field.

Fixes: 53f33ae295 ("net/mlx4_core: Port aggregation upper layer interface")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:33:05 -07:00
Kees Cook 55c0fcc3de net/mlx4_core: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-10-09 12:19:41 -04:00
Daniel Borkmann de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
Allen Pais d2a0012e76 drivers: net: mlx4: use setup_timer() helper.
Use setup_timer function instead of initializing timer with the
    function and data fields.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-21 11:44:42 -07:00
Linus Torvalds aae3dbb477 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
2017-09-06 14:45:08 -07:00
Thomas Meyer 691223ec97 net/mlx4_core: Use ARRAY_SIZE macro
Use ARRAY_SIZE macro, rather than explicitly coding some variant of it
yourself.
Found with: find -type f -name "*.c" -o -name "*.h" | xargs perl -p -i -e
's/\bsizeof\s*\(\s*(\w+)\s*\)\s*\ /\s*sizeof\s*\(\s*\1\s*\[\s*0\s*\]\s*\)
/ARRAY_SIZE(\1)/g' and manual check/verification.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-05 11:49:16 -07:00
Linus Torvalds aa9d4648c2 Updates for 4.14 kernel merge window
- Lots of hfi1 driver updates (mixed with a few qib and core updates as
   well)
 - rxe updates
 - various mlx updates
 - Set default roce type to RoCEv2
 - Several larger fixes for bnxt_re that were too big for -rc
 - Several larger fixes for qedr that, likewise, were too big for -rc
 - Misc core changes
 - Make the hns_roce driver compilable on arches other than aarch64 so we
   can more easily debug build issues related to it
 - Add rdma-netlink infrastructure updates
 - Add automatic IRQ affinity infrastructure
 - Add 32bit lid support
 - Lots of misc fixes across the subsystem from random people
 - Autoloading of RDMA netlink modules
 - PCI pool cleanups from Romain Perier
 - mlx5 driver feature additions and fixes
 - Hardware tag matchine feature
 - Fix sleeping in atomic when resolving roce ah
 - Add experimental ioctl interface as posted to linux-api@
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZqBDtAAoJELgmozMOVy/dNlcQAJhYNRGaNUBx0L6+8t2xwUrt
 7ndP6qlMar30DJY9FjTQCzRBw0CRMWkXdJD8rYlyaHy07pjWDKG8LZtxEXu1FLdZ
 oNRvQX6ZJh8Bz7db2SQFBCTF2uWGZZFqWQCrSbQwjj9xxjMDs59u/knmwHVY9dKk
 egjPG4IQBDmcTeNY7h1otG2hXpx7QPIOilQW2EFN5SWAuBAazdF2JKxjjxqhnUfp
 gD2pSdgsm3VSMoo0zpMa6qOP+9GcOu8J97fYFhasRYWCavPdWHyq+XNu9S/eicRd
 xbv+seCYM+9jPb2dsNdjEKll7w3yyWdu7h6tSCMPYv54eN9sDDiO1w2L2ZnESMZa
 JRnSfB+HXru1r4RyHOTPO8peaNhYlR1V4u8bTS5G2dffbHis9BajkWoAR/oSiUcB
 AIjIIDcdJFVGfpF9KIt/pEl+adHNgESibSijzOUYkyw6RNbPqDmdd7YakPHcQhKN
 clE3zQfIsPRLWsToP/nkBE0tUd3tQocRuLy7ote7hXQK+0p7TBz0a6Kkj87MvX33
 8dVbUI+q6WRlEY90l71y0ZdXy/AvkxkFxAc4Y7FQZyJxhEArTaKgfa5fmpRwVxBm
 yi9baoYCspHNRNv6AO4IL86ZCJqmWBuch8CBY1n2X3h8IGfKYEZUAZ+T/mnTTeUq
 A4joXduz94ZD4w23leD1
 =2ntC
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is a big pull request.

  Of note is that I'm sending you the new ioctl API for the rdma
  subsystem. We put it up on linux-api@, but didn't get much response.
  The API is complex, but it solves two different problems in one go:

   1) The bi-directional nature of the RDMA file write calls, which
      created the security hole we had to handle (and for which the fix
      is now causing problems for systems in production, we were a bit
      over zealous in the fix and the ability to open a device, then
      fork, then create new queue pairs on the device and use them is
      broken).

   2) The bloat caused by different vendors implementing extensions to
      the base verbs API. Each vendor's hardware is slightly different,
      and the hardware might be suitable for one extension but not
      another.

      By the time we add generic extensions for all the different ways
      that the different hardware can offload things, the API becomes
      bloated. Things like our completion structs have started to exceed
      a cache line in size because of all the elements needed to support
      this. That in turn shows up heavily in the performance graphs with
      a noticable drop in performance on 100Gigabit links as our
      completion structs go from occupying one cache line to 1+.

      This API makes things like the completion structs modular in a
      very similar way to netlink so that your structs can only include
      the items needed for the offloads/features you are actually using
      on a given queue pair. In that way we support everything, but only
      use what we need, and our structs stay smaller.

  The ioctl API is better explained by the posting on linux-api@ than I
  can explain it here, so I'll just leave it at that.

  The rest of the pull request is typical stuff.

  Updates for 4.14 kernel merge window

   - Lots of hfi1 driver updates (mixed with a few qib and core updates
     as well)

   - rxe updates

   - various mlx updates

   - Set default roce type to RoCEv2

   - Several larger fixes for bnxt_re that were too big for -rc

   - Several larger fixes for qedr that, likewise, were too big for -rc

   - Misc core changes

   - Make the hns_roce driver compilable on arches other than aarch64 so
     we can more easily debug build issues related to it

   - Add rdma-netlink infrastructure updates

   - Add automatic IRQ affinity infrastructure

   - Add 32bit lid support

   - Lots of misc fixes across the subsystem from random people

   - Autoloading of RDMA netlink modules

   - PCI pool cleanups from Romain Perier

   - mlx5 driver feature additions and fixes

   - Hardware tag matchine feature

   - Fix sleeping in atomic when resolving roce ah

   - Add experimental ioctl interface as posted to linux-api@"

* tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits)
  IB/core: Expose ioctl interface through experimental Kconfig
  IB/core: Assign root to all drivers
  IB/core: Add completion queue (cq) object actions
  IB/core: Add legacy driver's user-data
  IB/core: Export ioctl enum types to user-space
  IB/core: Explicitly destroy an object while keeping uobject
  IB/core: Add macros for declaring methods and attributes
  IB/core: Add uverbs merge trees functionality
  IB/core: Add DEVICE object and root tree structure
  IB/core: Declare an object instead of declaring only type attributes
  IB/core: Add new ioctl interface
  RDMA/vmw_pvrdma: Fix a signedness
  RDMA/vmw_pvrdma: Report network header type in WC
  IB/core: Add might_sleep() annotation to ib_init_ah_from_wc()
  IB/cm: Fix sleeping in atomic when RoCE is used
  IB/core: Add support to finalize objects in one transaction
  IB/core: Add a generic way to execute an operation on a uobject
  Documentation: Hardware tag matching
  IB/mlx5: Support IB_SRQT_TM
  net/mlx5: Add XRQ support
  ...
2017-09-03 17:49:17 -07:00
Colin Ian King 942e7e5fc1 net/mlx4_core: fix incorrect size allocation for dev->caps.spec_qps
The current allocation for dev->caps.spec_qps is for the size of the
pointer and not the size of the actual  mlx4_spec_qps structure.  Fix
this by using the correct size.   Also splint allocation over a few
lines to make it cppcheck clean on overly wide lines.

Detected by CoverityScan, CID#1455222 ("Wrong sizeof argument")

Fixes: c73c8b1e47 ("net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03 10:57:10 -07:00
Colin Ian King 542deb88b0 net/mlx4_core: fix memory leaks on error exit path
The structures hca_param and func_cap are not being kfree'd on an error
exit path causing two memory leaks. Fix this by jumping to the existing
free memory error exit path.

Detected by CoverityScan, CID#1455219, CID#1455224 ("Resource Leak")

Fixes: c73c8b1e47 ("net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03 10:57:10 -07:00
Moshe Shemesh be59960395 net/mlx4: Add user mac FW update support
Adding support for updating the FW on new port mac, when port mac change
is requested by the user. This info is required by the FW as OEM
management tools require this info directly from the NIC FW.
Check device capability bit to verify the FW supports user mac.
If the FW does support it, use set_port command to notify the FW on the
new mac.
The feature is relevant only to PF port mac.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 14:58:32 -07:00
Tariq Toukan a434f1fd2c net/mlx4_core: Fix misplaced brackets of sizeof
When changing the sizeof style usage in the patch cited below,
one brackets misplacement was introduced. Here we fix it.

Fixes: 31975e27a4 ("mlx4: sizeof style usage")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 14:58:32 -07:00
Leon Romanovsky 187782eb58 net/mlx4_core: Make explicit conversion to 64bit value
The "lg" variable is declared as int so in all places where this variable
is used as a shift operand, the output will be int too.

This produces the following smatch warning:
drivers/net/ethernet/mellanox/mlx4/fw.c:1532 mlx4_map_cmd() warn:
	should '1 << lg' be a 64 bit type?

Simple declaration of "1" to be "1ULL" will fix the issue.

Fixes: 225c7b1fee ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 14:58:32 -07:00
Eran Ben Elisha c73c8b1e47 net/mlx4_core: Dynamically allocate structs at mlx4_slave_cap
In order to avoid temporary large structs on the stack,
allocate them dynamically.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tal Alon <talal@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 14:58:32 -07:00
Bhumika Goyal 3f2c5fb2d8 net/mlx4_core: make mlx4_profile const
Make these const as they are only used in a copy operation.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 12:33:31 -07:00
Romain Perier b9f761aa78 mlx4: Replace PCI pool old API
The PCI pool API is deprecated. This commit replaces the PCI pool old
API by the appropriate function with the DMA pool API.

Signed-off-by: Romain Perier <romain.perier@collabora.com>
Acked-by: Peter Senna Tschudin <peter.senna@collabora.com>
Tested-by: Peter Senna Tschudin <peter.senna@collabora.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-22 13:59:46 -04:00
David S. Miller e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Huy Nguyen ca3d89a3eb net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
enable_4k_uar module parameter was added in patch cited below to
address the backward compatibility issue in SRIOV when the VM has
system's PAGE_SIZE uar implementation and the Hypervisor has 4k uar
implementation.

The above compatibility issue does not exist in the non SRIOV case.
In this patch, we always enable 4k uar implementation if SRIOV
is not enabled on mlx4's supported cards.

Fixes: 76e39ccf9c ("net/mlx4_core: Fix backward compatibility on VFs")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 16:15:37 -07:00
Colin Ian King ba5c4dac03 net/mlx4: fix spelling mistake: "availible" -> "available"
Trivial fix to spelling mistakes in the mlx4 driver

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 14:23:45 -07:00
stephen hemminger 31975e27a4 mlx4: sizeof style usage
The kernel coding style is to treat sizeof as a function
(ie. with parenthesis) not as an operator.

Also use kcalloc and kmalloc_array

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 11:01:57 -07:00
Zhu Yanjun e084a8b89c mlx4: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not necessary to manually clear the
device driver data to NULL.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 16:46:48 -07:00
David S. Miller 3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Davide Caratti e718fe450e net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
checksum is successful, the driver subtracts the pseudo-header checksum
from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.

V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set

Reported-by: Shuang Li <shuali@redhat.com>
Fixes: f8c6455bb0 ("net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 17:59:57 -07:00
Jiri Pirko de4784ca03 net: sched: get rid of struct tc_to_netdev
Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko 38cf0426e5 net: sched: change return value of ndo_setup_tc for driver supporting mqprio only
Change the return value from -EINVAL to -EOPNOTSUPP. The rest of the
drivers have it like that, so be aligned.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko 5fd9fc4e20 net: sched: push cls related args into cls_common structure
As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko 2572ac53c4 net: sched: make type an argument for ndo_setup_tc
Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jack Morgenstein bff0c6840c net/mlx4_core: Fixes missing capability bit in flags2 capability dump
The cited commit introduced the following new enum value in file
include/linux/mlx4/device.h:

    QUERY_DEV_CAP_DIAG_RPRT_PER_PORT

However, it failed to introduce a corresponding entry in function
dump_dev_cap_flags2() for outputting a line in the message log
when this capability bit is set.

The change here fixes that omission.

Fixes: c7c122ed67 ("net/mlx4: Add diagnostic counters capability bit")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:44:09 -07:00
Jack Morgenstein f9fb9d0b85 net/mlx4_core: Fix namespace misalignment in QinQ VST support commit
The cited commit introduced the following new enum value in file
include/linux/mlx4/device.h:

    MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP

However the value of MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP needs to stay
consistent with the value used in another namespace in
function dump_dev_cap_flags2(), which is manually kept in sync.
The change here restores that consistency.

Fixes: 7c3d21c815 ("net/mlx4_core: Preparation for VF vlan protocol 802.1ad")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:44:09 -07:00
Jack Morgenstein 5886259c12 net/mlx4_core: Fix sl_to_vl_change bit offset in flags2 dump
The index value in function dump_dev_cap_flags2() for outputting
"sl to vl mapping table change event support" needs to be
consistent with the value of the enumerated constant
MLX4_DEV_CAP_FLAG2_SL_TO_VL_CHANGE_EVENT defined in file
include/linux/mlx4_device.h

The change here restores that consistency.

Fixes: fd10ed8e6f ("IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets")
Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:44:09 -07:00
Inbar Karmy c994f778bb net/mlx4_en: Fix wrong indication of Wake-on-LAN (WoL) support
Currently when WoL is supported but disabled, ethtool reports:
"Supports Wake-on: d".
Fix the indication of Wol support, so that the indication
remains "g" all the time if the NIC supports WoL.

Tested:
As accepted, when NIC supports WoL- ethtool reports:
	Supports Wake-on: g
	Wake-on: d
when NIC doesn't support WoL- ethtool reports:
        Supports Wake-on: d
        Wake-on: d

Fixes: 14c07b1358 ("mlx4: Wake on LAN support")
Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-02 10:44:09 -07:00
Zhu Yanjun f575a02ee7 mlx4_en: remove unnecessary error check
The function mlx4_en_get_profile always return zero. So it is not
necessary to check its return value.

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:26:32 -07:00
Zhu Yanjun f3eebe8819 mlx4_en: remove unnecessary returned value
The function mlx4_en_arm_cq always returns zero. So change the
return type of the function mlx4_en_arm_cq to void.

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 16:29:55 -07:00
Moshe Shemesh f330187016 (IB, net)/mlx4: Add resource utilization support
Adding visibility of resource usage of QPs, CQs and counters used by
virtual functions. This feature will be used to give the PF administrator
more data while debugging VF status. Usage info was added to ALLOC_RES
command, to notify the PF if the resource which is being reserved or
allocated for the VF will be used by kernel driver or by user verbs.

Updated reservation and allocation functions of QP, CQ and counter with
additional usage parameter.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-07-24 10:41:35 -04:00
Linus Torvalds 96080f6977 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF verifier signed/unsigned value tracking fix, from Daniel
    Borkmann, Edward Cree, and Josef Bacik.

 2) Fix memory allocation length when setting up calls to
    ->ndo_set_mac_address, from Cong Wang.

 3) Add a new cxgb4 device ID, from Ganesh Goudar.

 4) Fix FIB refcount handling, we have to set it's initial value before
    the configure callback (which can bump it). From David Ahern.

 5) Fix double-free in qcom/emac driver, from Timur Tabi.

 6) A bunch of gcc-7 string format overflow warning fixes from Arnd
    Bergmann.

 7) Fix link level headroom tests in ip_do_fragment(), from Vasily
    Averin.

 8) Fix chunk walking in SCTP when iterating over error and parameter
    headers. From Alexander Potapenko.

 9) TCP BBR congestion control fixes from Neal Cardwell.

10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.

11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
    Wang.

12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
    from Gao Feng.

13) Cannot release skb->dst in UDP if IP options processing needs it.
    From Paolo Abeni.

14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
    Levin and myself.

15) Revert some rtnetlink notification changes that are causing
    regressions, from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
  net: bonding: Fix transmit load balancing in balance-alb mode
  rds: Make sure updates to cp_send_gen can be observed
  net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
  ipv4: initialize fib_trie prior to register_netdev_notifier call.
  rtnetlink: allocate more memory for dev_set_mac_address()
  net: dsa: b53: Add missing ARL entries for BCM53125
  bpf: more tests for mixed signed and unsigned bounds checks
  bpf: add test for mixed signed and unsigned bounds checks
  bpf: fix up test cases with mixed signed/unsigned bounds
  bpf: allow to specify log level and reduce it for test_verifier
  bpf: fix mixed signed/unsigned derived min/max value bounds
  ipv6: avoid overflow of offset in ip6_find_1stfragopt
  net: tehuti: don't process data if it has not been copied from userspace
  Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
  net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
  dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
  NET: dwmac: Make dwmac reset unconditional
  net: Zero terminate ifr_name in dev_ifname().
  wireless: wext: terminate ifr name coming from userspace
  netfilter: fix netfilter_net_init() return
  ...
2017-07-20 16:33:39 -07:00
Leon Romanovsky 8900b894e7 {net, IB}/mlx4: Remove gfp flags argument
The caller to the driver marks GFP_NOIO allocations with help
of memalloc_noio-* calls now. This makes redundant to pass down
to the driver gfp flags, which can be GFP_KERNEL only.

The patch removes the gfp flags argument and updates all driver paths.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-07-17 21:21:24 -04:00
Zhu Yanjun e36fef66f4 mlx4_en: remove unnecessary returned value check
The function __mlx4_zone_remove_one_entry always returns zero. So
it is not necessary to check it.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:29:49 -07:00
Zhu Yanjun 3b68067bd2 mlx4_en: make mlx4_log_num_mgm_entry_size static
The variable mlx4_log_num_mgm_entry_size is only called in main.c.

CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:41:26 -07:00
Inbar Karmy ec327f7a43 net/mlx4_en: Do not allocate redundant TX queues when TC is disabled
Currently the number of TX queues that are allocated doesn't depend
on the number of TCs, the module always loads with max num of UP
per channel.
In order to prevent the allocation of unnecessary memory, the
module will load with minimum number of UPs per channel, and the
user will be able to control the number of TX queues per channel
by changing the number of TC to 8 using the tc command.
The variable num_up will hold the information about the current
number of UPs.
Due to the change, needed to remove the lines that set the value of
UP to be different than zero in the func "mlx4_en_select_queue",
since now the num of TX queues that are allocated is only one per channel
in default.
In order not to force the UP to be zero in case of only one TC, added
a condition before forcing it in the func "mlx4_en_fill_qp_context".

Tested:
After the module is loaded with minimum number of UP per channel, to
increase num of TCs to 8, use:
tc qdisc add dev ens8 root mqprio num_tc 8
In order to decrease the number of TCs to minimum number of UP per channel,
use:
tc qdisc del dev ens8 root

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:56:15 -04:00
Inbar Karmy f21ad61424 net/mlx4_en: Add dynamic variable to hold the number of user priorities (UP)
Until this patch, the number of UPs was hard coded for eight.
Replace this with a variable in struct "mlx4_en_port_profile".
Currently, the variable will hold the maximum number of UP,
as before.
The patch creates an infrastructure to add an option for dynamic
change of the actual number of TCs.

Signed-off-by: Inbar Karmy <inbark@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Tarick Bedeir <tarick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:56:15 -04:00
Colin Ian King 46ccf725bf net/mlx4: fix spelling mistake: "enforcment" -> "enforcement"
Trivial fix to spelling mistake in mlx4_dbg debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 12:25:01 -04:00
Colin Ian King 593814d1be net/mlx4: fix spelling mistake: "coalesing" -> "coalescing"
Trivial fix to spelling mistake in en_dbg debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:18:29 -04:00
Martin KaFai Lau 2e37e9b0f5 bpf: mlx4: Report bpf_prog ID during XDP_QUERY_PROG
Add support to mlx4 to report bpf_prog ID during XDP_QUERY_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:58:36 -04:00
Johannes Berg 4df864c1d9 networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Tariq Toukan 4c07c13240 net/mlx4_en: Refactor mlx4_en_free_tx_desc
Some code re-ordering, functionally equivalent.

- The !tx_info->inl check is evaluated anyway in both flows
  (common case/end case). Run it first, this might finish
  the flows earlier.
- dma_unmap calls are identical in both flows, get it out
  of the if block into the common area.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan 9573e0d39f net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
operation instead of a multiplication with TXBB_SIZE.
Operations are equivalent as TXBB_SIZE is a power of two.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan 77788b5bf6 net/mlx4_en: Increase default TX ring size
Increase the default TX ring size (from 512 to 1024) to match
the RX ring size.
This gives the XDP TX ring a better chance to keep up with the
rate of its RX ring in case of a high load of XDP_TX actions.

Tested:
Ethtool counter rx_xdp_tx_full used to increase, after applying this
patch it stopped.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan 6c78511b05 net/mlx4_en: Poll XDP TX completion queue in RX NAPI
Instead of having their own NAPIs, XDP TX completion queues get
polled within the corresponding RX NAPI.
This prevents any possible race on TX ring prod/cons indices,
between the context that issues the transmits (RX NAPI) and the
context that handles the completions (was previously done in
a separate NAPI).

This also improves performance, as it decreases the number
of NAPIs running on a CPU, saving the overhead of syncing
and switching between the contexts.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan 36ea796498 net/mlx4_en: Improve XDP xmit function
Several performance improvements in XDP TX datapath,
including:
- Ring a single doorbell for XDP TX ring per NAPI budget,
  instead of doing it per a lower threshold (was 8).
  This includes removing the flow of immediate doorbell ringing
  in case of a full TX ring.
- Compiler branch predictor hints.
- Calculate values in compile time rather than in runtime.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan f28186d6b5 net/mlx4_en: Improve stack xmit function
Several small code and performance improvements in stack TX datapath,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- Move tx_info non-inline flow handling to a separate function.
- Calculate data_offset in compile time rather than in runtime
  (for !lso_header_size branch).
- Avoid trinary-operator ("?") when value can be preset in a matching
  branch.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan cc26a49086 net/mlx4_en: Improve transmit CQ polling
Several small performance improvements in TX CQ polling,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- More proper check of cq type.
- Use boolean instead of int for a binary indication.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Packet-rate tests for both regular stack and XDP use cases:
No noticeable gain, no degradation.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Tariq Toukan 9bcee89ac4 net/mlx4_en: Improve receive data-path
Several small performance improvements in RX datapath,
including:
- Compiler branch predictor hints.
- Replace a multiplication with a shift operation.
- Minimize variables scope.
- Write-prefetch for packet header.
- Avoid trinary-operator ("?") when value can be preset in a matching
  branch.
- Save a branch by updating RX ring doorbell within
  mlx4_en_refill_rx_buffers(), which now returns void.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON
(enable by ethtool -L <interface> rx 1).

XDP_DROP packet rate:
Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).

Drop packets in TC:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
-------------------------------------

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:23 -04:00
Saeed Mahameed 4931c6ef04 net/mlx4_en: Optimized single ring steering
Avoid touching RX QP RSS context when loading with only
one RX ring, to allow optimized A0 RX steering.

Enable by:
- loading mlx4_core with module param: log_num_mgm_entry_size = -6.
- then: ethtool -L <interface> rx 1

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_DROP packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
-------------------------------------

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:22 -04:00
Tariq Toukan cf97050d54 net/mlx4_en: Remove unused argument in TX datapath function
Remove owner argument, as it is obsolete and unused.
This also saves the overhead of calculating its value in data-path.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 22:53:22 -04:00
Jiri Pirko a5fcf8a6c9 net: propagate tc filter chain index down the ndo_setup_tc call
We need to push the chain index down to the drivers, so they have the
information to which chain the rule belongs. For now, no driver supports
multichain offload, so only chain 0 is supported. This is needed to
prevent chain squashes during offload for now. Later this will be used
to implement multichain offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 09:55:53 -04:00
Tariq Toukan 808df6a209 net/mlx4_en: Bump driver version
Remove date and bump version for mlx4_en driver.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:33:01 -04:00
Tariq Toukan cea2a6d81d net/mlx4_core: Bump driver version
Remove date and bump version for mlx4_core driver.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:33:00 -04:00
David S. Miller 216fe8f021 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just some simple overlapping changes in marvell PHY driver
and the DSA core code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 22:20:08 -04:00
Ido Shamay 269f9883fe net/mlx4: Check if Granular QoS per VF has been enabled before updating QP qos_vport
The Granular QoS per VF feature must be enabled in FW before it can be
used.

Thus, the driver cannot modify a QP's qos_vport value (via the UPDATE_QP FW
command) if the feature has not been enabled -- the FW returns an error if
this is attempted.

Fixes: 08068cd568 ("net/mlx4: Added qos_vport QP configuration in VST mode")
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-05 11:29:54 -04:00
Talat Batheesh 6dc06c08be net/mlx4: Fix the check in attaching steering rules
Our previous patch (cited below) introduced a regression
for RAW Eth QPs.

Fix it by checking if the QP number provided by user-space
exists, hence allowing steering rules to be added for valid
QPs only.

Fixes: 89c557687a ("net/mlx4_en: Avoid adding steering rules with invalid ring")
Reported-by: Or Gerlitz <gerlitz.or@gmail.com>
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 23:10:05 -04:00
Miroslav Lichvar e341257548 net: ethernet: update drivers to handle HWTSTAMP_FILTER_NTP_ALL
Include HWTSTAMP_FILTER_NTP_ALL in net_hwtstamp_validate() as a valid
filter and update drivers which can timestamp all packets, or which
explicitly list unsupported filters instead of using a default case, to
handle the filter.

CC: Richard Cochran <richardcochran@gmail.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:37:32 -04:00
yuval.shaia@oracle.com 4762010f09 net/mlx4_core: Use min3 to select number of MSI-X vectors
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-15 14:20:26 -04:00
Linus Torvalds 50fb55d88c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix multiqueue in stmmac driver on PCI, from Andy Shevchenko.

 2) cdc_ncm doesn't actually fully zero out the padding area is
    allocates on TX, from Jim Baxter.

 3) Don't leak map addresses in BPF verifier, from Daniel Borkmann.

 4) If we randomize TCP timestamps, we have to do it everywhere
    including SYN cookies. From Eric Dumazet.

 5) Fix "ethtool -S" crash in aquantia driver, from Pavel Belous.

 6) Fix allocation size for ntp filter bitmap in bnxt_en driver, from
    Dan Carpenter.

 7) Add missing memory allocation return value check to DSA loop driver,
    from Christophe Jaillet.

 8) Fix XDP leak on driver unload in qed driver, from Suddarsana Reddy
    Kalluru.

 9) Don't inherit MC list from parent inet connection sockets, another
    syzkaller spotted gem. Fix from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  dccp/tcp: do not inherit mc_list from parent
  qede: Split PF/VF ndos.
  qed: Correct doorbell configuration for !4Kb pages
  qed: Tell QM the number of tasks
  qed: Fix VF removal sequence
  qede: Fix XDP memory leak on unload
  net/mlx4_core: Reduce harmless SRIOV error message to debug level
  net/mlx4_en: Avoid adding steering rules with invalid ring
  net/mlx4_en: Change the error print to debug print
  drivers: net: wimax: i2400m: i2400m-usb: Use time_after for time comparison
  DECnet: Use container_of() for embedded struct
  Revert "ipv4: restore rt->fi for reference counting"
  net: mdio-mux: bcm-iproc: call mdiobus_free() in error path
  net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
  ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
  net: cdc_ncm: Fix TX zero padding
  stmmac: pci: split out common_default_data() helper
  stmmac: pci: RX queue routing configuration
  stmmac: pci: TX and RX queue priority configuration
  stmmac: pci: set default number of rx and tx queues
  ...
2017-05-09 15:42:31 -07:00
Jack Morgenstein 83bd5118a1 net/mlx4_core: Reduce harmless SRIOV error message to debug level
Under SRIOV resource management, extra counters are allocated to VFs
from a free pool. If that pool is empty, the ALLOC_RES command for
a counter resource fails -- and this generates a misleading error
message in the message log.

Under SRIOV, each VF is allocated (i.e., guaranteed) 2 counters --
one counter per port. For ETH ports, the RoCE driver requests an
additional counter (above the guaranteed counters). If that request
fails, the VF RoCE driver simply uses the default (i.e., guaranteed)
counter for that port.

Thus, failing to allocate an additional counter does not constitute
a  problem, and the error message on the PF when this occurs should
be reduced to debug level.

Finally, to identify the situation that the reason for the failure is
that no resources are available to grant to the VF, we modified the
error returned by mlx4_grant_resource to -EDQUOT (Quota exceeded),
which more accurately describes the error.

Fixes: c3abb51bdb ("IB/mlx4: Add RoCE/IB dedicated counters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 11:22:46 -04:00
Talat Batheesh 89c557687a net/mlx4_en: Avoid adding steering rules with invalid ring
Inserting steering rules with illegal ring is an invalid operation,
block it.

Fixes: 820672812f ('net/mlx4_en: Manage flow steering rules with ethtool')
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 11:22:46 -04:00
Kamal Heib 505a9249c2 net/mlx4_en: Change the error print to debug print
The error print within mlx4_en_calc_rx_buf() should be a debug print.

Fixes: 51151a16a6 ('mlx4: allow order-0 memory allocations in RX path')
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-09 11:22:46 -04:00
Michal Hocko 752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Greg Thelen 8dc7d11f1d net/mlx4: suppress 'may be used uninitialized' warning
gcc 4.8.4 complains that mlx4_SW2HW_MPT_wrapper() uses an uninitialized
'mpt' variable:
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_SW2HW_MPT_wrapper':
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2802:12: warning: 'mpt' may be used uninitialized in this function [-Wmaybe-uninitialized]
     mpt->mtt = mtt;

I think this warning is a false complaint.  mpt is only used when
mr_res_start_move_to() return zero, and in all such cases it initializes
mpt.  But apparently gcc cannot see that.

Initialize mpt to avoid the warning.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 13:19:21 -04:00
Eric Dumazet 75d04aa368 mlx4: trust shinfo->gso_segs
mlx4 is the only driver in the tree making a point to recompute
shinfo->gso_segs.

Lets remove superfluous code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 13:27:49 -07:00
David S. Miller 16ae1f2236 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmmii.c
	drivers/net/hyperv/netvsc.c
	kernel/bpf/hashtab.c

Almost entirely overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-23 16:41:27 -07:00
Jack Morgenstein 4cbe4dac82 net/mlx4_core: Avoid delays during VF driver device shutdown
Some Hypervisors detach VFs from VMs by instantly causing an FLR event
to be generated for a VF.

In the mlx4 case, this will cause that VF's comm channel to be disabled
before the VM has an opportunity to invoke the VF device's "shutdown"
method.

For such Hypervisors, there is a race condition between the VF's
shutdown method and its internal-error detection/reset thread.

The internal-error detection/reset thread (which runs every 5 seconds) also
detects a disabled comm channel. If the internal-error detection/reset
flow wins the race, we still get delays (while that flow tries repeatedly
to detect comm-channel recovery).

The cited commit fixed the command timeout problem when the
internal-error detection/reset flow loses the race.

This commit avoids the unneeded delays when the internal-error
detection/reset flow wins.

Fixes: d585df1c5c ("net/mlx4_core: Avoid command timeouts during VF driver device shutdown")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reported-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:14:51 -07:00
Amritha Nambiar 56f36acd21 mqprio: Modify mqprio to pass user parameters via ndo_setup_tc.
The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.

This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.

Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 15:20:27 -07:00
Eric Dumazet 68b8df4644 mlx4: remove duplicate code in mlx4_en_process_rx_cq()
We should keep one way to build skbs, regardless of GRO being on or off.

Note that I made sure to defer as much as possible the point we need to
pull data from the frame, so that future prefetch() we might add
are more effective.

These skb attributes derive from the CQE or ring :
 ip_summed, csum
 hash
 vlan offload
 hwtstamps
 queue_mapping

As a bonus, this patch removes mlx4 dependency on eth_get_headlen()
which is very often broken enough to give us headaches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 6969cf0fdb mlx4: make validate_loopback() more generic
Testing a boolean in fast path is not worth duplicating
the code allocating packets, when GRO is on or off.

If this proves to be a problem, we might later use a jump label.

Next patch will remove this duplicated code and ease code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 02e6fd3e55 mlx4: factorize page_address() calls
We need to compute the frame virtual address at different points.
Do it once.

Following patch will use the new va address for validate_loopback()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 9e8c0395a7 mlx4: do not access rx_desc from mlx4_en_process_rx_cq()
Instead of fetching dma address from rx_desc->data[0].addr,
prefer using frags[0].dma + frags[0].page_offset to avoid
a potential cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 7d7bfc6a3f mlx4: add rx_alloc_pages counter in ethtool -S
This new counter tracks number of pages that we allocated for one port.

lpaa24:~# ethtool -S eth0 | egrep 'rx_alloc_pages|rx_packets'
     rx_packets: 306755183
     rx_alloc_pages: 932897

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 34db548bfb mlx4: add page recycling in receive path
Same technique than some Intel drivers, for arches where PAGE_SIZE = 4096

In most cases, pages are reused because they were consumed
before we could loop around the RX ring.

This brings back performance, and is even better,
a single TCP flow reaches 30Gbit on my hosts.

v2: added full memset() in mlx4_en_free_frag(), as Tariq found it was needed
if we switch to large MTU, as priv->log_rx_info can dynamically be changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet b5a54d9a31 mlx4: use order-0 pages for RX
Use of order-3 pages is problematic in some cases.

This patch might add three kinds of regression :

1) a CPU performance regression, but we will add later page
recycling and performance should be back.

2) TCP receiver could grow its receive window slightly slower,
   because skb->len/skb->truesize ratio will decrease.
   This is mostly ok, we prefer being conservative to not risk OOM,
   and eventually tune TCP better in the future.
   This is consistent with other drivers using 2048 per ethernet frame.

3) Because we allocate one page per RX slot, we consume more
   memory for the ring buffers. XDP already had this constraint anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 60c7f5ae54 mlx4: removal of frag_sizes[]
We will soon use order-0 pages, and frag truesize will more precisely
match real sizes.

In the new model, we prefer to use <= 2048 bytes fragments, so that
we can use page-recycle technique on PAGE_SIZE=4096 arches.

We will still pack as much frames as possible on arches with big
pages, like PowerPC.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet acd7628de0 mlx4: reduce rx ring page_cache size
We only need to store the page and dma address.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet d85f6c14e9 mlx4: rx_headroom is a per port attribute
No need to duplicate it per RX queue / frags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet aaca121dd6 mlx4: get rid of frag_prefix_size
Using per frag storage for frag_prefix_size is really silly.

mlx4_en_complete_rx_desc() has all needed info already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 159ddfd2ca mlx4: remove order field from mlx4_en_frag_info
This is really a port attribute, no need to duplicate it per
RX queue and per frag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 69ba943151 mlx4: dma_dir is a mlx4_en_priv attribute
No need to duplicate it for all queues and frags.

num_frags & log_rx_info become u8 to save space.
u8 accesses are a bit faster than u16 anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 09:54:46 -08:00
Eric Dumazet 47d3a07528 net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
The cited commit makes a great job of finding optimal shift/multiplier
values assuming a 10 seconds wrap around, but forgot to change the
overflow_period computation.

It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
which is silly.

Lets simply use 5 seconds, no need to recompute this, given how it is
supposed to work.

Later, we will use a timer instead of a work queue, since the new RX
allocation schem will no longer need mlx4_en_recover_from_oom() and the
service_task firing every 250 ms.

Fixes: 31c128b66e ("net/mlx4_en: Choose time-stamping shift value according to HW frequency")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 15:39:43 -05:00
Eric Dumazet 7f0137e2ef net/mlx4_en: Use __skb_fill_page_desc()
Or we might miss the fact that a page was allocated from memory reserves.

Fixes: dceeab0e52 ("mlx4: support __GFP_MEMALLOC for rx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Jack Morgenstein 6ed63d845e net/mlx4_core: Use cq quota in SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF
or for VFs, we create enough EQE entries to handle completions
for the max number of CQs that can use that EQ.

When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource tracker).
Therefore, when creating an EQ, the number of EQE entries that the VF
should request for that EQ is the CQ quota value (and not the total
number of CQs available in the FW).

Under SRIOV, the PF, also must use its CQ quota, because
the resource tracker also controls how many CQs the PF can obtain.

Using the FW total CQs instead of the CQ quota when creating EQs resulted
wasting MTT entries, due to allocating more EQEs than were needed.

Fixes: 5a0d0a6161 ("mlx4: Structures and init/teardown for VF resource quotas")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Majd Dibbiny 95f1ba9a24 net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs
In the VF driver, module parameter mlx4_log_num_mgm_entry_size was
mistakenly overwritten -- and in a manner which overrode the
device-managed flow steering option encoded in the parameter.

log_num_mgm_entry_size is a global module parameter which
affects all ConnectX-3 PFs installed on that host.
If a VF changes log_num_mgm_entry_size, this will affect all PFs
which are probed subsequent to the change (by disabling DMFS for
those PFs).

Fixes: 3c439b5586 ("mlx4_core: Allow choosing flow steering mode")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Eugenia Emantayev 745d8ae462 net/mlx4: Spoofcheck and zero MAC can't coexist
Spoofcheck can't be enabled if VF MAC is zero.
Vice versa, can't zero MAC if spoofcheck is on.

Fixes: 8f7ba3ca12 ('net/mlx4: Add set VF mac address support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:56 -05:00
Or Gerlitz 423b3aecf2 net/mlx4: Change ENOTSUPP to EOPNOTSUPP
As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlx4 driver.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Suggested-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:56 -05:00
Eric Dumazet 3608b13ccc mlx4: reduce OOM risk on arches with large pages
Since mlx4 NIC are used on PowerPC with 64K pages, we need to adapt
MLX4_EN_ALLOC_PREFER_ORDER definition.

Otherwise, a fragment sitting in an out of order TCP queue can hold
0.5 Mbytes and it is a serious OOM risk.

Fixes: 51151a16a6 ("mlx4: allow order-0 memory allocations in RX path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20 10:31:50 -05:00
Eric Dumazet f5a5772337 mlx4: fix potential divide by 0 in mlx4_en_auto_moderation()
1) In the case where rate == priv->pkt_rate_low == priv->pkt_rate_high,
mlx4_en_auto_moderation() does a divide by zero.

2) We want to properly change the moderation parameters if rx_frames
was changed (like in ethtool -C eth0 rx-frames 16)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19 18:15:23 -05:00
Eric Dumazet 01f0f42534 mlx4: do not fire tasklet unless necessary
All rx and rx netdev interrupts are handled by respectively
by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.

But mlx4_eq_int() also fires a tasklet to service all items that were
queued via mlx4_add_cq_to_tasklet(), but this handler was not called
unless user cqe was handled.

This is very confusing, as "mpstat -I SCPU ..." show huge number of
tasklet invocations.

This patch saves this overhead, by carefully firing the tasklet directly
from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 10:55:22 -05:00
Eric Dumazet 99f5711e7c mlx4: do not use rwlock in fast path
Using a reader-writer lock in fast path is silly, when we can
instead use RCU or a seqlock.

For mlx4 hwstamp clock, a seqlock is the way to go, removing
two atomic operations and false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-15 12:06:18 -05:00
David S. Miller 3efa70d78f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 16:29:30 -05:00
Benjamin Poirier bd4ce941c8 mlx4: Invoke softirqs after napi_reschedule
mlx4 may schedule napi from a workqueue. Afterwards, softirqs are not run
in a deterministic time frame and the following message may be logged:
NOHZ: local_softirq_pending 08

The problem is the same as what was described in commit ec13ee8014
("virtio_net: invoke softirqs after __napi_schedule") and this patch
applies the same fix to mlx4.

Fixes: 07841f9d94 ("net/mlx4_en: Schedule napi when RX buffers allocation fails")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 12:50:43 -05:00
Dan Carpenter 73cfb2a2e4 net/mlx4_en: fix a condition
There is a "||" vs "|" typo here so we test 0x1 instead of 0x6.

Fixes: 1f8176f735 ("net/mlx4_en: Check the enabling pptx/pprx flags in SET_PORT wrapper flow")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 12:01:06 -05:00
Martin KaFai Lau 770f82253d mlx4: xdp_prog becomes inactive after ethtool '-L' or '-G'
After calling mlx4_en_try_alloc_resources (e.g. by changing the
number of rx-queues with ethtool -L), the existing xdp_prog becomes
inactive.

The bug is that the xdp_prog ptr has not been carried over from
the old rx-queues to the new rx-queues

Fixes: 47a38e1550 ("net/mlx4_en: add support for fast rx drop bpf program")
Cc: Brenden Blanco <bblanco@plumgrid.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:27:05 -05:00
Martin KaFai Lau f32b20e89e mlx4: Fix memory leak after mlx4_en_update_priv()
In mlx4_en_update_priv(), dst->tx_ring[t] and dst->tx_cq[t]
are over-written by src->tx_ring[t] and src->tx_cq[t] without
first calling kfree.

One of the reproducible code paths is by doing 'ethtool -L'.

The fix is to do the kfree in mlx4_en_free_resources().

Here is the kmemleak report:
unreferenced object 0xffff880841211800 (size 2048):
  comm "ethtool", pid 3096, jiffies 4294716940 (age 528.353s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81930718>] kmemleak_alloc+0x28/0x50
    [<ffffffff8120b213>] kmem_cache_alloc_trace+0x103/0x260
    [<ffffffff8170e0a8>] mlx4_en_try_alloc_resources+0x118/0x1a0
    [<ffffffff817065a9>] mlx4_en_set_ringparam+0x169/0x210
    [<ffffffff818040c5>] dev_ethtool+0xae5/0x2190
    [<ffffffff8181b898>] dev_ioctl+0x168/0x6f0
    [<ffffffff817d7a72>] sock_do_ioctl+0x42/0x50
    [<ffffffff817d819b>] sock_ioctl+0x21b/0x2d0
    [<ffffffff81247a73>] do_vfs_ioctl+0x93/0x6a0
    [<ffffffff812480f9>] SyS_ioctl+0x79/0x90
    [<ffffffff8193d7ea>] entry_SYSCALL_64_fastpath+0x18/0xad
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff880841213000 (size 2048):
  comm "ethtool", pid 3096, jiffies 4294716940 (age 528.353s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81930718>] kmemleak_alloc+0x28/0x50
    [<ffffffff8120b213>] kmem_cache_alloc_trace+0x103/0x260
    [<ffffffff8170e0cb>] mlx4_en_try_alloc_resources+0x13b/0x1a0
    [<ffffffff817065a9>] mlx4_en_set_ringparam+0x169/0x210
    [<ffffffff818040c5>] dev_ethtool+0xae5/0x2190
    [<ffffffff8181b898>] dev_ioctl+0x168/0x6f0
    [<ffffffff817d7a72>] sock_do_ioctl+0x42/0x50
    [<ffffffff817d819b>] sock_ioctl+0x21b/0x2d0
    [<ffffffff81247a73>] do_vfs_ioctl+0x93/0x6a0
    [<ffffffff812480f9>] SyS_ioctl+0x79/0x90
    [<ffffffff8193d7ea>] entry_SYSCALL_64_fastpath+0x18/0xad
    [<ffffffffffffffff>] 0xffffffffffffffff

(gdb) list *mlx4_en_try_alloc_resources+0x118
0xffffffff8170e0a8 is in mlx4_en_try_alloc_resources (drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2145).
2140                    if (!dst->tx_ring_num[t])
2141                            continue;
2142
2143                    dst->tx_ring[t] = kzalloc(sizeof(struct mlx4_en_tx_ring *) *
2144                                              MAX_TX_RINGS, GFP_KERNEL);
2145                    if (!dst->tx_ring[t])
2146                            goto err_free_tx;
2147
2148                    dst->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
2149                                            MAX_TX_RINGS, GFP_KERNEL);
(gdb) list *mlx4_en_try_alloc_resources+0x13b
0xffffffff8170e0cb is in mlx4_en_try_alloc_resources (drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2150).
2145                    if (!dst->tx_ring[t])
2146                            goto err_free_tx;
2147
2148                    dst->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
2149                                            MAX_TX_RINGS, GFP_KERNEL);
2150                    if (!dst->tx_cq[t]) {
2151                            kfree(dst->tx_ring[t]);
2152                            goto err_free_tx;
2153                    }
2154            }

Fixes: ec25bc04ed ("net/mlx4_en: Add resilience in low memory systems")
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:27:05 -05:00
David S. Miller e2160156bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All merge conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 16:54:00 -05:00
Jack Morgenstein d585df1c5c net/mlx4_core: Avoid command timeouts during VF driver device shutdown
Some Hypervisors detach VFs from VMs by instantly causing an FLR event
to be generated for a VF.

In the mlx4 case, this will cause that VF's comm channel to be disabled
before the VM has an opportunity to invoke the VF device's "shutdown"
method.

The result is that the VF driver on the VM will experience a command
timeout during the shutdown process when the Hypervisor does not deliver
a command-completion event to the VM.

To avoid FW command timeouts on the VM when the driver's shutdown method
is invoked, we detect the absence of the VF's comm channel at the very
start of the shutdown process. If the comm-channel has already been
disabled, we cause all FW commands during the device shutdown process to
immediately return success (and thus avoid all command timeouts).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:45:27 -05:00
Shaker Daibes 1f8176f735 net/mlx4_en: Check the enabling pptx/pprx flags in SET_PORT wrapper flow
Make sure pptx/pprx mask flag is set using new fields upon set port
request. In addition, move this code into a helper function for better
code readability.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Shaker Daibes bf1f939683 net/mlx4_en: Check the enabling mtu flag in SET_PORT wrapper flow
Make sure MTU mask flag is set using new field upon set port
request. In addition, move this code into a helper function for better
code readability.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Shaker Daibes 40fb4fc1e1 net/mlx4_en: Pass user MTU value to Firmware at set port command
When starting the port, driver will inform Firmware about the actual MTU
which does not include implicit headers, such as FCS or VLAN tags.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Ariel Levkovich 297e1cf29e net/mlx4_en: Adding support of turning off link autonegotiation via ethtool
This feature will allow the user to disable auto negotiation
on the port for mlx4 devices while setting the speed is limited
to 1GbE speeds.
Other speeds will not be accepted in autoneg off mode.

This functionality is permitted providing that the firmware
is compatible with this feature.
The above is determined by querying a new dedicated capability
bit in the device.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Alaa Hleihel 4b5e5b7ece net/mlx4_core: Get num_tc using netdev_get_num_tc
Avoid reading num_tc directly from struct net_device, but use
the helper function netdev_get_num_tc.

Fixes: bc6a4744b8 ("net/mlx4_en: num cores tx rings for every UP")
Fixes: f5b6345ba8 ("net/mlx4_en: User prio mapping gets corrupted when changing number of channels")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Matan Barak ae5a2e29d1 net/mlx4_core: Add resource alloc/dealloc debugging
In order to aid debugging of functions that take a resource but
don't put it, add the last function name that successfully grabbed
this resource.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Yishai Hadas 3835336401 net/mlx4_core: Device revision support
The device revision field returned by the NodeInfo MAD is incorrect
on ConnectX3 devices.

This patch is driver side handling to complete a FW fix added at 2.11.1172.
INIT_HCA - bit at offset 0x0C.12 is set to 1 so that FW will report
correct device revision.

Older FW versions won't be affected from turning on that bit,
no capability bit is needed.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Tariq Toukan 72b8eaab24 net/mlx4: Replace ENOSYS with better fitting error codes
Conform the following warning:
WARNING: ENOSYS means 'invalid syscall nr' and nothing else.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:41 -05:00
David S. Miller 4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Daniel Borkmann a67edbf4fb bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

  # ./perf record -a -e bpf:* sleep 10
  # ./perf script
  sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  [...]
  sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
       swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

  [1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Geliang Tang 3704eb6f6f net/mlx4: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22 16:46:13 -05:00
Eric Dumazet dceeab0e52 mlx4: support __GFP_MEMALLOC for rx
Commit 04aeb56a17 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

As using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds this flag.

Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 23:35:12 -05:00
Eran Ben Elisha e91ef71dfe net/mlx4_en: Remove unnecessary checks when setting num channels
Boundaries checks for the number of RX, TX, other and combined channels
should be checked by the caller and not in the driver.

In addition, remove wrong memset on get channels as it overrides the cmd
field in the requester struct.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:58:23 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Jack Morgenstein 9577b174cd net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
When running SRIOV, warnings for SRQ LIMIT events flood the Hypervisor's
message log when (correct, normally operating) apps use SRQ LIMIT events
as a trigger to post WQEs to SRQs.

Add more information to the existing debug printout for SRQ_LIMIT, and
output the warning messages only for the SRQ CATAS ERROR event.

Fixes: acba2420f9 ("mlx4_core: Add wrapper functions and comm channel and slave event support to EQs")
Fixes: e0debf9cb5 ("mlx4_core: Reduce warning message for SRQ_LIMIT event to debug level")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:29 -05:00
Jack Morgenstein 7c3945bc20 net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
Save the qp context flags byte containing the flag disabling vlan stripping
in the RESET to INIT qp transition, rather than in the INIT to RTR
transition. Per the firmware spec, the flags in this byte are active
in the RESET to INIT transition.

As a result of saving the flags in the incorrect qp transition, when
switching dynamically from VGT to VST and back to VGT, the vlan
remained stripped (as is required for VST) and did not return to
not-stripped (as is required for VGT).

Fixes: f0f829bf42 ("net/mlx4_core: Add immediate activate for VGT->VST->VGT")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:28 -05:00
Jack Morgenstein 291c566a28 net/mlx4_core: Fix racy CQ (Completion Queue) free
In function mlx4_cq_completion() and mlx4_cq_event(), the
radix_tree_lookup requires a rcu_read_lock.
This is mandatory: if another core frees the CQ, it could
run the radix_tree_node_rcu_free() call_rcu() callback while
its being used by the radix tree lookup function.

Additionally, in function mlx4_cq_event(), since we are adding
the rcu lock around the radix-tree lookup, we no longer need to take
the spinlock. Also, the synchronize_irq() call for the async event
eliminates the need for incrementing the cq reference count in
mlx4_cq_event().

Other changes:
1. In function mlx4_cq_free(), replace spin_lock_irq with spin_lock:
   we no longer take this spinlock in the interrupt context.
   The spinlock here, therefore, simply protects against different
   threads simultaneously invoking mlx4_cq_free() for different cq's.

2. In function mlx4_cq_free(), we move the radix tree delete to before
   the synchronize_irq() calls. This guarantees that we will not
   access this cq during any subsequent interrupts, and therefore can
   safely free the CQ after the synchronize_irq calls. The rcu_read_lock
   in the interrupt handlers only needs to protect against corrupting the
   radix tree; the interrupt handlers may access the cq outside the
   rcu_read_lock due to the synchronize_irq calls which protect against
   premature freeing of the cq.

3. In function mlx4_cq_event(), we change the mlx_warn message to mlx4_dbg.

4. We leave the cq reference count mechanism in place, because it is
   still needed for the cq completion tasklet mechanism.

Fixes: 6d90aa5cf1 ("net/mlx4_core: Make sure there are no pending async events when freeing CQ")
Fixes: 225c7b1fee ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:28 -05:00
Eric Dumazet 8cf699ec84 mlx4: do not call napi_schedule() without care
Disable BH around the call to napi_schedule() to avoid following warning

[   52.095499] NOHZ: local_softirq_pending 08
[   52.421291] NOHZ: local_softirq_pending 08
[   52.608313] NOHZ: local_softirq_pending 08

Fixes: 8d59de8f7b ("net/mlx4_en: Process all completions in RX rings after port goes up")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 11:44:10 -05:00
David S. Miller 02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Martin KaFai Lau 9f9b74ef89 mlx4: Return EOPNOTSUPP instead of ENOTSUPP
In commit b45f0674b9 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs"),
it changed EOPNOTSUPP to ENOTSUPP by mistake.  This patch fixes it.

Fixes: b45f0674b9 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 21:16:43 -05:00
stephen hemminger bc1f44709c net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 17:51:44 -05:00
Jack Morgenstein 10b1c04e92 net/mlx4_core: Fix raw qp flow steering rules under SRIOV
Demoting simple flow steering rule priority (for DPDK) was achieved by
wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
function for the PF and all VFs.

In function mlx4_ib_create_flow(), this change caused the main rule
creation for the PF to be wrapped, while it left the associated
tunnel steering rule creation unwrapped for the PF.

This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
for the PF when the detach wrapper function did not find the associated
tunnel-steering rule (since creation of that rule for the PF did not
go through the wrapper function).

Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
(so that the PF invocation does not go through the wrapper), and perform
the required priority demotion for the PF in the mlx4_ib_create_flow()
code path.

Fixes: 48564135cb ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Slava Shwartsman 61b6034c6c net/mlx4_en: Fix type mismatch for 32-bit systems
is_power_of_2 expects unsigned long and we pass u64 max_val_cycles,
this will be truncated on 32 bit systems, and the result is not what we
were expecting.
div_u64 expects u32 as a second argument and we pass
max_val_cycles_rounded which is u64 hence it will always be truncated.
Fix was tested on both 64 and 32 bit systems and got same results for
max_val_cycles and max_val_cycles_rounded.

Fixes: 4850cf4581 ("net/mlx4_en: Resolve dividing by zero in 32-bit system")
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Leon Romanovsky c1d5f8ff80 net/mlx4: Remove BUG_ON from ICM allocation routine
This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address alignment in advance and performing proper
folding in case of error.

Fixes: 5b0bf5e25e ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Eugenia Emantayev 6496bbf0ec net/mlx4_en: Fix bad WQE issue
Single send WQE in RX buffer should be stamped with software
ownership in order to prevent the flow of QP in error in FW
once UPDATE_QP is called.

Fixes: 9f519f68cf ('mlx4_en: Not using Shared Receive Queues')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Jack Morgenstein 3b01fe7f91 net/mlx4_core: Use-after-free causes a resource leak in flow-steering detach
mlx4_QP_FLOW_STEERING_DETACH_wrapper first removes the steering
rule (which results in freeing the rule structure), and then
references a field in this struct (the qp number) when releasing the
busy-status on the rule's qp.

Since this memory was freed, it could reallocated and changed.
Therefore, the qp number in the struct may be incorrect,
so that we are releasing the incorrect qp. This leaves the rule's qp
in the busy state (and could possibly release an incorrect qp as well).

Fix this by saving the qp number in a local variable, for use after
removing the steering rule.

Fixes: 2c473ae7e5 ("net/mlx4_core: Disallow releasing VF QPs which have steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Linus Torvalds 8f18e4d03e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.

    The most important is to not assume the packet is RX just because
    the destination address matches that of the device. Such an
    assumption causes problems when an interface is put into loopback
    mode.

 2) If we retry when creating a new tc entry (because we dropped the
    RTNL mutex in order to load a module, for example) we end up with
    -EAGAIN and then loop trying to replay the request. But we didn't
    reset some state when looping back to the top like this, and if
    another thread meanwhile inserted the same tc entry we were trying
    to, we re-link it creating an enless loop in the tc chain. Fix from
    Daniel Borkmann.

 3) There are two different WRITE bits in the MDIO address register for
    the stmmac chip, depending upon the chip variant. Due to a bug we
    could set them both, fix from Hock Leong Kweh.

 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: stmmac: fix incorrect bit set in gmac4 mdio addr register
  r8169: add support for RTL8168 series add-on card.
  net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
  openvswitch: upcall: Fix vlan handling.
  ipv4: Namespaceify tcp_tw_reuse knob
  net: korina: Fix NAPI versus resources freeing
  net, sched: fix soft lockup in tc_classify
  net/mlx4_en: Fix user prio field in XDP forward
  tipc: don't send FIN message from connectionless socket
  ipvlan: fix multicast processing
  ipvlan: fix various issues in ipvlan_process_multicast()
2016-12-27 16:04:37 -08:00
Thomas Gleixner a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Tariq Toukan eb9def61be net/mlx4_en: Fix user prio field in XDP forward
The user prio field is wrong (and overflows) in the XDP forward
flow.
This is a result of a bad value for num_tx_rings_p_up, which should
account all XDP TX rings, as they operate for the same user prio.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
Linus Torvalds 0ab7b12c49 pci-v4.10-changes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
 mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
 fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
 v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
 iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
 kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
 eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
 1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
 wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
 juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
 84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
 tcfJPEcz1019PKZNSvWy
 =ut0k
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes:

   - add support for PCI on ARM64 boxes with ACPI. We already had this
     for theoretical spec-compliant hardware; now we're adding quirks
     for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)

   - add runtime PM support for hotplug ports

   - enable runtime suspend for Intel UHCI that uses platform-specific
     wakeup signaling

   - add yet another host bridge registration interface. We hope this is
     extensible enough to subsume the others

   - expose device revision in sysfs for DRM

   - to avoid device conflicts, make sure any VF BAR updates are done
     before enabling the VF

   - avoid unnecessary link retrains for ASPM

   - allow INTx masking on Mellanox devices that support it

   - allow access to non-standard VPD for Chelsio devices

   - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
     etc

   - update Rockchip support for max-link-speed

   - add NVIDIA Tegra210 support

   - add Layerscape LS1046a support

   - update R-Car compatibility strings

   - add Qualcomm MSM8996 support

   - remove some uninformative bootup messages"

* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
  PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
  PCI: Expand "VPD access disabled" quirk message
  PCI: pciehp: Remove loading message
  PCI: hotplug: Remove hotplug core message
  PCI: Remove service driver load/unload messages
  PCI/AER: Log AER IRQ when claiming Root Port
  PCI/AER: Log errors with PCI device, not PCIe service device
  PCI/AER: Remove unused version macros
  PCI/PME: Log PME IRQ when claiming Root Port
  PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
  PCI: Move config space size macros to pci_regs.h
  x86/platform/intel-mid: Constify mid_pci_platform_pm
  PCI/ASPM: Don't retrain link if ASPM not possible
  PCI: iproc: Skip check for legacy IRQ on PAXC buses
  PCI: pciehp: Leave power indicator on when enabling already-enabled slot
  PCI: pciehp: Prioritize data-link event over presence detect
  PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
  PCI: rcar: Use gen2 fallback compatibility last
  PCI: rcar-gen2: Use gen2 fallback compatibility last
  PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
  ..
2016-12-15 12:46:48 -08:00
Linus Torvalds 9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Martin KaFai Lau ea3349a035 mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path when XDP is active.

After testing, the tx_dropped counter is incremented if the xdp_prog sends
more than wire MTU.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Martin KaFai Lau b45f0674b9 mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
When XDP is active in mlx4, mlx4 is using one page/pkt.
At the same time (i.e. when XDP is active), it is currently
limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN)
which is 1514 in x86.  AFAICT, we can at least raise the MTU
limit up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this
patch is doing.  It will be useful in the next patch which
allows XDP program to extend the packet by adding new header(s).

Note: In the earlier XDP patches, there is already existing guard
to ensure the page/pkt scheme only applies when XDP is active
in mlx4.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Martin KaFai Lau 17bedab272 bpf: xdp: Allow head adjustment in XDP prog
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support.  The driver can then decide
to error out during XDP_SETUP_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Zhang Shengju 690291093a mlx4: use reset to set mac header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 15:49:16 -05:00