Commit Graph

194 Commits (1204c70d9dcba31164f78ad5d8c88c42335d51f8)

Author SHA1 Message Date
Maxim Mikityanskiy 9df86bdb67 net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget
When CQE compression is enabled, compressed CQEs use the following
structure: a title is followed by one or many blocks, each containing 8
mini CQEs (except the last, which may contain fewer mini CQEs).

Due to NAPI budget restriction, a complete structure is not always
parsed in one NAPI run, and some blocks with mini CQEs may be deferred
to the next NAPI poll call - hence the mlx5e_decompress_cqes_cont call
at the beginning of mlx5e_poll_rx_cq. However, if the budget is
extremely low, some blocks may be left even after that, but the code
that follows the mlx5e_decompress_cqes_cont call doesn't check it and
assumes that a new CQE begins, which may not be the case. In such cases,
random memory corruptions occur.

An extremely low NAPI budget of 8 is used when busy_poll or busy_read is
active.

This commit adds a check to make sure that the previous compressed CQE
has been completely parsed after mlx5e_decompress_cqes_cont; if it has
not, a new CQE is no longer fetched in the middle of a compressed one.

This commit fixes random crashes in __build_skb, __page_pool_put_page
and other seemingly unrelated places, which used to happen when both
CQE compression and busy_poll/busy_read were enabled.

Fixes: 7219ab34f1 ("net/mlx5e: CQE compression")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-29 16:27:19 -07:00
David S. Miller 1e46c09ec1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add the ability to use unaligned chunks in the AF_XDP umem. By
   relaxing where the chunks can be placed, it allows using an
   arbitrary buffer size and placing a chunk wherever there is free
   address space in the umem. This helps more seamless DPDK AF_XDP
   driver integration. Support for i40e, ixgbe and mlx5e, from Kevin
   and Maxim.

2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
   application can wake up the kernel for rx/tx processing, which
   avoids busy-spinning of the latter; useful when app and driver
   are located on the same core. Support for i40e, ixgbe and mlx5e,
   from Magnus and Maxim.

3) bpftool fixes for printf()-like functions so compiler can actually
   enforce checks, bpftool build system improvements for custom output
   directories, and addition of 'bpftool map freeze' command, from Quentin.

4) Support attaching/detaching XDP programs from 'bpftool net' command,
   from Daniel.

5) Automatic xskmap cleanup when AF_XDP socket is released, and several
   barrier/{read,write}_once fixes in AF_XDP code, from Björn.

6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
   inclusion as well as libbpf versioning improvements, from Andrii.

7) Several new BPF kselftests for verifier precision tracking, from Alexei.

8) Several BPF kselftest fixes wrt endianness to run on s390x, from Ilya.

9) And more BPF kselftest improvements all over the place, from Stanislav.

10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06 16:49:17 +02:00
Aya Levin 8276ea1353 net/mlx5e: Report and recover from CQE with error on RQ
Add support for reporting and recovering from an error on an RQ
completion by setting the queue back to ready state. Handle only
errors with a syndrome indicating the RQ might enter error state
and could be recovered.
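
A hedged sketch of such a syndrome filter (the exact syndrome set is
an assumption based on the description above; the constants are the
generic mlx5 CQE error syndromes):

    /* Recover only from syndromes indicating the RQ may be in error state. */
    static bool rx_err_cqe_needs_recover(u8 syndrome)
    {
        switch (syndrome) {
        case MLX5_CQE_SYNDROME_LOCAL_LENGTH_ERR:
        case MLX5_CQE_SYNDROME_LOCAL_QP_OP_ERR:
        case MLX5_CQE_SYNDROME_LOCAL_PROT_ERR:
        case MLX5_CQE_SYNDROME_WR_FLUSH_ERR:
            return true;
        default:
            return false;
        }
    }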

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-20 13:08:18 -07:00
Saeed Mahameed 0a35ab3e13 net/mlx5e: RX, Handle CQE with error at the earliest stage
Just to be aligned with the MPWQE handlers, handle RX WQE with error
for legacy RQs in the top RX handlers, just before calling skb_from_cqe().

CQE error handling will now be called at the same stage regardless of
the RQ type or netdev mode (NIC, Representor, IPoIB, etc.).

This will be useful for a downstream patch that improves error CQE
handling.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-20 13:08:18 -07:00
Aya Levin be5323c837 net/mlx5e: Report and recover from CQE error on ICOSQ
Add support for reporting and recovering from an error on an ICOSQ
completion. Deactivate the RQ and flush it, then deactivate the ICOSQ.
Set the queue back to ready state (firmware) and reset the ICOSQ and
the RQ (software resources). Finally, activate the ICOSQ and the RQ.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-20 13:08:17 -07:00
Maxim Mikityanskiy a7bd4018d6 net/mlx5e: Add AF_XDP need_wakeup support
This commit adds support for the new need_wakeup feature of AF_XDP. The
applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag.
When this feature is enabled, some behavior changes:

RX side: If the Fill Ring is empty, instead of busy-polling, set the
flag to tell the application to kick the driver when it refills the Fill
Ring.

TX side: If there are pending completions or packets queued for
transmission, set the flag to tell the application that it can skip the
sendto() syscall and save time.
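
A hedged sketch of both sides, using the kernel's xsk need_wakeup
helpers. The flag polarity follows the AF_XDP convention (the app
issues the sendto() kick only while the flag is set); the surrounding
conditions are assumptions:

    /* RX: the Fill Ring ran empty - ask the app for a kick on refill. */
    if (xsk_umem_uses_need_wakeup(umem)) {
        if (fill_ring_empty)                  /* hypothetical condition */
            xsk_set_rx_need_wakeup(umem);
        else
            xsk_clear_rx_need_wakeup(umem);
    }

    /* TX: pending completions guarantee another NAPI run, so clearing
     * the flag lets the app skip the sendto() syscall.
     */
    if (xsk_umem_uses_need_wakeup(umem)) {
        if (sq->pc != sq->cc)                 /* completions pending */
            xsk_clear_tx_need_wakeup(umem);
        else
            xsk_set_tx_need_wakeup(umem);
    }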

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled:

       | without need_wakeup  | with need_wakeup     |
       |----------------------|----------------------|
       | one core | two cores | one core | two cores |
-------|----------|-----------|----------|-----------|
txonly | 20.1     | 33.5      | 29.0     | 34.2      |
rxdrop | 0.065    | 14.1      | 12.0     | 14.1      |
l2fwd  | 0.032    | 7.3       | 6.6      | 7.2       |

"One core" means the application and NAPI run on the same core. "Two
cores" means they are pinned to different cores.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:07:32 +02:00
Saeed Mahameed 8c7698d5ca net/mlx5e: Rx, checksum handling refactoring
Move vlan checksum fixup flow into mlx5e_skb_padding_csum(), which is
supposed to fixup SKB checksum if needed. And rename
mlx5e_skb_padding_csum() to mlx5e_skb_csum_fixup().

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-01 12:33:31 -07:00
David S. Miller 114a5c3240 mlx5-fixes-2019-07-11

Merge tag 'mlx5-fixes-2019-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2019-07-11

This series introduces some fixes to mlx5 driver.

Please pull and let me know if there is any problem.

For -stable v4.15
('net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn')

For -stable v5.1
('net/mlx5e: Fix port tunnel GRE entropy control')
('net/mlx5e: Rx, Fix checksum calculation for new hardware')
('net/mlx5e: Fix return value from timeout recover function')
('net/mlx5e: Fix error flow in tx reporter diagnose')

For -stable v5.2
('net/mlx5: E-Switch, Fix default encap mode')

Conflict note: This pull request will produce a small conflict when
merged with net-next.
In drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
Take the hunk from net and replace:
esw_offloads_steering_init(esw, vf_nvports, total_nvports);
with:
esw_offloads_steering_init(esw);
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11 15:06:37 -07:00
Saeed Mahameed db849faa9b net/mlx5e: Rx, Fix checksum calculation for new hardware
CQE checksum full mode in new HW provides a full checksum of the rx
frame, covering bytes starting from the eth protocol up to the last
byte in the received frame (frame_size - ETH_HLEN), as expected by the
stack.

Fixing up skb->csum by the driver is not required in such a case. This
fix avoids a wrong checksum calculation in drivers which already support
the new hardware with the new checksum mode.

Fixes: 85327a9c41 ("net/mlx5: Update the list of the PCI supported devices")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-11 11:45:03 -07:00
Maxim Mikityanskiy db05815b36 net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.

We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
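
Hypothetical helpers illustrating the split ID namespace (the names are
illustrative, not necessarily the exact patch):

    /* qid in [0, num_channels) is a regular RQ; [num_channels,
     * 2 * num_channels) is the XSK RQ of channel (qid - num_channels).
     */
    static inline bool mlx5e_qid_is_xsk(u16 qid, u16 num_channels)
    {
        return qid >= num_channels;
    }

    static inline u16 mlx5e_qid_to_channel(u16 qid, u16 num_channels)
    {
        return mlx5e_qid_is_xsk(qid, num_channels) ?
               qid - num_channels : qid;
    }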

XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.

Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.

Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.

We create a dedicated XSK SQ in the channel. This separation has its
advantages:

1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.

2. Calculating statistics separately.

When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
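
A hedged sketch of that kick path (the lock and helper names are
assumptions based on the description above):

    /* Userspace may kick the TX from any core: serialize NOP posting. */
    spin_lock(&c->xskicosq_lock);
    mlx5e_trigger_irq(&c->xskicosq);  /* post a NOP and ring the doorbell */
    spin_unlock(&c->xskicosq_lock);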

Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.

LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.

The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:

1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.

2. MTU changes while there are UMEMs registered.

Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled, single stream:

txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps

The results with retpoline enabled, single stream:

txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:53:28 +02:00
Maxim Mikityanskiy ed084fb604 net/mlx5e: Allow ICO SQ to be used by multiple RQs
Prepare for the creation of the XSK RQ, which will require posting UMRs,
too. The same ICO SQ will be used for both RQs and also to trigger
interrupts by posting NOPs. UMR WQEs can't be reused any more, so the
optimization introduced in commit ab966d7e4f ("net/mlx5e: RX, Recycle
buffer of UMR WQEs") is reverted.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:53:27 +02:00
Jesper Dangaard Brouer 29b006a676 mlx5: more strict use of page_pool API
The mlx5 driver is using page_pool, but not for DMA-mapping (currently),
and is a little too relaxed about returning or releasing page resources,
as strict handling is not necessary when DMA mappings are not used.

As this patchset works towards tracking page_pool resources, to know
about in-flight frames on shutdown, fix the places where mlx5 leaks
page_pool resources.

In case of dma_mapping_error, recycle the page into the page_pool.

In mlx5e_free_rq(), move the page_pool_destroy() call to after the
mlx5e_page_release() calls, as it is more correct.

In mlx5e_page_release(), when no recycle was requested, release the page
from the page_pool via page_pool_release_page().
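
A simplified sketch of the resulting release flow (the real function
also handles DMA unmapping; names are hedged):

    void mlx5e_page_release(struct mlx5e_rq *rq,
                            struct mlx5e_dma_info *dma_info, bool recycle)
    {
        if (recycle) {
            page_pool_recycle_direct(rq->page_pool, dma_info->page);
        } else {
            /* Let the pool untrack the page before handing it back. */
            page_pool_release_page(rq->page_pool, dma_info->page);
            put_page(dma_info->page);
        }
    }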

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Paolo Abeni 55f968726e net/mlx5e: use indirect calls wrapper for the rx packet handler
We can avoid another indirect call per packet wrapping the rx
handler call with the proper helper.

To ensure that even the last listed direct call experiences a
measurable gain, despite the additional conditionals we must
traverse before reaching it, I tested reversing the order of the
listed options; the performance differences were below noise level.

Together with the previous indirect call patch, this gives
~6% performance improvement in raw UDP tput.
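
The wrapper from <linux/indirect_call_wrapper.h> compares the function
pointer against the listed candidates and branches to a direct call on
a match. A sketch of the dispatch described above, with the two
always-built handlers as candidates:

    INDIRECT_CALL_2(rq->handle_rx_cqe,
                    mlx5e_handle_rx_cqe_mpwrq, mlx5e_handle_rx_cqe,
                    rq, cqe);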

v2 -> v3:
 - use only the direct calls always available regardless of
   the mlx5 build options
 - drop the direct call list macro, to keep the code as simple
   as possible for future rework

v1 -> v2:
 - update the direct call list and use a macro to define it,
   as per Saeed suggestion. An intermediated additional
   macro is needed to allow arg list expansion

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 15:35:17 -07:00
Paolo Abeni b3c04e8340 net/mlx5e: use indirect calls wrapper for skb allocation
We can avoid an indirect call per packet wrapping the skb creation
with the appropriate helper.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 15:35:17 -07:00
Tariq Toukan fd9b4be800 net/mlx5e: RX, Support multiple outstanding UMR posts
The buffer mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.

A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.

When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets, more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), leaving less latency
room for the UMR completions.

In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.

For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.

This is expected to improve packet rate in high CPU scale.
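
A heavily hedged sketch of the bulked refill idea (all names and the
exact loop are assumptions based on the description above):

    #define MLX5E_UMR_WQE_BULK 2    /* post UMRs in bulks of at least 2 */

    missing = mlx5e_mpwqe_get_missing(rq);  /* hypothetical: consumed,
                                             * not yet reposted */
    if (missing < MLX5E_UMR_WQE_BULK)
        return false;                       /* wait to repost in bulk */

    do
        err = mlx5e_alloc_rx_mpwqe(rq, head++);  /* posts one UMR WQE */
    while (!err && --missing);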

Performance test:
As expected, huge improvement in large-scale (48 cores).

xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.

Before: Unstable, 7 to 30 Mpps
After:  Stable,   at 70.5 Mpps

No degradation in other tested scenarios.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-23 12:09:19 -07:00
Or Gerlitz 8c8811d46d Revert "net/mlx5e: Enable reporting checksum unnecessary also for L3 packets"
This reverts commit b820e6fb09.

Prior to the commit we are reverting, checksum unnecessary was only set
when both the L3 OK and L4 OK bits are set on the CQE. That commit caused
packets of IP protocols such as SCTP, which the current HW L4 parser does
not handle (so the L4 OK bit is not set, but the 'L4 header type none'
bit is), to go through the checksum-none code, where we wrongly reported
checksum unnecessary for them - a regression. Fix this by a revert.

Note that on our usual track we report checksum complete, so the revert
isn't expected to have any notable performance impact. Also, when we are
not on the checksum complete track, the L4 protocols for which we report
checksum none are not high performance ones, we will still report
checksum unnecessary for UDP/TCP.

Fixes: b820e6fb09 ("net/mlx5e: Enable reporting checksum unnecessary also for L3 packets")
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Avi Urman <aviu@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09 12:33:51 -07:00
Saeed Mahameed 0318a7b7fc net/mlx5e: Rx, Check ip headers sanity
In the two places where is_last_ethertype_ip is called, the caller will
be looking inside the ip header. To be safe, add an ip{4,6} header
sanity check, and return true only on valid ip headers, i.e.: the whole
header is contained in the linear part of the skb.

Note: Such situation is very rare and hard to reproduce, since mlx5e
allocates a large enough headroom to contain the largest header one can
imagine.
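
A hedged sketch of the checked helper (close to the description above;
treat it as an illustration rather than the exact patch):

    static bool is_last_ethertype_ip(struct sk_buff *skb,
                                     int *network_depth, __be16 *proto)
    {
        *proto = ((struct ethhdr *)skb->data)->h_proto;
        *proto = __vlan_get_protocol(skb, *proto, network_depth);

        if (*proto == htons(ETH_P_IP))
            return pskb_may_pull(skb,
                                 *network_depth + sizeof(struct iphdr));

        if (*proto == htons(ETH_P_IPV6))
            return pskb_may_pull(skb,
                                 *network_depth + sizeof(struct ipv6hdr));

        return false;
    }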

Fixes: fe1dc06999 ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09 12:33:51 -07:00
Saeed Mahameed 0aa1d18615 net/mlx5e: Rx, Fixup skb checksum for packets with tail padding
When an ethernet frame with ip payload is padded, the padding octets are
not covered by the hardware checksum.

Prior to the cited commit, skb checksum was forced to be CHECKSUM_NONE
when padding is detected. After it, the kernel will try to trim the
padding bytes and subtract their checksum from skb->csum.

In this patch we fixup skb->csum for any ip packet with tail padding of
any size, if padding is found.
The FCS case is just one special case of this general-purpose patch,
hence its dedicated handling is removed.
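
A hedged sketch of the fixup: fold the padding bytes into skb->csum so
the stack's later trim-and-subtract (pskb_trim_rcsum) stays consistent.
The helper name and locals are illustrative, and the padding is assumed
to reside in the linear part:

    static void tail_padding_csum_fixup(struct sk_buff *skb, u16 pkt_len)
    {
        int pad_len = skb->len - pkt_len;  /* octets not covered by HW */
        void *tail = skb->data + pkt_len;

        if (pad_len <= 0)
            return;

        skb->csum = csum_block_add(skb->csum,
                                   csum_partial(tail, pad_len, 0),
                                   pkt_len);
    }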

Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09 12:33:50 -07:00
Saeed Mahameed 5d0bb3bac4 net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded
XDP programs might change packets data contents which will make the
reported skb checksum (checksum complete) invalid.

When XDP programs are loaded/unloaded, set/clear the RX RQs'
MLX5E_RQ_STATE_NO_CSUM_COMPLETE flag.
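
A hedged sketch of both halves (call sites are assumptions):

    /* on XDP program (un)load: */
    if (params->xdp_prog)
        __set_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state);
    else
        __clear_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state);

    /* in the RX csum handler: */
    if (test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state))
        goto csum_unnecessary;   /* skip the CHECKSUM_COMPLETE path */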

Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-04-09 12:33:50 -07:00
Feras Daoud 3d6f3cdf9b net/mlx5e: IPoIB, Fix RX checksum statistics update
Update the RX checksum only if the feature is enabled.

Fixes: 9d6bd752c6 ("net/mlx5e: IPoIB, RX handler")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-11 11:32:40 -07:00
Tariq Toukan 79d356ef2c net/mlx5e: Take CQ decompress fields into a separate structure
Only the Receive CQ makes use of these fields. Take them
out into a separate struct and save space in the generic
CQ structure.
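
A hedged sketch of the extracted struct, based on the fields the RX
decompression flow needs (the exact layout is an assumption):

    struct mlx5e_cq_decomp {
        struct mlx5_cqe64 title;      /* title CQE being expanded */
        struct mlx5_mini_cqe8 mini_arr[MLX5_MINI_CQE_ARRAY_SIZE];
        u8 mini_arr_idx;
        u16 left;                     /* mini CQEs left to decompress */
        u16 wqe_counter;
    } ____cacheline_aligned_in_smp;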

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-25 12:16:13 -08:00
Tariq Toukan 9481627838 net/mlx5e: RX, Make sure packet header does not cross page boundary
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross a page boundary. This requires special care in the fast
path that costs LoC, additional runtime instructions and branches.

It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased from 64B to 256B.

Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.

Configuration:
Set Striding RQ and LRO ON (to enable the non-linear SKB scheme).
GRO OFF, early drop by TC rule.

64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
  branches tests.

192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.

256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-25 12:16:13 -08:00
Cong Wang e8c8b53cca net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zeros, which don't affect the
checksum. However, we have a switch which pads with non-zero octets;
this causes repeated kernel hardware checksum faults.

Prior to:
commit '88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ...")'
skb checksum was forced to be CHECKSUM_NONE when padding is detected.
After it, we need to keep skb->csum updated, like what we do for RXFCS.
However, fixing up CHECKSUM_COMPLETE requires verifying and parsing IP
headers; it is not worth the effort, as the packets are so small that
CHECKSUM_COMPLETE can't save anything.
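
A hedged sketch of the resulting check (the macro is an assumption
matching the description):

    #define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)

        if (short_frame(skb->len))
            goto csum_unnecessary;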

Fixes: 88078d98d1 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-18 16:15:30 -08:00
Tariq Toukan 4fb2f51618 net/mlx5e: XDP, Precede XDP-related operations in RQ poll by a loaded program check
At the end of the RQ polling loop, some XDP-related operations
might be required. Before checking them one by one, check if
an XDP program is even loaded.
Combine all the checks and operations in a single function in xdp files.

This saves unnecessary checks for non-XDP flows.
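
A hedged sketch of the combined helper (the names are assumptions):

    static inline void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq)
    {
        struct mlx5e_xdpsq *xdpsq = &rq->xdpsq;

        if (xdpsq->doorbell)
            mlx5e_xmit_xdp_doorbell(xdpsq);

        if (xdpsq->redirect_flush) {
            xdp_do_flush_map();
            xdpsq->redirect_flush = false;
        }
    }

    /* in mlx5e_poll_rx_cq(), only when a program is loaded: */
    if (rq->xdp_prog)
        mlx5e_xdp_rx_poll_complete(rq);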

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-20 22:54:17 -08:00
David S. Miller 2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, but happily all cases were overlapping changes,
parallel adds, and things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
Tariq Toukan bfc698254b net/mlx5e: RX, Fix wrong early return in receive queue poll
When the completion queue of the RQ is empty, do not immediately return.
If left-over decompressed CQEs (from the previous cycle) were processed,
we need to go to the finalization part of the poll function.

Bug exists only when CQE compression is turned ON.

This solves the following issue:
mlx5_core 0000:82:00.1: mlx5_eq_int:544:(pid 0): CQ error on CQN 0xc08, syndrome 0x1
mlx5_core 0000:82:00.1 p4p2: mlx5e_cq_error_event: cqn=0x000c08 event=0x04
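
A hedged sketch of the fixed early-return logic:

    cqe = mlx5_cqwq_get_cqe(cqwq);
    if (!cqe) {
        if (unlikely(work_done))  /* left-over CQEs were processed */
            goto out;             /* run the finalization part */
        return 0;
    }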

Fixes: 4b7dfc9925 ("net/mlx5e: Early-return on empty completion queues")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-19 13:31:16 -08:00
Saeed Mahameed 2f62747c77 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
mlx5-next is a shared branch with the rdma subtree, used to avoid mlx5
rdma vs. netdev conflicts.

Highlights:

1) RDMA ODP (On Demand Paging) improvements and moving ODP logic to the
mlx5 RDMA driver
2) Improved mlx5 core driver and device events handling, and provided an
API for upper layers to subscribe to device events.
3) RDMA only code cleanup from mlx5 core
4) Add helper to get CQE opcode
5) Rework handling of port module events
6) shared mlx5_ifc.h updates to avoid conflicts

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-10 15:50:50 -08:00
Tariq Toukan 6254adeb1f net/mlx5: Use helper to get CQE opcode
Introduce and use a helper that extracts the opcode
from a CQE (completion queue entry) structure.
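
The helper boils down to extracting the high nibble of op_own:

    static inline u8 get_cqe_opcode(struct mlx5_cqe64 *cqe)
    {
        return cqe->op_own >> 4;
    }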

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-09 18:16:16 -08:00
Cong Wang ef6fcd4552 mlx5: fix get_ip_proto()
The IP header is not necessarily located right after struct ethhdr;
there could be multiple 802.1Q headers in between. This is why
we call __vlan_get_protocol().
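
A hedged sketch of the fixed helper: it now takes the network depth
found by __vlan_get_protocol() instead of assuming the IP header
directly follows struct ethhdr:

    static inline u8 get_ip_proto(struct sk_buff *skb, int network_depth,
                                  __be16 proto)
    {
        void *ip_p = skb->data + network_depth;

        return (proto == htons(ETH_P_IP)) ?
               ((struct iphdr *)ip_p)->protocol :
               ((struct ipv6hdr *)ip_p)->nexthdr;
    }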

Fixes: fe1dc06999 ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets")
Cc: Alaa Hleihel <alaa@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 17:19:33 -08:00
Moshe Shemesh 0073c8f727 net/mlx5e: RX, verify received packet size in Linear Striding RQ
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that a WQE (RX descriptor) can be used for many packets, so the WQE is
much bigger than the MTU. In virtualization setups where the port mtu
can be larger than the vf mtu, a received packet bigger than the MTU
won't be dropped by HW, since the receive WQE is not too small for it.
If we use linear SKB in striding RQ, since each stride has room for an
mtu-size payload plus skb info, an oversized packet can cross the
allocated page boundary upon the call to build_skb and lead to a crash.
So the driver needs to check the packet size and drop it.

Introduce new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.

As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
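
A hedged sketch of the added check in the linear Striding RQ SKB path:

    if (unlikely(cqe_bcnt > rq->hw_mtu)) {
        rq->stats->oversize_pkts_sw_drop++;   /* the new counter */
        return NULL;  /* drop: would cross the page on build_skb() */
    }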

Fixes: 619a8f2a42 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-11-19 14:35:04 -08:00
Eric Dumazet d48051c5b8 net/mlx5e: fix csum adjustments caused by RXFCS
As shown by Dimitris, we need to use csum_block_add() instead of csum_add()
when adding the FCS contribution to skb csum.

Before 4.18 (more exactly commit 88078d98d1 "net: pskb_trim_rcsum()
and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away,
so RXFCS changes were ignored.

Then before commit d55bef5059 ("net: fix pskb_trim_rcsum_slow() with
odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling
each other.

Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5.

Note that this patch also rewrites mlx5e_get_fcs() to:

- Use skb_header_pointer() instead of reinventing it.
- Use __get_unaligned_cpu32() to avoid possible unaligned accesses,
  as Dimitris pointed out.
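
Per the two points above, the rewritten helper looks like this (a
sketch consistent with the description):

    static u32 mlx5e_get_fcs(const struct sk_buff *skb)
    {
        const void *fcs_bytes;
        u32 _fcs_bytes;

        fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
                                       ETH_FCS_LEN, &_fcs_bytes);

        return __get_unaligned_cpu32(fcs_bytes);
    }

with the caller folding it in via:

    skb->csum = csum_block_add(skb->csum,
                               (__force __wsum)mlx5e_get_fcs(skb),
                               skb->len - ETH_FCS_LEN);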

Fixes: 902a545904 ("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation")
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Dimitris Michailidis <dmichail@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Paweł Staszewski <pstaszewski@itcare.pl>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Tested-By: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-31 12:40:52 -07:00
Eric Dumazet 55469bc6b5 drivers: net: remove <net/busy_poll.h> inclusion when not needed
Drivers using generic NAPI interface no longer need to include
<net/busy_poll.h>, since busy polling was moved to core networking
stack long ago.

See commit 79e7fff47b ("net: remove support for per driver
ndo_busy_poll()") for reference.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-25 16:20:02 -07:00
David S. Miller 2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Tariq Toukan 37fdffb217 net/mlx5: WQ, fixes for fragmented WQ buffers API
The mlx5e netdevice used to calculate fragment edges by a call to
mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
indication for queues smaller than a PAGE_SIZE (broken by default on
PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct
new calls/API.

Since (TX/RX) Work Queues buffers are fragmented, here we introduce
changes to the API in core driver, so that it gets a stride index and
returns the index of last stride on same fragment, and an additional
wrapping function that returns the number of physically contiguous
strides that can be written contiguously to the work queue.

This obsoletes the following API functions, and their buggy
usage in EN driver:
* mlx5_wq_cyc_get_frag_size()
* mlx5_wq_cyc_ctr2fragix()
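
A heavily hedged sketch of the new API shape (both names are
assumptions based on the description above):

    /* Index of the last stride on the same fragment as @ix. */
    u16 mlx5_wq_cyc_get_frag_last_ix(struct mlx5_wq_cyc *wq, u16 ix);

    /* Number of physically contiguous strides writable from @ix. */
    static inline u16
    mlx5_wq_cyc_get_contig_strides(struct mlx5_wq_cyc *wq, u16 ix)
    {
        return mlx5_wq_cyc_get_frag_last_ix(wq, ix) - ix + 1;
    }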

The new API improves modularity and hides the details of such
calculation for mlx5e netdevice and mlx5_ib rdma drivers.

New calculation is also more efficient, and improves performance
as follows:

Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.

Before: 16,477,619 pps
After:  17,085,793 pps

3.7% improvement

Fixes: 3a2f703312 ("net/mlx5: Use order-0 allocations for all WQ types")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 18:26:16 -07:00
Or Gerlitz b856df28f9 net/mlx5e: Allow reporting of checksum unnecessary
Currently we practically never report checksum unnecessary, because
for all IP packets we take the checksum complete path.

Enable non-default runs that report checksum unnecessary, using
an ethtool private flag. This can be useful for performance evals
and other explorations.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 11:32:47 -07:00
Or Gerlitz b820e6fb09 net/mlx5e: Enable reporting checksum unnecessary also for L3 packets
We can report checksum unnecessary also when the L3 checksum
flag on the cqe is set and there's no L4 header.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 11:32:46 -07:00
Alaa Hleihel fe1dc06999 net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets
CHECKSUM_COMPLETE is not applicable to SCTP protocol.
Setting it for SCTP packets leads to CRC32c validation failure.
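
A hedged sketch of the fix in the csum handler:

    if (unlikely(get_ip_proto(skb, proto) == IPPROTO_SCTP))
        goto csum_unnecessary;  /* leave CRC32c validation to the stack */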

Fixes: bbceefce9a ("net/mlx5e: Support RX CHECKSUM_COMPLETE")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-05 21:14:57 -07:00
Natali Shechtman f007c13d4a net/mlx5e: Set ECN for received packets using CQE indication
In a multi-host (MH) NIC scheme, a single HW port serves multiple hosts
or sockets on the same host.
The HW uses a mechanism in the PCIe buffer which monitors
the amount of consumed PCIe buffers per host.
On a certain configuration, under congestion,
the HW emulates a switch doing ECN marking on packets using ECN
indication on the completion descriptor (CQE).

The driver needs to set the ECN bits on the packet SKB so that the
network stack can react to them; this commit does that.
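
A hedged sketch (the CQE accessor is an assumption; INET_ECN_set_ce()
is the stack helper that marks CE in the IP header):

    if (unlikely(mlx5e_cqe_is_ce_marked(cqe)))  /* hypothetical accessor */
        INET_ECN_set_ce(skb);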

Signed-off-by: Natali Shechtman <natali@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-05 21:14:57 -07:00
Feras Daoud 19052a3b77 net/mlx5e: IPoIB, Use priv stats in completion rx flow
Since the RQs are shared between all pkey interfaces, the stats
should be taken from where the per-ring stats are stored instead
of the parent RQ.

Fixes: 4c6c615e3f ("net/mlx5e: IPoIB, Add PKEY child interface nic profile")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 16:22:42 -07:00
Tariq Toukan d3398a4f18 net/mlx5e: RX, Prefetch the xdp_frame data area
A loaded XDP program might write to the xdp_frame data area,
prefetchw() it to avoid a potential cache miss.
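
A hedged sketch of the change in the linear RX path:

    dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
                                  frag_size, DMA_FROM_DEVICE);
    prefetchw(va);    /* xdp_frame data area, may be written by XDP */
    prefetch(data);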

Performance tests:
ConnectX-5, XDP_TX packet rate, single ring.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Before: 13,172,976 pps
After:  13,456,248 pps
2% gain.

Fixes: 22f4539881 ("net/mlx5e: Support XDP over Striding RQ")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:57 -07:00
Tariq Toukan dac0d15fff net/mlx5e: Re-order fields of struct mlx5e_xdpsq
In the downstream patch that adds support to XDP_REDIRECT-out,
the XDP xmit frame function doesn't share the same run context as
the NAPI that polls the XDP-SQ completion queue.

Hence, we need to re-order the XDP-SQ fields to avoid cacheline
false-sharing.

Take redirect_flush and doorbell out of DB, into separated
cachelines.

Add a cacheline breaker within the stats struct.
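
A hedged sketch of the layout idea (the fields are illustrative):

    struct mlx5e_xdpsq {
        /* dirtied @completion (NAPI context) */
        u16 cc;
        bool redirect_flush;

        /* dirtied @xmit (may run outside the SQ's NAPI) */
        u16 pc ____cacheline_aligned_in_smp;
        bool doorbell;

        struct mlx5e_xdpsq_stats *stats;
    } ____cacheline_aligned_in_smp;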

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:56 -07:00
Tariq Toukan 159d213134 net/mlx5e: Move XDP related code into new XDP files
Take XDP code out of the general EN header and RX file into
new XDP files.

Currently, the XDP-SQ resides only within an RQ and is used from a
single flow (XDP_TX) triggered upon RX completions.
In a downstream patch, an additional type of XDP-SQ instance will be
introduced and used for the XDP_REDIRECT flow, totally unrelated to
the RX context.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:53 -07:00
Tariq Toukan cb5189d173 net/mlx5e: Do not recycle RX pages in interface down flow
Keep all page-pool recycle calls within NAPI context.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:51 -07:00
Tariq Toukan afab995e06 net/mlx5e: Replace call to MPWQE free with dealloc in interface down flow
No need to expose the MPWQE free function to the control path.
The dealloc function is already exposed; use it.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:51 -07:00
Boris Pismenny b3ccf97813 net/mlx5e: IPsec, fix byte count in CQE
This patch fixes the byte count indication in CQE for processed IPsec
packets that contain a metadata header.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:28 -07:00
Boris Pismenny 00aebab27c net/mlx5e: TLS, add Innova TLS rx data path
Implement the TLS rx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

When the hardware loses synchronization, a special resync request
metadata message is used to request a resync.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Tariq Toukan b71ba6b46f net/mlx5e: Add counter for MPWQE filler strides
Add ethtool counter to indicate the number of strides consumed
by filler CQEs.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:19 -07:00
Tariq Toukan dc983f0e2b net/mlx5e: Add a counter for congested UMRs
Add per-ring and global ethtool counters for congested UMR requests.
These events indicate congestion in the HW UMR handlers.

Such an event is detected when there's an outstanding UMR post,
yet the SW has consumed at least two additional MPWQEs in the meanwhile.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:18 -07:00
Tariq Toukan cbe73aaeec net/mlx5e: Add XDP_TX completions statistics
Add per-ring and global ethtool counters for XDP_TX completions.
This helps us monitor and analyze XDP_TX flow performance.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:17 -07:00
Tariq Toukan c400028371 net/mlx5e: RX, Use existing WQ local variable
Local variable 'wq' already points to &sq->wq, use it.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:16 -07:00