/*
 * Copyright (c) 2015-2016, Mellanox Technologies. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer in the documentation and/or other materials
 *        provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#include <net/tc_act/tc_gact.h>
#include <net/pkt_cls.h>
#include <linux/mlx5/fs.h>
#include <net/vxlan.h>
#include <net/geneve.h>
#include <linux/bpf.h>
#include <linux/if_bridge.h>
#include <net/page_pool.h>
#include <net/xdp_sock.h>
#include "eswitch.h"
#include "en.h"
#include "en/txrx.h"
#include "en_tc.h"
#include "en_rep.h"
#include "en_accel/ipsec.h"
#include "en_accel/ipsec_rxtx.h"
#include "en_accel/en_accel.h"
#include "en_accel/tls.h"
#include "accel/ipsec.h"
#include "accel/tls.h"
#include "lib/vxlan.h"
#include "lib/clock.h"
#include "en/port.h"
#include "en/xdp.h"
#include "lib/eq.h"
#include "en/monitor_stats.h"
#include "en/health.h"
#include "en/params.h"
#include "en/xsk/umem.h"
#include "en/xsk/setup.h"
#include "en/xsk/rx.h"
#include "en/xsk/tx.h"
#include "en/hv_vhca_stats.h"
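
/* Striding RQ relies on UMR: verify the required device capabilities and
 * that the inline UMR WQE fits within the maximum supported SQ WQE size.
 */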
bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
{
        bool striding_rq_umr = MLX5_CAP_GEN(mdev, striding_rq) &&
                MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
                MLX5_CAP_ETH(mdev, reg_umr_sq);
        u16 max_wqe_sz_cap = MLX5_CAP_GEN(mdev, max_wqe_sz_sq);
        bool inline_umr = MLX5E_UMR_WQE_INLINE_SZ <= max_wqe_sz_cap;

        if (!striding_rq_umr)
                return false;
        if (!inline_umr) {
                mlx5_core_warn(mdev, "Cannot support Striding RQ: UMR WQE size (%d) exceeds maximum supported (%d).\n",
                               (int)MLX5E_UMR_WQE_INLINE_SZ, max_wqe_sz_cap);
                return false;
        }
        return true;
}
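
/* Set the default RQ size and log the resulting RQ configuration. */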
void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
                               struct mlx5e_params *params)
{
        params->log_rq_mtu_frames = is_kdump_kernel() ?
                MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE :
                MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;

        mlx5_core_info(mdev, "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
                       params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
                       params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ ?
                       BIT(mlx5e_mpwqe_get_log_rq_size(params, NULL)) :
                       BIT(params->log_rq_mtu_frames),
                       BIT(mlx5e_mpwqe_get_log_stride_size(mdev, params, NULL)),
                       MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS));
}
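
/* Striding RQ cannot be used without the HW capability, on IPsec-offload
 * devices, or with XDP when a linear MPWQE SKB cannot be built.
 */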
bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
                                struct mlx5e_params *params)
{
        if (!mlx5e_check_fragmented_striding_rq_cap(mdev))
                return false;

        if (MLX5_IPSEC_DEV(mdev))
                return false;

        if (params->xdp_prog) {
                /* XSK params are not considered here. If striding RQ is in use,
                 * and an XSK is being opened, mlx5e_rx_mpwqe_is_linear_skb will
                 * be called with the known XSK params.
                 */
                if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL))
                        return false;
        }

        return true;
}
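
/* Pick Striding RQ when both the device supports it and the pflag enables
 * it; otherwise fall back to the cyclic (legacy) RQ.
 */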
void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
{
        params->rq_wq_type = mlx5e_striding_rq_possible(mdev, params) &&
                MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ) ?
                MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
                MLX5_WQ_TYPE_CYCLIC;
}
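
/* Reflect the queried vport state in the netdev carrier state. */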
void mlx5e_update_carrier(struct mlx5e_priv *priv)
{
        struct mlx5_core_dev *mdev = priv->mdev;
        u8 port_state;

        port_state = mlx5_query_vport_state(mdev,
                                            MLX5_VPORT_STATE_OP_MOD_VNIC_VPORT,
                                            0);

        if (port_state == VPORT_STATE_UP) {
                netdev_info(priv->netdev, "Link up\n");
                netif_carrier_on(priv->netdev);
        } else {
                netdev_info(priv->netdev, "Link down\n");
                netif_carrier_off(priv->netdev);
        }
}

static void mlx5e_update_carrier_work(struct work_struct *work)
{
        struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
                                               update_carrier_work);

        mutex_lock(&priv->state_lock);
        if (test_bit(MLX5E_STATE_OPENED, &priv->state))
                if (priv->profile->update_carrier)
                        priv->profile->update_carrier(priv);
        mutex_unlock(&priv->state_lock);
}

void mlx5e_update_stats(struct mlx5e_priv *priv)
{
        int i;

        for (i = mlx5e_num_stats_grps - 1; i >= 0; i--)
                if (mlx5e_stats_grps[i].update_stats)
                        mlx5e_stats_grps[i].update_stats(priv);
}

void mlx5e_update_ndo_stats(struct mlx5e_priv *priv)
{
        int i;

        for (i = mlx5e_num_stats_grps - 1; i >= 0; i--)
                if (mlx5e_stats_grps[i].update_stats_mask &
                    MLX5E_NDO_UPDATE_STATS)
                        mlx5e_stats_grps[i].update_stats(priv);
}

static void mlx5e_update_stats_work(struct work_struct *work)
{
        struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
                                               update_stats_work);

        mutex_lock(&priv->state_lock);
        priv->profile->update_stats(priv);
        mutex_unlock(&priv->state_lock);
}
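
/* Schedule a stats update on the driver workqueue, unless the netdev
 * profile has no stats callback or the interface is being destroyed.
 */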
void mlx5e_queue_update_stats(struct mlx5e_priv *priv)
{
        if (!priv->profile->update_stats)
                return;

        if (unlikely(test_bit(MLX5E_STATE_DESTROYING, &priv->state)))
                return;

        queue_work(priv->wq, &priv->update_stats_work);
}
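
/* Firmware event notifier: on port up/down events, schedule a carrier
 * update on the driver workqueue.
 */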
static int async_event(struct notifier_block *nb, unsigned long event, void *data)
{
        struct mlx5e_priv *priv = container_of(nb, struct mlx5e_priv, events_nb);
        struct mlx5_eqe *eqe = data;

        if (event != MLX5_EVENT_TYPE_PORT_CHANGE)
                return NOTIFY_DONE;

        switch (eqe->sub_type) {
        case MLX5_PORT_CHANGE_SUBTYPE_DOWN:
        case MLX5_PORT_CHANGE_SUBTYPE_ACTIVE:
                queue_work(priv->wq, &priv->update_carrier_work);
                break;
        default:
                return NOTIFY_DONE;
        }

        return NOTIFY_OK;
}

static void mlx5e_enable_async_events(struct mlx5e_priv *priv)
{
        priv->events_nb.notifier_call = async_event;
        mlx5_notifier_register(priv->mdev, &priv->events_nb);
}

static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
{
        mlx5_notifier_unregister(priv->mdev, &priv->events_nb);
}
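
/* Pre-build the constant control and UMR-control segments of the UMR WQE
 * that the RQ posts to the ICO SQ for Striding RQ (MPWQE) memory mapping.
 */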
static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
                                       struct mlx5e_icosq *sq,
                                       struct mlx5e_umr_wqe *wqe)
{
        struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
        struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
        u8 ds_cnt = DIV_ROUND_UP(MLX5E_UMR_WQE_INLINE_SZ, MLX5_SEND_WQE_DS);

        cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
                                      ds_cnt);
        cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
        cseg->imm       = rq->mkey_be;

        ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN | MLX5_UMR_INLINE;
        ucseg->xlt_octowords =
                cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
        ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
}
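
/* Allocate per-WQE bookkeeping for the Striding RQ and pre-build its UMR WQE. */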
static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
                                     struct mlx5e_channel *c)
{
        int wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq);

        rq->mpwqe.info = kvzalloc_node(array_size(wq_sz,
                                                  sizeof(*rq->mpwqe.info)),
                                       GFP_KERNEL, cpu_to_node(c->cpu));
        if (!rq->mpwqe.info)
                return -ENOMEM;

        mlx5e_build_umr_wqe(rq, &c->icosq, &rq->mpwqe.umr_wqe);

        return 0;
}
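
/* Create a UMR-enabled memory key covering npages pages of page_shift size;
 * the RQ uses it to remap its pages via UMR WQEs.
 */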
static int mlx5e_create_umr_mkey(struct mlx5_core_dev *mdev,
                                 u64 npages, u8 page_shift,
                                 struct mlx5_core_mkey *umr_mkey)
{
        int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
        void *mkc;
        u32 *in;
        int err;

        in = kvzalloc(inlen, GFP_KERNEL);
        if (!in)
                return -ENOMEM;

        mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);

        MLX5_SET(mkc, mkc, free, 1);
        MLX5_SET(mkc, mkc, umr_en, 1);
        MLX5_SET(mkc, mkc, lw, 1);
        MLX5_SET(mkc, mkc, lr, 1);
        MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);

        MLX5_SET(mkc, mkc, qpn, 0xffffff);
        MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn);
        MLX5_SET64(mkc, mkc, len, npages << page_shift);
        MLX5_SET(mkc, mkc, translations_octword_size,
                 MLX5_MTT_OCTW(npages));
        MLX5_SET(mkc, mkc, log_page_size, page_shift);

        err = mlx5_core_create_mkey(mdev, umr_mkey, in, inlen);

        kvfree(in);
        return err;
}

static int mlx5e_create_rq_umr_mkey(struct mlx5_core_dev *mdev, struct mlx5e_rq *rq)
{
        u64 num_mtts = MLX5E_REQUIRED_MTTS(mlx5_wq_ll_get_size(&rq->mpwqe.wq));

        return mlx5e_create_umr_mkey(mdev, num_mtts, PAGE_SHIFT, &rq->umr_mkey);
}

static inline u64 mlx5e_get_mpwqe_offset(struct mlx5e_rq *rq, u16 wqe_ix)
{
        return (wqe_ix << MLX5E_LOG_ALIGNED_MPWQE_PPW) << PAGE_SHIFT;
}
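
/* Statically partition the legacy (cyclic) RQ WQE fragments across pages:
 * fragments are packed into a page until the next one would cross the page
 * boundary, and the last fragment placed in each page is flagged
 * (last_in_page).
 */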
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
|
|
|
|
{
|
2019-08-01 07:52:54 -06:00
|
|
|
struct mlx5e_wqe_frag_info next_frag = {};
|
|
|
|
struct mlx5e_wqe_frag_info *prev = NULL;
|
2018-05-02 09:23:58 -06:00
|
|
|
int i;
|
|
|
|
|
|
|
|
next_frag.di = &rq->wqe.di[0];
|
|
|
|
|
|
|
|
for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
|
|
|
|
struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
|
|
|
|
struct mlx5e_wqe_frag_info *frag =
|
|
|
|
&rq->wqe.frags[i << rq->wqe.info.log_num_frags];
|
|
|
|
int f;
|
|
|
|
|
|
|
|
for (f = 0; f < rq->wqe.info.num_frags; f++, frag++) {
|
|
|
|
if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE) {
|
|
|
|
next_frag.di++;
|
|
|
|
next_frag.offset = 0;
|
|
|
|
if (prev)
|
|
|
|
prev->last_in_page = true;
|
|
|
|
}
|
|
|
|
*frag = next_frag;
|
|
|
|
|
|
|
|
/* prepare next */
|
|
|
|
next_frag.offset += frag_info[f].frag_stride;
|
|
|
|
prev = frag;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (prev)
|
|
|
|
prev->last_in_page = true;
|
|
|
|
}
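For a concrete picture of the partition the loop above produces, here is a
small stand-alone sketch (hypothetical sizes chosen for illustration, not
driver code) that prints the frag-to-page mapping for two 2048-byte frags
per WQE on a 4096-byte page:

	#include <stdio.h>

	#define PAGE_SIZE	4096
	#define NUM_WQES	4
	#define FRAGS_PER_WQE	2
	#define FRAG_STRIDE	2048	/* illustration only */

	int main(void)
	{
		int page = 0, offset = 0;

		for (int i = 0; i < NUM_WQES; i++) {
			for (int f = 0; f < FRAGS_PER_WQE; f++) {
				if (offset + FRAG_STRIDE > PAGE_SIZE) {
					page++;		/* previous frag was last_in_page */
					offset = 0;
				}
				printf("wqe %d frag %d -> page %d offset %d\n",
				       i, f, page, offset);
				offset += FRAG_STRIDE;
			}
		}
		return 0;
	}

With these numbers every WQE ends up with both frags on its own page, and
the second frag of each WQE is the one marked last_in_page.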
|
|
|
|
|
|
|
|
static int mlx5e_init_di_list(struct mlx5e_rq *rq,
|
|
|
|
int wq_sz, int cpu)
|
|
|
|
{
|
|
|
|
int len = wq_sz << rq->wqe.info.log_num_frags;
|
|
|
|
|
treewide: Use array_size() in kvzalloc_node()
The kvzalloc_node() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:
kvzalloc_node(a * b, gfp, node)
with:
kvzalloc_node(array_size(a, b), gfp, node)
as well as handling cases of:
kvzalloc_node(a * b * c, gfp, node)
with:
kvzalloc_node(array3_size(a, b, c), gfp, node)
This does, however, attempt to ignore constant size factors like:
kvzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kvzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kvzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
kvzalloc_node(
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
kvzalloc_node(
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kvzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kvzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kvzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kvzalloc_node(C1 * C2 * C3, ...)
|
kvzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@
(
kvzalloc_node(C1 * C2, ...)
|
kvzalloc_node(
- E1 * E2
+ array_size(E1, E2)
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 15:28:04 -06:00
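The point of the conversion is overflow safety: array_size(), from
<linux/overflow.h>, saturates to SIZE_MAX when the multiplication would
overflow, so the allocation fails cleanly instead of silently returning an
undersized buffer. A generic before/after sketch of the pattern being
rewritten ('buf', 'count' and 'node' are placeholders, not lines from this
driver):

	/* Before: a large 'count' can overflow the multiplication and
	 * yield a too-small allocation.
	 */
	buf = kvzalloc_node(count * sizeof(*buf), GFP_KERNEL, node);

	/* After: array_size() saturates to SIZE_MAX on overflow, so
	 * kvzalloc_node() returns NULL instead of an undersized buffer.
	 */
	buf = kvzalloc_node(array_size(count, sizeof(*buf)), GFP_KERNEL, node);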
|
|
|
rq->wqe.di = kvzalloc_node(array_size(len, sizeof(*rq->wqe.di)),
|
2018-05-02 09:23:58 -06:00
|
|
|
GFP_KERNEL, cpu_to_node(cpu));
|
|
|
|
if (!rq->wqe.di)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
mlx5e_init_frags_partition(rq);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_free_di_list(struct mlx5e_rq *rq)
|
|
|
|
{
|
|
|
|
kvfree(rq->wqe.di);
|
|
|
|
}
|
|
|
|
|
2019-06-26 14:21:40 -06:00
|
|
|
static void mlx5e_rq_err_cqe_work(struct work_struct *recover_work)
|
|
|
|
{
|
|
|
|
struct mlx5e_rq *rq = container_of(recover_work, struct mlx5e_rq, recover_work);
|
|
|
|
|
|
|
|
mlx5e_reporter_rq_cqe_err(rq);
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} used to actually create/destroy the SQ
in FW, so we rename them to align the function names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
static int mlx5e_alloc_rq(struct mlx5e_channel *c,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be re-enabled at any time.
The validation of XSK parameters typically happens when the XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk_param *xsk,
|
|
|
|
struct xdp_umem *umem,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_rq_param *rqp,
|
2017-03-24 15:52:12 -06:00
|
|
|
struct mlx5e_rq *rq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding the case
where SKB put_page returns the page through the page allocator, and, at
the same time, have pages getting returned to the page_pool from
ndo_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good, especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both a
rhashtable ID-lookup and a locked return into the (page_pool) ptr_ring.
(It is the plan to remove these per-frame operations in a follow-up
patchset.)
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap). And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single-flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason the max target isn't reached in the cross-CPU test is likely
due to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset.
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
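A minimal sketch of how a page pool is typically created from such a
pp_params structure (field names follow the page_pool API of this era;
pool_size, cpu and pdev stand in for the corresponding RQ parameters, so
treat this as an illustration rather than the exact driver code):

	struct page_pool_params pp_params = { 0 };
	struct page_pool *pool;

	pp_params.order		= 0;			/* order-0 pages only */
	pp_params.pool_size	= pool_size;		/* sized to the RQ */
	pp_params.nid		= cpu_to_node(cpu);	/* NUMA-local pages */
	pp_params.dev		= pdev;			/* device used for DMA mapping */
	pp_params.dma_dir	= DMA_FROM_DEVICE;	/* DMA_BIDIRECTIONAL with XDP */

	pool = page_pool_create(&pp_params);
	if (IS_ERR(pool))	/* ERR_PTR convention, per the V2 note above */
		return PTR_ERR(pool);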
|
|
|
struct page_pool_params pp_params = { 0 };
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2016-12-21 08:24:35 -07:00
|
|
|
void *rqc = rqp->rqc;
|
2015-05-28 13:28:48 -06:00
|
|
|
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
2019-06-26 08:35:38 -06:00
|
|
|
u32 num_xsk_frames = 0;
|
|
|
|
u32 rq_xdp_ix;
|
2018-05-02 09:23:58 -06:00
|
|
|
u32 pool_size;
|
2015-05-28 13:28:48 -06:00
|
|
|
int wq_sz;
|
|
|
|
int err;
|
|
|
|
int i;
|
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
rqp->wq.db_numa_node = cpu_to_node(c->cpu);
|
2015-07-23 14:35:57 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
rq->wq_type = params->rq_wq_type;
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
anyway fall back to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme opens the possibility of using a page cache
of a fair size that doesn't inflate the memory footprint, which will
largely compensate for the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
|
|
|
rq->pdev = c->pdev;
|
|
|
|
rq->netdev = c->netdev;
|
2017-03-14 11:43:52 -06:00
|
|
|
rq->tstamp = c->tstamp;
|
2017-08-15 04:46:04 -06:00
|
|
|
rq->clock = &mdev->clock;
|
2016-09-15 07:08:36 -06:00
|
|
|
rq->channel = c;
|
|
|
|
rq->ix = c->ix;
|
2017-03-14 11:43:52 -06:00
|
|
|
rq->mdev = mdev;
|
net/mlx5e: RX, verify received packet size in Linear Striding RQ
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that a WQE (RX descriptor) can be used for many packets, so the WQE is
much bigger than the MTU. In virtualization setups where the port MTU can
be larger than the VF MTU, a received packet that is bigger than the MTU
won't be dropped by HW for a too-small receive WQE. If we use a linear SKB
in striding RQ, since each stride has room for an MTU-sized payload and
the skb info, an oversized packet can lead to a crash by crossing the
allocated page boundary upon the call to build_skb. So the driver needs to
check the packet size and drop such packets.
Introduce a new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.
As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 22:31:10 -06:00
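The check itself is small; a sketch of the idea follows (the exact counter
field and surrounding function are assumptions here, hedged against the
driver source):

	/* In the linear Striding RQ receive path, before build_skb(): */
	if (unlikely(cqe_bcnt > rq->hw_mtu)) {
		/* Packet is larger than the configured MTU: building a
		 * linear SKB around it would run past the allocated page,
		 * so count it and drop it in software.
		 */
		rq->stats->oversize_pkts_sw_drop++;
		return NULL;	/* no SKB is built for this packet */
	}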
|
|
|
rq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
|
2019-06-26 08:35:33 -06:00
|
|
|
rq->xdpsq = &c->rq_xdpsq;
|
2019-06-26 08:35:38 -06:00
|
|
|
rq->umem = umem;
|
|
|
|
|
|
|
|
if (rq->umem)
|
|
|
|
rq->stats = &c->priv->channel_stats[c->ix].xskrq;
|
|
|
|
else
|
|
|
|
rq->stats = &c->priv->channel_stats[c->ix].rq;
|
2019-06-26 14:21:40 -06:00
|
|
|
INIT_WORK(&rq->recover_work, mlx5e_rq_err_cqe_work);
|
2016-11-18 17:45:00 -07:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
|
2016-11-18 17:45:00 -07:00
|
|
|
if (IS_ERR(rq->xdp_prog)) {
|
|
|
|
err = PTR_ERR(rq->xdp_prog);
|
|
|
|
rq->xdp_prog = NULL;
|
|
|
|
goto err_rq_wq_destroy;
|
|
|
|
}
|
2016-09-15 07:08:36 -06:00
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
rq_xdp_ix = rq->ix;
|
|
|
|
if (xsk)
|
|
|
|
rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
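	/* The adjustment above keeps regular and XSK RQs in a single
	 * xdp_rxq_info ID namespace: channel ix registers as ix for the
	 * regular RQ and as ix + num_channels * MLX5E_RQ_GROUP_XSK for the
	 * XSK RQ, matching the split described in the XSK commit message.
	 */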
|
|
|
|
err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
|
2018-01-10 00:30:53 -07:00
|
|
|
if (err < 0)
|
2018-01-03 03:25:18 -07:00
|
|
|
goto err_rq_wq_destroy;
|
|
|
|
|
2017-01-31 07:48:59 -07:00
|
|
|
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
|
2019-06-26 08:35:38 -06:00
|
|
|
rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
|
|
|
|
rq->buff.umem_headroom = xsk ? xsk->headroom : 0;
|
2018-04-17 08:46:27 -06:00
|
|
|
pool_size = 1 << params->log_rq_mtu_frames;
|
2016-09-21 03:19:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
switch (rq->wq_type) {
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve one received
packet of any size up to MTU, or up to 64K when device LRO is enabled,
making it very wasteful when dealing with small packets or when device
LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQE memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected, since the driver can now receive as many small packets (<=64B) as
the total number of strides in the ring (default = 2048 * 16), vs. 1024
(the default ring size regardless of packet size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
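The footprint numbers above follow directly from the default
configuration; a quick worked check using only the values quoted in the
message:

	before: 1024 WQEs * ~1500 B (MTU)        ~= 1.5 MB per ring
	after:    16 WQEs * 2048 strides * 64 B   =  2 MB per ring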
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
2018-04-02 08:23:14 -06:00
|
|
|
err = mlx5_wq_ll_create(mdev, &rqp->wq, rqc_wq, &rq->mpwqe.wq,
|
|
|
|
&rq->wq_ctrl);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
rq->mpwqe.wq.db = &rq->mpwqe.wq.db[MLX5_RCV_DBR];
|
|
|
|
|
|
|
|
wq_sz = mlx5_wq_ll_get_size(&rq->mpwqe.wq);
|
2018-04-17 08:46:27 -06:00
|
|
|
|
2019-06-26 08:35:38 -06:00
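As a rough illustration of the split RQ ID namespace described above, a
lookup between RQ IDs and channels could look like the sketch below; the
names and helpers are assumptions for illustration, not the driver's code.
/* Sketch only: lower half of the ID space -> regular RQs, upper half ->
 * XSK RQs of the same channels. Identifier names are assumed.
 */
#include <stdbool.h>

struct rq_id_space {
	unsigned int num_channels; /* fixed while any zero-copy socket is active */
};

static inline bool rq_id_is_xsk(const struct rq_id_space *s, unsigned int rq_id)
{
	return rq_id >= s->num_channels;
}

static inline unsigned int rq_id_to_channel(const struct rq_id_space *s,
					    unsigned int rq_id)
{
	/* Regular RQs occupy IDs 0..num_channels-1, XSK RQs occupy
	 * num_channels..2*num_channels-1; both map back to the same channel.
	 */
	return rq_id_is_xsk(s, rq_id) ? rq_id - s->num_channels : rq_id;
}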
|
|
|
if (xsk)
|
|
|
|
num_xsk_frames = wq_sz <<
|
|
|
|
mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk);
|
|
|
|
|
|
|
|
pool_size = MLX5_MPWRQ_PAGES_PER_WQE <<
|
|
|
|
mlx5e_mpwqe_get_log_rq_size(params, xsk);
|
2018-04-02 08:23:14 -06:00
|
|
|
|
2017-07-17 03:27:26 -06:00
|
|
|
rq->post_wqes = mlx5e_post_rx_mpwqes;
|
2016-06-30 08:34:46 -06:00
|
|
|
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to serve
one received packet of any size up to MTU (or 64K when device LRO
is enabled), making it very wasteful when dealing with small packets
or when device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
This is expected, since the driver can now receive as many small packets
(<=64B) as the total number of strides in the ring (default = 2048 * 16),
vs. 1024 (the default ring size regardless of packet size) before this
feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
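A quick back-of-the-envelope check of the footprint figure quoted above,
using the default configuration; the macro names are illustrative, not the
driver's.
#include <stdio.h>

#define MPWQE_NUM_WQES    16
#define MPWQE_STRIDES     2048   /* strides per WQE */
#define MPWQE_STRIDE_SIZE 64     /* bytes per stride */

int main(void)
{
	unsigned long bytes = (unsigned long)MPWQE_NUM_WQES *
			      MPWQE_STRIDES * MPWQE_STRIDE_SIZE;

	/* 16 * 2048 * 64 = 2097152 bytes, i.e. 2MB per ring */
	printf("default MPWQE ring footprint: %lu bytes (%lu MB)\n",
	       bytes, bytes >> 20);
	return 0;
}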
|
|
|
|
2017-04-12 21:37:03 -06:00
|
|
|
rq->handle_rx_cqe = c->priv->profile->rx_handlers.handle_rx_cqe_mpwqe;
|
2017-06-19 05:04:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
if (MLX5_IPSEC_DEV(mdev)) {
|
|
|
|
err = -EINVAL;
|
|
|
|
netdev_err(c->netdev, "MPWQE RQ with IPSec offload not supported\n");
|
|
|
|
goto err_rq_wq_destroy;
|
|
|
|
}
|
|
|
|
#endif
|
2017-04-12 21:37:03 -06:00
|
|
|
if (!rq->handle_rx_cqe) {
|
|
|
|
err = -EINVAL;
|
|
|
|
netdev_err(c->netdev, "RX handler of MPWQE RQ is not set, err %d\n", err);
|
|
|
|
goto err_rq_wq_destroy;
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
rq->mpwqe.skb_from_cqe_mpwrq = xsk ?
|
|
|
|
mlx5e_xsk_skb_from_cqe_mpwrq_linear :
|
|
|
|
mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ?
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_linear :
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_nonlinear;
|
|
|
|
|
|
|
|
rq->mpwqe.log_stride_sz = mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
|
|
|
|
rq->mpwqe.num_strides =
|
|
|
|
BIT(mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk));
|
2016-09-21 03:19:42 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_rq_umr_mkey(mdev, rq);
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would anyway
fall back to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme makes it possible to use a page cache
of a fair size that doesn't inflate the memory footprint, which will
largely recover the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
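A toy sketch of the fair-size page cache mentioned above: a small ring of
recycled order-0 pages consulted before falling back to the allocator. The
slot count and all names are assumptions, not the driver's.
#include <stdlib.h>

#define RX_PAGE_CACHE_SLOTS 128
#define RX_PAGE_SIZE        4096

struct rx_page_cache {
	void        *slots[RX_PAGE_CACHE_SLOTS];
	unsigned int head, tail;   /* head == tail means empty */
};

static void *rx_page_get(struct rx_page_cache *c)
{
	if (c->head != c->tail) {  /* cache hit: reuse a recycled page */
		void *page = c->slots[c->tail];

		c->tail = (c->tail + 1) % RX_PAGE_CACHE_SLOTS;
		return page;
	}
	return aligned_alloc(RX_PAGE_SIZE, RX_PAGE_SIZE); /* order-0 fallback */
}

static void rx_page_put(struct rx_page_cache *c, void *page)
{
	unsigned int next = (c->head + 1) % RX_PAGE_CACHE_SLOTS;

	if (next == c->tail) {     /* cache full: release the page */
		free(page);
		return;
	}
	c->slots[c->head] = page;
	c->head = next;
}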
|
|
|
if (err)
|
|
|
|
goto err_rq_wq_destroy;
|
2016-11-30 08:59:39 -07:00
|
|
|
rq->mkey_be = cpu_to_be32(rq->umr_mkey.key);
|
|
|
|
|
|
|
|
err = mlx5e_rq_alloc_mpwqe_info(rq, c);
|
|
|
|
if (err)
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing support for HW LRO in the legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
The same factors apply to the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
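A rough sketch of the first-allocates/last-frees rule described above, using
assumed stub types rather than the driver's mlx5e structures.
#include <stdlib.h>

struct dma_info_stub {
	void        *page;  /* one order-0 page shared by several frags */
	unsigned int refs;  /* frags of this page still in flight */
};

struct wqe_frag_stub {
	struct dma_info_stub *di;
	unsigned int          offset; /* fixed at RQ creation, reused every WQ cycle */
};

static int frag_post(struct wqe_frag_stub *f, size_t page_size)
{
	/* The first frag of the bulk allocates the shared page. */
	if (f->di->refs++ == 0) {
		f->di->page = malloc(page_size);
		if (!f->di->page) {
			f->di->refs = 0;
			return -1;
		}
	}
	return 0;
}

static void frag_complete(struct wqe_frag_stub *f)
{
	/* The last completion of the bulk releases the shared page; this is
	 * why frags sharing one dma_info must be posted within the same NAPI.
	 */
	if (--f->di->refs == 0)
		free(f->di->page);
}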
|
|
|
goto err_free;
|
2016-04-20 13:02:13 -06:00
|
|
|
break;
|
2018-04-02 08:31:31 -06:00
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
|
|
|
err = mlx5_wq_cyc_create(mdev, &rqp->wq, rqc_wq, &rq->wqe.wq,
|
|
|
|
&rq->wq_ctrl);
|
2018-04-02 08:23:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
rq->wqe.wq.db = &rq->wqe.wq.db[MLX5_RCV_DBR];
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
wq_sz = mlx5_wq_cyc_get_size(&rq->wqe.wq);
|
2018-04-02 08:23:14 -06:00
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
if (xsk)
|
|
|
|
num_xsk_frames = wq_sz << rq->wqe.info.log_num_frags;
|
|
|
|
|
2018-05-02 09:23:58 -06:00
|
|
|
rq->wqe.info = rqp->frags_info;
|
|
|
|
rq->wqe.frags =
|
treewide: Use array_size() in kvzalloc_node()
The kvzalloc_node() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:
kvzalloc_node(a * b, gfp, node)
with:
kvzalloc_node(array_size(a, b), gfp, node)
as well as handling cases of:
kvzalloc_node(a * b * c, gfp, node)
with:
kvzalloc_node(array3_size(a, b, c), gfp, node)
This does, however, attempt to ignore constant size factors like:
kvzalloc_node(4 * 1024, gfp, node)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kvzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kvzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
kvzalloc_node(
- sizeof(TYPE) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT_ID
+ array_size(COUNT_ID, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT_ID)
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT_ID
+ array_size(COUNT_ID, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT_CONST)
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT_CONST
+ array_size(COUNT_CONST, sizeof(THING))
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
kvzalloc_node(
- SIZE * COUNT
+ array_size(COUNT, SIZE)
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kvzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kvzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kvzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kvzalloc_node(C1 * C2 * C3, ...)
|
kvzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@
(
kvzalloc_node(C1 * C2, ...)
|
kvzalloc_node(
- E1 * E2
+ array_size(E1, E2)
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 15:28:04 -06:00
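For context, array_size() from <linux/overflow.h> saturates to SIZE_MAX when
the multiplication would overflow, so kvzalloc_node() fails cleanly instead
of returning an undersized buffer. A minimal, hypothetical usage along the
lines of the conversion above:
#include <linux/mm.h>
#include <linux/overflow.h>
#include <linux/slab.h>

/* Hypothetical helper, for illustration only; not part of the driver. */
static void *alloc_elem_array(size_t nelem, size_t elem_size, int node)
{
	/* On overflow, array_size() returns SIZE_MAX and the allocation
	 * fails, rather than silently allocating too little memory.
	 */
	return kvzalloc_node(array_size(nelem, elem_size), GFP_KERNEL, node);
}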
|
|
|
kvzalloc_node(array_size(sizeof(*rq->wqe.frags),
|
|
|
|
(wq_sz << rq->wqe.info.log_num_frags)),
|
2018-05-02 09:23:58 -06:00
|
|
|
GFP_KERNEL, cpu_to_node(c->cpu));
|
2018-06-04 20:42:56 -06:00
|
|
|
if (!rq->wqe.frags) {
|
|
|
|
err = -ENOMEM;
|
2018-05-02 09:23:58 -06:00
|
|
|
goto err_free;
|
2018-06-04 20:42:56 -06:00
|
|
|
}
|
2018-05-02 09:23:58 -06:00
|
|
|
|
2019-03-07 10:30:30 -07:00
|
|
|
err = mlx5e_init_di_list(rq, wq_sz, c->cpu);
|
2018-05-02 09:23:58 -06:00
|
|
|
if (err)
|
|
|
|
goto err_free;
|
2019-06-26 08:35:38 -06:00
|
|
|
|
2017-07-17 03:27:26 -06:00
|
|
|
rq->post_wqes = mlx5e_post_rx_wqes;
|
2016-06-30 08:34:46 -06:00
|
|
|
rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
|
2016-04-20 13:02:13 -06:00
|
|
|
|
2017-06-19 05:04:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
if (c->priv->ipsec)
|
|
|
|
rq->handle_rx_cqe = mlx5e_ipsec_handle_rx_cqe;
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
rq->handle_rx_cqe = c->priv->profile->rx_handlers.handle_rx_cqe;
|
2017-04-12 21:37:03 -06:00
|
|
|
if (!rq->handle_rx_cqe) {
|
|
|
|
err = -EINVAL;
|
|
|
|
netdev_err(c->netdev, "RX handler of RQ is not set, err %d\n", err);
|
2018-05-02 09:23:58 -06:00
|
|
|
goto err_free;
|
2017-04-12 21:37:03 -06:00
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
rq->wqe.skb_from_cqe = xsk ?
|
|
|
|
mlx5e_xsk_skb_from_cqe_linear :
|
|
|
|
mlx5e_rx_is_linear_skb(params, NULL) ?
|
|
|
|
mlx5e_skb_from_cqe_linear :
|
|
|
|
mlx5e_skb_from_cqe_nonlinear;
|
2016-09-15 07:08:36 -06:00
|
|
|
rq->mkey_be = c->mkey_be;
|
2016-04-20 13:02:13 -06:00
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
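The split ID namespace described above can be illustrated with a small standalone sketch. The helper names and the fixed channel bound are hypothetical, not the driver's API, but they show why the channel count must stay frozen while zero-copy sockets are active.

/* Hypothetical sketch of a split RQ ID namespace: the lower half of the
 * IDs addresses regular RQs, the upper half the XSK RQs of the same
 * channels. Illustrative names only.
 */
#include <assert.h>
#include <stdbool.h>

#define NUM_CHANNELS 64u   /* assumed fixed while any XSK socket is active */

static unsigned int regular_rq_id(unsigned int ch) { return ch; }
static unsigned int xsk_rq_id(unsigned int ch)     { return NUM_CHANNELS + ch; }
static bool id_is_xsk(unsigned int id)             { return id >= NUM_CHANNELS; }

static unsigned int id_to_channel(unsigned int id)
{
        return id_is_xsk(id) ? id - NUM_CHANNELS : id;
}

int main(void)
{
        /* Changing NUM_CHANNELS would move the split point and invalidate
         * every existing XSK RQ ID, which is why the channel count cannot
         * change while zero-copy sockets are active.
         */
        assert(id_to_channel(regular_rq_id(3)) == 3);
        assert(id_is_xsk(xsk_rq_id(3)));
        assert(id_to_channel(xsk_rq_id(3)) == 3);
        return 0;
}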
|
|
|
if (xsk) {
|
|
|
|
err = mlx5e_xsk_resize_reuseq(umem, num_xsk_frames);
|
|
|
|
if (unlikely(err)) {
|
|
|
|
mlx5_core_err(mdev, "Unable to allocate the Reuse Ring for %u frames\n",
|
|
|
|
num_xsk_frames);
|
|
|
|
goto err_free;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq->zca.free = mlx5e_xsk_zca_free;
|
|
|
|
err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
|
|
|
|
MEM_TYPE_ZERO_COPY,
|
|
|
|
&rq->zca);
|
|
|
|
} else {
|
|
|
|
/* Create a page_pool and register it with rxq */
|
|
|
|
pp_params.order = 0;
|
|
|
|
pp_params.flags = 0; /* No internal DMA mapping in page_pool */
|
|
|
|
pp_params.pool_size = pool_size;
|
|
|
|
pp_params.nid = cpu_to_node(c->cpu);
|
|
|
|
pp_params.dev = c->pdev;
|
|
|
|
pp_params.dma_dir = rq->buff.map_dir;
|
|
|
|
|
|
|
|
/* page_pool can be used even when there is no rq->xdp_prog,
|
|
|
|
* given page_pool does not handle DMA mapping, there is no
|
|
|
|
* required state to clear. And page_pool gracefully handles
|
|
|
|
* elevated refcnt.
|
|
|
|
*/
|
|
|
|
rq->page_pool = page_pool_create(&pp_params);
|
|
|
|
if (IS_ERR(rq->page_pool)) {
|
|
|
|
err = PTR_ERR(rq->page_pool);
|
|
|
|
rq->page_pool = NULL;
|
|
|
|
goto err_free;
|
|
|
|
}
|
|
|
|
err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
|
|
|
|
MEM_TYPE_PAGE_POOL, rq->page_pool);
|
2018-04-17 08:46:07 -06:00
|
|
|
}
|
2019-06-26 08:35:38 -06:00
|
|
|
if (err)
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing support for HW LRO in the legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping at RQ creation time, instead of handling it
in the datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually also good for performance reasons,
hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
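The non-linear path described above (order-0 pages only, up to 4 frags and roughly 10KB of MTU) can be illustrated with a rough userspace sketch. PAGE_SZ, MAX_FRAGS and frags_for_mtu below are illustrative assumptions, not driver code, and ignore headroom and padding.

/* Rough illustration of splitting an MTU-sized receive buffer into
 * order-0 page fragments, up to a small fixed maximum.
 */
#include <stdio.h>

#define PAGE_SZ   4096u
#define MAX_FRAGS 4u

/* Returns the number of frags needed, or 0 if the MTU is not supported. */
static unsigned int frags_for_mtu(unsigned int mtu)
{
        unsigned int frags = (mtu + PAGE_SZ - 1) / PAGE_SZ;

        return frags <= MAX_FRAGS ? frags : 0;
}

int main(void)
{
        printf("MTU 1500  -> %u frag(s) (linear SKB fits in one page)\n",
               frags_for_mtu(1500));
        printf("MTU 9000  -> %u frag(s) (jumbo, non-linear SKB)\n",
               frags_for_mtu(9000));
        printf("MTU 10240 -> %u frag(s) (around the 10KB upper bound)\n",
               frags_for_mtu(10240));
        return 0;
}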
|
|
|
goto err_free;
|
2018-04-17 08:46:07 -06:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
for (i = 0; i < wq_sz; i++) {
|
2017-06-25 07:28:46 -06:00
|
|
|
if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5e_rx_wqe_ll *wqe =
|
2018-04-02 08:23:14 -06:00
|
|
|
mlx5_wq_ll_get_wqe(&rq->mpwqe.wq, i);
|
2018-05-02 09:23:58 -06:00
|
|
|
u32 byte_count =
|
|
|
|
rq->mpwqe.num_strides << rq->mpwqe.log_stride_sz;
|
2017-12-20 02:56:35 -07:00
|
|
|
u64 dma_offset = mlx5e_get_mpwqe_offset(rq, i);
|
2017-06-25 07:28:46 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
wqe->data[0].addr = cpu_to_be64(dma_offset + rq->buff.headroom);
|
|
|
|
wqe->data[0].byte_count = cpu_to_be32(byte_count);
|
|
|
|
wqe->data[0].lkey = rq->mkey_be;
|
2018-04-02 08:23:14 -06:00
|
|
|
} else {
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5e_rx_wqe_cyc *wqe =
|
|
|
|
mlx5_wq_cyc_get_wqe(&rq->wqe.wq, i);
|
2018-05-02 09:23:58 -06:00
|
|
|
int f;
|
|
|
|
|
|
|
|
for (f = 0; f < rq->wqe.info.num_frags; f++) {
|
|
|
|
u32 frag_size = rq->wqe.info.arr[f].frag_size |
|
|
|
|
MLX5_HW_START_PADDING;
|
|
|
|
|
|
|
|
wqe->data[f].byte_count = cpu_to_be32(frag_size);
|
|
|
|
wqe->data[f].lkey = rq->mkey_be;
|
|
|
|
}
|
|
|
|
/* if num_frags is not a power of two, pad with an empty, invalid-lkey SGE */
|
|
|
|
if (rq->wqe.info.num_frags < (1 << rq->wqe.info.log_num_frags)) {
|
|
|
|
wqe->data[f].byte_count = 0;
|
|
|
|
wqe->data[f].lkey = cpu_to_be32(MLX5_INVALID_LKEY);
|
|
|
|
wqe->data[f].addr = 0;
|
|
|
|
}
|
2018-04-02 08:23:14 -06:00
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2018-01-09 14:06:17 -07:00
|
|
|
INIT_WORK(&rq->dim.work, mlx5e_rx_dim_work);
|
|
|
|
|
|
|
|
switch (params->rx_cq_moderation.cq_period_mode) {
|
|
|
|
case MLX5_CQ_PERIOD_MODE_START_FROM_CQE:
|
2018-11-05 03:07:52 -07:00
|
|
|
rq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_CQE;
|
2018-01-09 14:06:17 -07:00
|
|
|
break;
|
|
|
|
case MLX5_CQ_PERIOD_MODE_START_FROM_EQE:
|
|
|
|
default:
|
2018-11-05 03:07:52 -07:00
|
|
|
rq->dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
|
2018-01-09 14:06:17 -07:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Implement RX mapped page cache for page recycle
Instead of reallocating and mapping pages for RX data-path,
recycle already used pages in a per ring cache.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache
1% gain
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache
3.7% gain
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache
5.4% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:38 -06:00
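A minimal userspace sketch of such a per-ring cache, assuming a power-of-two size indexed by head/tail with a mask, in the spirit of the rq->page_cache fields initialized below and the drain loop later in mlx5e_free_rq. The names are illustrative only.

/* Sketch of a per-ring page recycle cache as a circular buffer. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 128u    /* must be a power of two */

struct page_cache {
        unsigned int head;            /* next entry to take */
        unsigned int tail;            /* next free slot */
        void *pages[CACHE_SIZE];
};

static bool cache_put(struct page_cache *c, void *page)
{
        unsigned int next_tail = (c->tail + 1) & (CACHE_SIZE - 1);

        if (next_tail == c->head)     /* full: caller must release the page */
                return false;
        c->pages[c->tail] = page;
        c->tail = next_tail;
        return true;
}

static void *cache_get(struct page_cache *c)
{
        void *page;

        if (c->head == c->tail)       /* empty: caller allocates a fresh page */
                return NULL;
        page = c->pages[c->head];
        c->head = (c->head + 1) & (CACHE_SIZE - 1);
        return page;
}

int main(void)
{
        struct page_cache c = { 0 };
        int dummy;

        assert(cache_get(&c) == NULL);
        assert(cache_put(&c, &dummy));
        assert(cache_get(&c) == &dummy);
        return 0;
}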
|
|
|
rq->page_cache.head = 0;
|
|
|
|
rq->page_cache.tail = 0;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
2018-05-02 09:23:58 -06:00
|
|
|
err_free:
|
|
|
|
switch (rq->wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(rq->mpwqe.info);
|
2018-05-02 09:23:58 -06:00
|
|
|
mlx5_core_destroy_mkey(mdev, &rq->umr_mkey);
|
|
|
|
break;
|
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
|
|
|
kvfree(rq->wqe.frags);
|
|
|
|
mlx5e_free_di_list(rq);
|
|
|
|
}
|
2016-11-30 08:59:39 -07:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
err_rq_wq_destroy:
|
2016-11-18 17:45:00 -07:00
|
|
|
if (rq->xdp_prog)
|
|
|
|
bpf_prog_put(rq->xdp_prog);
|
2018-01-03 03:25:18 -07:00
|
|
|
xdp_rxq_info_unreg(&rq->xdp_rxq);
|
2019-07-08 15:34:28 -06:00
|
|
|
page_pool_destroy(rq->page_pool);
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5_wq_destroy(&rq->wq_ctrl);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} used to actually create/destroy the SQ
in FW, so we rename them to align the function names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
static void mlx5e_free_rq(struct mlx5e_rq *rq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2016-09-15 07:08:38 -06:00
|
|
|
int i;
|
|
|
|
|
2016-09-21 03:19:46 -06:00
|
|
|
if (rq->xdp_prog)
|
|
|
|
bpf_prog_put(rq->xdp_prog);
|
|
|
|
|
2016-04-20 13:02:13 -06:00
|
|
|
switch (rq->wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(rq->mpwqe.info);
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5_core_destroy_mkey(rq->mdev, &rq->umr_mkey);
|
2016-04-20 13:02:13 -06:00
|
|
|
break;
|
2018-04-02 08:31:31 -06:00
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
2018-05-02 09:23:58 -06:00
|
|
|
kvfree(rq->wqe.frags);
|
|
|
|
mlx5e_free_di_list(rq);
|
2016-04-20 13:02:13 -06:00
|
|
|
}
|
|
|
|
|
2016-09-15 07:08:38 -06:00
|
|
|
for (i = rq->page_cache.head; i != rq->page_cache.tail;
|
|
|
|
i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) {
|
|
|
|
struct mlx5e_dma_info *dma_info = &rq->page_cache.page_cache[i];
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
/* With AF_XDP, page_cache is not used, so this loop is not
|
|
|
|
* entered, and it's safe to call mlx5e_page_release_dynamic
|
|
|
|
* directly.
|
|
|
|
*/
|
|
|
|
mlx5e_page_release_dynamic(rq, dma_info, false);
|
2016-09-15 07:08:38 -06:00
|
|
|
}
|
2019-06-18 07:05:42 -06:00
|
|
|
|
|
|
|
xdp_rxq_info_unreg(&rq->xdp_rxq);
|
2019-07-08 15:34:28 -06:00
|
|
|
page_pool_destroy(rq->page_pool);
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5_wq_destroy(&rq->wq_ctrl);
|
|
|
|
}
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
static int mlx5e_create_rq(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_rq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = rq->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
void *in;
|
|
|
|
void *rqc;
|
|
|
|
void *wq;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_rq_in) +
|
|
|
|
sizeof(u64) * rq->wq_ctrl.buf.npages;
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqc = MLX5_ADDR_OF(create_rq_in, in, ctx);
|
|
|
|
wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
|
|
|
|
|
|
|
memcpy(rqc, param->rqc, sizeof(param->rqc));
|
|
|
|
|
2015-07-29 06:05:43 -06:00
|
|
|
MLX5_SET(rqc, rqc, cqn, rq->cq.mcq.cqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
MLX5_SET(rqc, rqc, state, MLX5_RQC_STATE_RST);
|
|
|
|
MLX5_SET(wq, wq, log_wq_pg_sz, rq->wq_ctrl.buf.page_shift -
|
2015-07-29 06:05:40 -06:00
|
|
|
MLX5_ADAPTER_PAGE_SHIFT);
|
2015-05-28 13:28:48 -06:00
|
|
|
MLX5_SET64(wq, wq, dbr_addr, rq->wq_ctrl.db.dma);
|
|
|
|
|
2018-04-04 03:54:23 -06:00
|
|
|
mlx5_fill_page_frag_array(&rq->wq_ctrl.buf,
|
|
|
|
(__be64 *)MLX5_ADDR_OF(wq, wq, pas));
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2015-06-04 10:30:37 -06:00
|
|
|
err = mlx5_core_create_rq(mdev, in, inlen, &rq->rqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
int mlx5e_modify_rq_state(struct mlx5e_rq *rq, int curr_state, int next_state)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-02-08 06:09:57 -07:00
|
|
|
struct mlx5_core_dev *mdev = rq->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
void *in;
|
|
|
|
void *rqc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_rq_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqc = MLX5_ADDR_OF(modify_rq_in, in, ctx);
|
|
|
|
|
|
|
|
MLX5_SET(modify_rq_in, in, rq_state, curr_state);
|
|
|
|
MLX5_SET(rqc, rqc, state, next_state);
|
|
|
|
|
2015-06-04 10:30:37 -06:00
|
|
|
err = mlx5_core_modify_rq(mdev, rq->rqn, in, inlen);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-02-20 07:18:17 -07:00
|
|
|
static int mlx5e_modify_rq_scatter_fcs(struct mlx5e_rq *rq, bool enable)
|
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = rq->channel;
|
|
|
|
struct mlx5e_priv *priv = c->priv;
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
void *in;
|
|
|
|
void *rqc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_rq_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2017-02-20 07:18:17 -07:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqc = MLX5_ADDR_OF(modify_rq_in, in, ctx);
|
|
|
|
|
|
|
|
MLX5_SET(modify_rq_in, in, rq_state, MLX5_RQC_STATE_RDY);
|
|
|
|
MLX5_SET64(modify_rq_in, in, modify_bitmask,
|
|
|
|
MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_SCATTER_FCS);
|
|
|
|
MLX5_SET(rqc, rqc, scatter_fcs, enable);
|
|
|
|
MLX5_SET(rqc, rqc, state, MLX5_RQC_STATE_RDY);
|
|
|
|
|
|
|
|
err = mlx5_core_modify_rq(mdev, rq->rqn, in, inlen);
|
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2016-04-24 13:51:55 -06:00
|
|
|
static int mlx5e_modify_rq_vsd(struct mlx5e_rq *rq, bool vsd)
|
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = rq->channel;
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2016-04-24 13:51:55 -06:00
|
|
|
void *in;
|
|
|
|
void *rqc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_rq_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2016-04-24 13:51:55 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqc = MLX5_ADDR_OF(modify_rq_in, in, ctx);
|
|
|
|
|
|
|
|
MLX5_SET(modify_rq_in, in, rq_state, MLX5_RQC_STATE_RDY);
|
2016-08-04 08:32:02 -06:00
|
|
|
MLX5_SET64(modify_rq_in, in, modify_bitmask,
|
|
|
|
MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_VSD);
|
2016-04-24 13:51:55 -06:00
|
|
|
MLX5_SET(rqc, rqc, vsd, vsd);
|
|
|
|
MLX5_SET(rqc, rqc, state, MLX5_RQC_STATE_RDY);
|
|
|
|
|
|
|
|
err = mlx5_core_modify_rq(mdev, rq->rqn, in, inlen);
|
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5_core_destroy_rq(rq->mdev, rq->rqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-03-28 04:26:50 -06:00
|
|
|
unsigned long exp_time = jiffies + msecs_to_jiffies(wait_time);
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_channel *c = rq->channel;
|
2017-03-14 11:43:52 -06:00
|
|
|
|
2018-04-02 08:23:14 -06:00
|
|
|
u16 min_wqes = mlx5_min_rx_wqes(rq->wq_type, mlx5e_rqwq_get_size(rq));
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2018-03-28 04:26:50 -06:00
|
|
|
do {
|
2018-04-02 08:23:14 -06:00
|
|
|
if (mlx5e_rqwq_get_cur_sz(rq) >= min_wqes)
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
msleep(20);
|
2018-03-28 04:26:50 -06:00
|
|
|
} while (time_before(jiffies, exp_time));
|
|
|
|
|
|
|
|
netdev_warn(c->netdev, "Failed to get min RX wqes on Channel[%d] RQN[0x%x] wq cur_sz(%d) min_rx_wqes(%d)\n",
|
2018-04-02 08:23:14 -06:00
|
|
|
c->ix, rq->rqn, mlx5e_rqwq_get_cur_sz(rq), min_wqes);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-06-25 12:42:27 -06:00
|
|
|
mlx5e_reporter_rx_timeout(rq);
|
2015-05-28 13:28:48 -06:00
|
|
|
return -ETIMEDOUT;
|
|
|
|
}
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
|
2016-08-28 16:13:43 -06:00
|
|
|
{
|
|
|
|
__be16 wqe_ix_be;
|
|
|
|
u16 wqe_ix;
|
|
|
|
|
2018-04-02 08:23:14 -06:00
|
|
|
if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
|
|
|
|
struct mlx5_wq_ll *wq = &rq->mpwqe.wq;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually more than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
u16 head = wq->head;
|
|
|
|
int i;
|
2018-04-02 08:23:14 -06:00
|
|
|
|
2019-02-27 03:06:08 -07:00
|
|
|
/* Outstanding UMR WQEs (in progress) start at wq->head */
|
|
|
|
for (i = 0; i < rq->mpwqe.umr_in_progress; i++) {
|
|
|
|
rq->dealloc_wqe(rq, head);
|
|
|
|
head = mlx5_wq_ll_get_wqe_next_ix(wq, head);
|
|
|
|
}
|
2018-04-02 08:23:14 -06:00
|
|
|
|
|
|
|
while (!mlx5_wq_ll_is_empty(wq)) {
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5e_rx_wqe_ll *wqe;
|
2018-04-02 08:23:14 -06:00
|
|
|
|
|
|
|
wqe_ix_be = *wq->tail_next;
|
|
|
|
wqe_ix = be16_to_cpu(wqe_ix_be);
|
|
|
|
wqe = mlx5_wq_ll_get_wqe(wq, wqe_ix);
|
|
|
|
rq->dealloc_wqe(rq, wqe_ix);
|
|
|
|
mlx5_wq_ll_pop(wq, wqe_ix_be,
|
|
|
|
&wqe->next.next_wqe_index);
|
|
|
|
}
|
|
|
|
} else {
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
2018-04-02 08:23:14 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
while (!mlx5_wq_cyc_is_empty(wq)) {
|
|
|
|
wqe_ix = mlx5_wq_cyc_get_tail(wq);
|
2018-04-02 08:23:14 -06:00
|
|
|
rq->dealloc_wqe(rq, wqe_ix);
|
2018-04-02 08:31:31 -06:00
|
|
|
mlx5_wq_cyc_pop(wq);
|
2018-04-02 08:23:14 -06:00
|
|
|
}
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page that, in most cases, is mostly
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it can,
until the page is fully consumed. In each round, the WQE is capable
of receiving a packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round (an
illustrative sketch of this bookkeeping follows this function).
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing support for HW LRO in the legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
|
2016-08-28 16:13:43 -06:00
|
|
|
}
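The page-reuse bookkeeping described in the Page-Reuse commit message above can be illustrated with a small standalone sketch: after each received packet, advance an offset within the WQE's page and fall back to a fresh page only once the remaining room can no longer hold a maximum-size packet. All names, the fixed 4KB page size, and the exact reuse policy below are illustrative assumptions, not the driver's implementation.

#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u

struct demo_wqe_frag {
	void   *page;    /* backing page of this RX WQE */
	size_t  offset;  /* next free byte within the page */
};

/* Consume 'len' bytes for the packet just received; return true if the
 * page still has room for another maximum-size (MTU) packet and can be
 * reposted as-is, false if the caller must attach a fresh page.
 */
static bool demo_page_reuse(struct demo_wqe_frag *frag, size_t len,
			    size_t max_pkt_size)
{
	frag->offset += len;
	if (DEMO_PAGE_SIZE - frag->offset >= max_pkt_size)
		return true;   /* reuse the same page on the next round */

	frag->offset = 0;      /* page exhausted: start over on a new one */
	return false;
}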
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs run simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
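The ID-namespace split described in the commit message above (lower half of the queue-index space for regular RQs, upper half for XSK RQs) can be shown with a minimal standalone sketch. The helper name and the exact decoding are assumptions for illustration only, not the driver's code.

#include <stdbool.h>
#include <stdio.h>

/* Decode a flat queue index into (channel, is_xsk), assuming indices
 * [0, num_channels) are regular RQs and [num_channels, 2*num_channels)
 * are XSK RQs, as the commit message describes. Hypothetical helper.
 */
static void demo_decode_qid(unsigned int qid, unsigned int num_channels,
			    unsigned int *channel, bool *is_xsk)
{
	*is_xsk  = qid >= num_channels;
	*channel = *is_xsk ? qid - num_channels : qid;
}

int main(void)
{
	unsigned int ch;
	bool xsk;

	demo_decode_qid(5, 4, &ch, &xsk);  /* upper half: XSK RQ of channel 1 */
	printf("channel %u, xsk %d\n", ch, xsk);
	return 0;
}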
|
|
|
int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk,
|
|
|
|
struct xdp_umem *umem, struct mlx5e_rq *rq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_alloc_rq(c, params, xsk, umem, param, rq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} used to actually create/destroy the SQ
in FW, so we rename them to align the function names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
err = mlx5e_create_rq(rq, param);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_free_rq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-04-24 13:51:55 -06:00
|
|
|
err = mlx5e_modify_rq_state(rq, MLX5_RQC_STATE_RST, MLX5_RQC_STATE_RDY);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_destroy_rq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-05-03 14:14:59 -06:00
|
|
|
if (MLX5_CAP_ETH(c->mdev, cqe_checksum_full))
|
|
|
|
__set_bit(MLX5E_RQ_STATE_CSUM_FULL, &c->rq.state);
|
|
|
|
|
2018-01-09 14:06:17 -07:00
|
|
|
if (params->rx_dim_enabled)
|
2018-01-23 03:27:11 -07:00
|
|
|
__set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
|
2016-06-23 08:02:41 -06:00
|
|
|
|
2019-03-21 20:07:20 -06:00
|
|
|
/* We disable csum_complete when XDP is enabled since
|
|
|
|
* XDP programs might manipulate packets which will render
|
|
|
|
* skb->checksum incorrect.
|
|
|
|
*/
|
|
|
|
if (MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_NO_CSUM_COMPLETE) || c->xdp)
|
2018-07-01 02:58:38 -06:00
|
|
|
__set_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &c->rq.state);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_destroy_rq:
|
|
|
|
mlx5e_destroy_rq(rq);
|
2017-03-24 15:52:12 -06:00
|
|
|
err_free_rq:
|
|
|
|
mlx5e_free_rq(rq);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
void mlx5e_activate_rq(struct mlx5e_rq *rq)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
|
|
|
set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);
|
2019-03-01 03:05:21 -07:00
|
|
|
mlx5e_trigger_irq(&rq->channel->icosq);
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_deactivate_rq(struct mlx5e_rq *rq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2016-12-06 08:32:48 -07:00
|
|
|
clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);
|
2015-05-28 13:28:48 -06:00
|
|
|
napi_synchronize(&rq->channel->napi); /* prevent mlx5e_post_rx_wqes */
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
2016-06-23 08:02:41 -06:00
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_close_rq(struct mlx5e_rq *rq)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
2018-01-09 14:06:17 -07:00
|
|
|
cancel_work_sync(&rq->dim.work);
|
2019-06-25 08:44:28 -06:00
|
|
|
cancel_work_sync(&rq->channel->icosq.recover_work);
|
2019-06-26 14:21:40 -06:00
|
|
|
cancel_work_sync(&rq->recover_work);
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5e_destroy_rq(rq);
|
2017-03-24 15:52:12 -06:00
|
|
|
mlx5e_free_rx_descs(rq);
|
|
|
|
mlx5e_free_rq(rq);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
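Taken together, the RQ helpers above form a simple lifecycle: open (alloc + create + move to RDY), activate, deactivate, close. The sketch below shows the expected calling order from a channel-open path; it is not literal driver code, error unwinding is omitted, and passing NULL for the XSK parameters is assumed here to select the regular (non-XSK) path.

/* Sketch only: RQ lifecycle as a caller would drive it. */
static int demo_rq_lifecycle(struct mlx5e_channel *c,
			     struct mlx5e_params *params,
			     struct mlx5e_rq_param *rq_param,
			     struct mlx5e_rq *rq)
{
	int err;

	err = mlx5e_open_rq(c, params, rq_param, NULL, NULL, rq);
	if (err)
		return err;

	mlx5e_activate_rq(rq);        /* set ENABLED and kick NAPI via the ICOSQ */

	/* ... RX traffic is processed ... */

	mlx5e_deactivate_rq(rq);      /* clear ENABLED, wait for NAPI to settle */
	mlx5e_close_rq(rq);           /* destroy RQ, drain descriptors, free */
	return 0;
}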
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static void mlx5e_free_xdpsq_db(struct mlx5e_xdpsq *sq)
|
2016-09-21 03:19:48 -06:00
|
|
|
{
|
2018-10-14 05:37:48 -06:00
|
|
|
kvfree(sq->db.xdpi_fifo.xi);
|
2018-10-14 05:46:57 -06:00
|
|
|
kvfree(sq->db.wqe_info);
|
2018-10-14 05:37:48 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_alloc_xdpsq_fifo(struct mlx5e_xdpsq *sq, int numa)
|
|
|
|
{
|
|
|
|
struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
|
|
|
|
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
|
|
|
|
int dsegs_per_wq = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
|
|
|
|
|
|
|
|
xdpi_fifo->xi = kvzalloc_node(sizeof(*xdpi_fifo->xi) * dsegs_per_wq,
|
|
|
|
GFP_KERNEL, numa);
|
|
|
|
if (!xdpi_fifo->xi)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
xdpi_fifo->pc = &sq->xdpi_fifo_pc;
|
|
|
|
xdpi_fifo->cc = &sq->xdpi_fifo_cc;
|
|
|
|
xdpi_fifo->mask = dsegs_per_wq - 1;
|
|
|
|
|
|
|
|
return 0;
|
2016-09-21 03:19:48 -06:00
|
|
|
}
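mlx5e_alloc_xdpsq_fifo() above sizes the xdpi FIFO to wq_sz * MLX5_SEND_WQEBB_NUM_DS entries and stores mask = size - 1, i.e. a power-of-two ring addressed by free-running producer (pc) and consumer (cc) counters. A minimal standalone sketch of that indexing scheme follows; the names are illustrative and this is not the driver's actual push/pop API.

#include <stdint.h>
#include <stdio.h>

struct demo_fifo {
	int      slots[8];  /* size must be a power of two */
	uint16_t pc;        /* producer counter, free-running */
	uint16_t cc;        /* consumer counter, free-running */
	uint16_t mask;      /* size - 1 */
};

static void demo_fifo_push(struct demo_fifo *f, int v)
{
	f->slots[f->pc++ & f->mask] = v;  /* wrap via the mask, not modulo */
}

static int demo_fifo_pop(struct demo_fifo *f)
{
	return f->slots[f->cc++ & f->mask];
}

int main(void)
{
	struct demo_fifo f = { .mask = 7 };
	int a, b;

	demo_fifo_push(&f, 123);
	demo_fifo_push(&f, 456);
	a = demo_fifo_pop(&f);
	b = demo_fifo_pop(&f);
	printf("popped %d then %d\n", a, b);
	return 0;
}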
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_xdpsq_db(struct mlx5e_xdpsq *sq, int numa)
|
2016-09-21 03:19:48 -06:00
|
|
|
{
|
2018-10-14 05:46:57 -06:00
|
|
|
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
|
2018-10-14 05:37:48 -06:00
|
|
|
int err;
|
2016-09-21 03:19:48 -06:00
|
|
|
|
2018-10-14 05:46:57 -06:00
|
|
|
sq->db.wqe_info = kvzalloc_node(sizeof(*sq->db.wqe_info) * wq_sz,
|
|
|
|
GFP_KERNEL, numa);
|
|
|
|
if (!sq->db.wqe_info)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-10-14 05:37:48 -06:00
|
|
|
err = mlx5e_alloc_xdpsq_fifo(sq, numa);
|
|
|
|
if (err) {
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_xdpsq_db(sq);
|
2018-10-14 05:37:48 -06:00
|
|
|
return err;
|
2016-09-21 03:19:48 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2019-06-26 08:35:38 -06:00
|
|
|
struct xdp_umem *umem,
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_sq_param *param,
|
2018-05-22 07:48:48 -06:00
|
|
|
struct mlx5e_xdpsq *sq,
|
|
|
|
bool is_redirect)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
|
|
|
void *sqc_wq = MLX5_ADDR_OF(sqc, param->sqc, wq);
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2018-05-02 09:30:56 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &sq->wq;
|
2017-03-24 15:52:14 -06:00
|
|
|
int err;
|
|
|
|
|
|
|
|
sq->pdev = c->pdev;
|
|
|
|
sq->mkey_be = c->mkey_be;
|
|
|
|
sq->channel = c;
|
|
|
|
sq->uar_map = mdev->mlx5e_res.bfreg.map;
|
2016-12-21 08:24:35 -07:00
|
|
|
sq->min_inline_mode = params->tx_min_inline_mode;
|
2018-07-15 01:34:39 -06:00
|
|
|
sq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
|
2019-06-26 08:35:38 -06:00
|
|
|
sq->umem = umem;
|
|
|
|
|
|
|
|
sq->stats = sq->umem ?
|
|
|
|
&c->priv->channel_stats[c->ix].xsksq :
|
|
|
|
is_redirect ?
|
|
|
|
&c->priv->channel_stats[c->ix].xdpsq :
|
|
|
|
&c->priv->channel_stats[c->ix].rq_xdpsq;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
param->wq.db_numa_node = cpu_to_node(c->cpu);
|
2018-05-02 09:30:56 -06:00
|
|
|
err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
2018-05-02 09:30:56 -06:00
|
|
|
wq->db = &wq->db[MLX5_SND_DBR];
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
err = mlx5e_alloc_xdpsq_db(sq, cpu_to_node(c->cpu));
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_sq_wq_destroy;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_sq_wq_destroy:
|
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_free_xdpsq(struct mlx5e_xdpsq *sq)
|
|
|
|
{
|
|
|
|
mlx5e_free_xdpsq_db(sq);
|
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_free_icosq_db(struct mlx5e_icosq *sq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(sq->db.ico_wqe);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_icosq_db(struct mlx5e_icosq *sq, int numa)
|
2016-09-21 03:19:47 -06:00
|
|
|
{
|
2019-02-27 03:06:08 -07:00
|
|
|
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
|
2016-09-21 03:19:47 -06:00
|
|
|
|
2018-07-04 11:28:47 -06:00
|
|
|
sq->db.ico_wqe = kvzalloc_node(array_size(wq_sz,
|
|
|
|
sizeof(*sq->db.ico_wqe)),
|
2018-06-05 02:47:04 -06:00
|
|
|
GFP_KERNEL, numa);
|
2016-09-21 03:19:47 -06:00
|
|
|
if (!sq->db.ico_wqe)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
static void mlx5e_icosq_err_cqe_work(struct work_struct *recover_work)
|
|
|
|
{
|
|
|
|
struct mlx5e_icosq *sq = container_of(recover_work, struct mlx5e_icosq,
|
|
|
|
recover_work);
|
|
|
|
|
|
|
|
mlx5e_reporter_icosq_cqe_err(sq);
|
|
|
|
}
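mlx5e_icosq_err_cqe_work() above recovers the owning ICOSQ from the work_struct embedded in it, which is the standard container_of() pattern: given a pointer to an embedded member, compute the address of the enclosing structure. A standalone illustration of the idiom (plain-C re-implementation for demonstration; the kernel provides its own container_of macro):

#include <stddef.h>
#include <stdio.h>

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_work { int pending; };

struct demo_sq {
	int sqn;
	struct demo_work recover_work;  /* embedded, like sq->recover_work */
};

static void demo_recover(struct demo_work *work)
{
	struct demo_sq *sq = demo_container_of(work, struct demo_sq, recover_work);

	printf("recovering sq %d\n", sq->sqn);
}

int main(void)
{
	struct demo_sq sq = { .sqn = 42 };

	demo_recover(&sq.recover_work);  /* pass only the embedded member */
	return 0;
}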
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_icosq(struct mlx5e_channel *c,
|
|
|
|
struct mlx5e_sq_param *param,
|
|
|
|
struct mlx5e_icosq *sq)
|
2016-09-21 03:19:47 -06:00
|
|
|
{
|
2017-03-24 15:52:14 -06:00
|
|
|
void *sqc_wq = MLX5_ADDR_OF(sqc, param->sqc, wq);
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2018-05-02 09:30:56 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &sq->wq;
|
2017-03-24 15:52:14 -06:00
|
|
|
int err;
|
2016-09-21 03:19:47 -06:00
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
sq->channel = c;
|
|
|
|
sq->uar_map = mdev->mlx5e_res.bfreg.map;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
param->wq.db_numa_node = cpu_to_node(c->cpu);
|
2018-05-02 09:30:56 -06:00
|
|
|
err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
2018-05-02 09:30:56 -06:00
|
|
|
wq->db = &wq->db[MLX5_SND_DBR];
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
err = mlx5e_alloc_icosq_db(sq, cpu_to_node(c->cpu));
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_sq_wq_destroy;
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
INIT_WORK(&sq->recover_work, mlx5e_icosq_err_cqe_work);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
|
|
|
err_sq_wq_destroy:
|
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
|
|
|
|
|
|
|
return err;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static void mlx5e_free_icosq(struct mlx5e_icosq *sq)
|
2016-09-21 03:19:47 -06:00
|
|
|
{
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_icosq_db(sq);
|
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
2016-09-21 03:19:47 -06:00
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
|
2016-09-21 03:19:47 -06:00
|
|
|
{
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(sq->db.wqe_info);
|
|
|
|
kvfree(sq->db.dma_fifo);
|
2016-09-21 03:19:47 -06:00
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
|
2016-09-21 03:19:48 -06:00
|
|
|
{
|
2017-03-24 15:52:14 -06:00
|
|
|
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
|
|
|
|
int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
|
|
|
|
|
2018-07-04 11:28:47 -06:00
|
|
|
sq->db.dma_fifo = kvzalloc_node(array_size(df_sz,
|
|
|
|
sizeof(*sq->db.dma_fifo)),
|
2018-06-05 02:47:04 -06:00
|
|
|
GFP_KERNEL, numa);
|
2018-07-04 11:28:47 -06:00
|
|
|
sq->db.wqe_info = kvzalloc_node(array_size(wq_sz,
|
|
|
|
sizeof(*sq->db.wqe_info)),
|
2018-06-05 02:47:04 -06:00
|
|
|
GFP_KERNEL, numa);
|
2017-04-12 21:37:01 -06:00
|
|
|
if (!sq->db.dma_fifo || !sq->db.wqe_info) {
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_txqsq_db(sq);
|
|
|
|
return -ENOMEM;
|
2016-09-21 03:19:48 -06:00
|
|
|
}
|
2017-03-24 15:52:14 -06:00
|
|
|
|
|
|
|
sq->dma_fifo_mask = df_sz - 1;
|
|
|
|
|
|
|
|
return 0;
|
2016-09-21 03:19:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add tx reporter support
Add mlx5e tx reporter to devlink health reporters. This reporter will be
responsible for diagnosing, reporting, and recovering from TX errors.
This patch declares the TX reporter operations and creates it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQEs only. In addition, it adds diagnose
information for the open SQs.
For a local SQ recover (triggered by a driver error report), a failure to
recover the SQ is reported as a failed recover operation.
For a full TX recover, an attempt is made to close and reopen the
channels; if that succeeds, the recover is considered successful.
The SQ recover from error CQE flow is not a new feature in the driver;
this patch re-organizes the functions and adapts them for the devlink
health API. For this purpose, move code from en_main.c to a new file
named reporter_tx.c.
Diagnose output:
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": false
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: false
sqn: 142 HW state: 1 stopped: false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 02:36:40 -07:00
|
|
|
static void mlx5e_tx_err_cqe_work(struct work_struct *recover_work);
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
|
2016-12-20 13:48:19 -07:00
|
|
|
int txq_ix,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_sq_param *param,
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_txqsq *sq,
|
|
|
|
int tc)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-24 15:52:14 -06:00
|
|
|
void *sqc_wq = MLX5_ADDR_OF(sqc, param->sqc, wq);
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2018-05-02 09:30:56 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &sq->wq;
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
|
|
|
|
2016-09-21 03:19:47 -06:00
|
|
|
sq->pdev = c->pdev;
|
2017-03-14 11:43:52 -06:00
|
|
|
sq->tstamp = c->tstamp;
|
2017-08-15 04:46:04 -06:00
|
|
|
sq->clock = &mdev->clock;
|
2016-09-21 03:19:47 -06:00
|
|
|
sq->mkey_be = c->mkey_be;
|
|
|
|
sq->channel = c;
|
2019-04-28 01:14:23 -06:00
|
|
|
sq->ch_ix = c->ix;
|
2016-12-20 13:48:19 -07:00
|
|
|
sq->txq_ix = txq_ix;
|
2017-03-24 15:52:05 -06:00
|
|
|
sq->uar_map = mdev->mlx5e_res.bfreg.map;
|
2016-12-21 08:24:35 -07:00
|
|
|
sq->min_inline_mode = params->tx_min_inline_mode;
|
2019-10-07 05:01:29 -06:00
|
|
|
sq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
|
2018-04-12 07:03:37 -06:00
|
|
|
sq->stats = &c->priv->channel_stats[c->ix].sq[tc];
|
2019-07-05 09:30:19 -06:00
|
|
|
sq->stop_room = MLX5E_SQ_STOP_ROOM;
|
2019-02-07 02:36:40 -07:00
|
|
|
INIT_WORK(&sq->recover_work, mlx5e_tx_err_cqe_work);
|
2019-07-01 03:08:08 -06:00
|
|
|
if (!MLX5_CAP_ETH(mdev, wqe_vlan_insert))
|
|
|
|
set_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state);
|
2017-04-18 07:08:23 -06:00
|
|
|
if (MLX5_IPSEC_DEV(c->priv->mdev))
|
|
|
|
set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
|
2019-10-07 05:01:29 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_TLS
|
2019-07-05 09:30:19 -06:00
|
|
|
if (mlx5_accel_is_tls_device(c->priv->mdev)) {
|
2018-04-30 01:16:20 -06:00
|
|
|
set_bit(MLX5E_SQ_STATE_TLS, &sq->state);
|
2019-10-07 05:01:29 -06:00
|
|
|
sq->stop_room += MLX5E_SQ_TLS_ROOM +
|
|
|
|
mlx5e_ktls_dumps_num_wqebbs(sq, MAX_SKB_FRAGS,
|
|
|
|
TLS_MAX_PAYLOAD_SIZE);
|
2019-07-05 09:30:19 -06:00
|
|
|
}
|
2019-10-07 05:01:29 -06:00
|
|
|
#endif
|
2016-09-21 03:19:47 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
param->wq.db_numa_node = cpu_to_node(c->cpu);
|
2018-05-02 09:30:56 -06:00
|
|
|
err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, wq, &sq->wq_ctrl);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:05 -06:00
|
|
|
return err;
|
2018-05-02 09:30:56 -06:00
|
|
|
wq->db = &wq->db[MLX5_SND_DBR];
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
err = mlx5e_alloc_txqsq_db(sq, cpu_to_node(c->cpu));
|
2015-06-11 02:50:01 -06:00
|
|
|
if (err)
|
2015-05-28 13:28:48 -06:00
|
|
|
goto err_sq_wq_destroy;
|
|
|
|
|
2018-04-24 04:36:03 -06:00
|
|
|
INIT_WORK(&sq->dim.work, mlx5e_tx_dim_work);
|
|
|
|
sq->dim.mode = params->tx_cq_moderation.cq_period_mode;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_sq_wq_destroy:
|
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static void mlx5e_free_txqsq(struct mlx5e_txqsq *sq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_txqsq_db(sq);
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5_wq_destroy(&sq->wq_ctrl);
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
struct mlx5e_create_sq_param {
|
|
|
|
struct mlx5_wq_ctrl *wq_ctrl;
|
|
|
|
u32 cqn;
|
|
|
|
u32 tisn;
|
|
|
|
u8 tis_lst_sz;
|
|
|
|
u8 min_inline_mode;
|
|
|
|
};
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
static int mlx5e_create_sq(struct mlx5_core_dev *mdev,
|
2017-03-24 15:52:13 -06:00
|
|
|
struct mlx5e_sq_param *param,
|
|
|
|
struct mlx5e_create_sq_param *csp,
|
|
|
|
u32 *sqn)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
void *in;
|
|
|
|
void *sqc;
|
|
|
|
void *wq;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_sq_in) +
|
2017-03-24 15:52:13 -06:00
|
|
|
sizeof(u64) * csp->wq_ctrl->buf.npages;
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
sqc = MLX5_ADDR_OF(create_sq_in, in, ctx);
|
|
|
|
wq = MLX5_ADDR_OF(sqc, sqc, wq);
|
|
|
|
|
|
|
|
memcpy(sqc, param->sqc, sizeof(param->sqc));
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(sqc, sqc, tis_lst_sz, csp->tis_lst_sz);
|
|
|
|
MLX5_SET(sqc, sqc, tis_num_0, csp->tisn);
|
|
|
|
MLX5_SET(sqc, sqc, cqn, csp->cqn);
|
2016-12-06 04:53:49 -07:00
|
|
|
|
|
|
|
if (MLX5_CAP_ETH(mdev, wqe_inline_mode) == MLX5_CAP_INLINE_MODE_VPORT_CONTEXT)
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(sqc, sqc, min_wqe_inline_mode, csp->min_inline_mode);
|
2016-12-06 04:53:49 -07:00
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(sqc, sqc, state, MLX5_SQC_STATE_RST);
|
2017-12-26 07:02:24 -07:00
|
|
|
MLX5_SET(sqc, sqc, flush_in_error_en, 1);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_CYCLIC);
|
2017-03-14 11:43:52 -06:00
|
|
|
MLX5_SET(wq, wq, uar_page, mdev->mlx5e_res.bfreg.index);
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(wq, wq, log_wq_pg_sz, csp->wq_ctrl->buf.page_shift -
|
2015-07-29 06:05:40 -06:00
|
|
|
MLX5_ADAPTER_PAGE_SHIFT);
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET64(wq, wq, dbr_addr, csp->wq_ctrl->db.dma);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2018-04-04 03:54:23 -06:00
|
|
|
mlx5_fill_page_frag_array(&csp->wq_ctrl->buf,
|
|
|
|
(__be64 *)MLX5_ADDR_OF(wq, wq, pas));
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
err = mlx5_core_create_sq(mdev, in, inlen, sqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-02-07 02:36:40 -07:00
|
|
|
int mlx5e_modify_sq(struct mlx5_core_dev *mdev, u32 sqn,
|
|
|
|
struct mlx5e_modify_sq_param *p)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
void *in;
|
|
|
|
void *sqc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_sq_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
sqc = MLX5_ADDR_OF(modify_sq_in, in, ctx);
|
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(modify_sq_in, in, sq_state, p->curr_state);
|
|
|
|
MLX5_SET(sqc, sqc, state, p->next_state);
|
|
|
|
if (p->rl_update && p->next_state == MLX5_SQC_STATE_RDY) {
|
2016-06-23 08:02:38 -06:00
|
|
|
MLX5_SET64(modify_sq_in, in, modify_bitmask, 1);
|
2017-03-24 15:52:13 -06:00
|
|
|
MLX5_SET(sqc, sqc, packet_pacing_rate_limit_index, p->rl_index);
|
2016-06-23 08:02:38 -06:00
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
err = mlx5_core_modify_sq(mdev, sqn, in, inlen);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
static void mlx5e_destroy_sq(struct mlx5_core_dev *mdev, u32 sqn)
|
2017-03-24 15:52:13 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5_core_destroy_sq(mdev, sqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
static int mlx5e_create_sq_rdy(struct mlx5_core_dev *mdev,
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_sq_param *param,
|
|
|
|
struct mlx5e_create_sq_param *csp,
|
|
|
|
u32 *sqn)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-24 15:52:13 -06:00
|
|
|
struct mlx5e_modify_sq_param msp = {0};
|
2017-03-24 15:52:14 -06:00
|
|
|
int err;
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_sq(mdev, param, csp, sqn);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
msp.curr_state = MLX5_SQC_STATE_RST;
|
|
|
|
msp.next_state = MLX5_SQC_STATE_RDY;
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_modify_sq(mdev, *sqn, &msp);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_sq(mdev, *sqn);
|
2017-03-24 15:52:14 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
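Besides the RST-to-RDY transition shown in mlx5e_create_sq_rdy() above, mlx5e_modify_sq() also carries the rate-limit update path guarded by rl_update. The sketch below shows how a caller such as mlx5e_set_sq_maxrate() would be expected to attach a packet-pacing rate-limit index to an SQ that is already in RDY state; it is a sketch under assumptions (field widths and the origin of rl_index are illustrative), not literal driver code.

/* Sketch: RDY -> RDY modify that only updates the packet-pacing index. */
static int demo_sq_set_rate_index(struct mlx5_core_dev *mdev, u32 sqn,
				  u16 rl_index)
{
	struct mlx5e_modify_sq_param msp = {0};

	msp.curr_state = MLX5_SQC_STATE_RDY;
	msp.next_state = MLX5_SQC_STATE_RDY;
	msp.rl_update  = true;      /* take the packet-pacing branch */
	msp.rl_index   = rl_index;  /* index assumed to come from the mlx5 rate table */

	return mlx5e_modify_sq(mdev, sqn, &msp);
}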
|
|
|
|
|
2016-11-14 04:42:02 -07:00
|
|
|
static int mlx5e_set_sq_maxrate(struct net_device *dev,
|
|
|
|
struct mlx5e_txqsq *sq, u32 rate);
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
static int mlx5e_open_txqsq(struct mlx5e_channel *c,
|
2017-03-14 11:43:52 -06:00
|
|
|
u32 tisn,
|
2016-12-20 13:48:19 -07:00
|
|
|
int txq_ix,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_sq_param *param,
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_txqsq *sq,
|
|
|
|
int tc)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_create_sq_param csp = {};
|
2016-11-14 04:42:02 -07:00
|
|
|
u32 tx_rate;
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
|
|
|
|
2018-04-12 07:03:37 -06:00
|
|
|
err = mlx5e_alloc_txqsq(c, txq_ix, params, param, sq, tc);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
csp.tisn = tisn;
|
2017-03-24 15:52:14 -06:00
|
|
|
csp.tis_lst_sz = 1;
|
2017-03-24 15:52:13 -06:00
|
|
|
csp.cqn = sq->cq.mcq.cqn;
|
|
|
|
csp.wq_ctrl = &sq->wq_ctrl;
|
|
|
|
csp.min_inline_mode = sq->min_inline_mode;
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_sq_rdy(c->mdev, param, &csp, &sq->sqn);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:14 -06:00
|
|
|
goto err_free_txqsq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
tx_rate = c->priv->tx_rates[sq->txq_ix];
|
2016-11-14 04:42:02 -07:00
|
|
|
if (tx_rate)
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_set_sq_maxrate(c->netdev, sq, tx_rate);
|
2016-11-14 04:42:02 -07:00
|
|
|
|
2018-04-24 04:36:03 -06:00
|
|
|
if (params->tx_dim_enabled)
|
|
|
|
sq->state |= BIT(MLX5E_SQ_STATE_AM);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
err_free_txqsq:
|
|
|
|
mlx5e_free_txqsq(sq);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-02-07 02:36:40 -07:00
|
|
|
void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
sq->txq = netdev_get_tx_queue(sq->channel->netdev, sq->txq_ix);
|
2016-12-20 13:48:19 -07:00
|
|
|
set_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
|
|
|
|
netdev_tx_reset_queue(sq->txq);
|
|
|
|
netif_tx_start_queue(sq->txq);
|
|
|
|
}
|
|
|
|
|
2019-02-07 02:36:40 -07:00
|
|
|
void mlx5e_tx_disable_queue(struct netdev_queue *txq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
__netif_tx_lock_bh(txq);
|
|
|
|
netif_tx_stop_queue(txq);
|
|
|
|
__netif_tx_unlock_bh(txq);
|
|
|
|
}
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
static void mlx5e_deactivate_txqsq(struct mlx5e_txqsq *sq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-24 15:52:13 -06:00
|
|
|
struct mlx5e_channel *c = sq->channel;
|
2018-05-02 09:30:56 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &sq->wq;
|
2017-03-24 15:52:13 -06:00
|
|
|
|
2016-12-06 08:32:48 -07:00
|
|
|
clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
|
2016-08-28 16:13:45 -06:00
|
|
|
/* prevent netif_tx_wake_queue */
|
2017-03-24 15:52:13 -06:00
|
|
|
napi_synchronize(&c->napi);
|
2016-06-30 08:34:44 -06:00
|
|
|
|
2019-02-07 02:36:40 -07:00
|
|
|
mlx5e_tx_disable_queue(sq->txq);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
/* last doorbell out, godspeed .. */
|
2018-05-02 09:30:56 -06:00
|
|
|
if (mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, 1)) {
|
|
|
|
u16 pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
|
2019-09-16 08:43:33 -06:00
|
|
|
struct mlx5e_tx_wqe_info *wi;
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_tx_wqe *nop;
|
2017-03-24 15:52:11 -06:00
|
|
|
|
2019-09-16 08:43:33 -06:00
|
|
|
wi = &sq->db.wqe_info[pi];
|
|
|
|
|
|
|
|
memset(wi, 0, sizeof(*wi));
|
|
|
|
wi->num_wqebbs = 1;
|
2018-05-02 09:30:56 -06:00
|
|
|
nop = mlx5e_post_nop(wq, sq->sqn, &sq->pc);
|
|
|
|
mlx5e_notify_hw(wq, sq->pc, sq->uar_map, &nop->ctrl);
|
2016-06-30 08:34:44 -06:00
|
|
|
}
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_close_txqsq(struct mlx5e_txqsq *sq)
|
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = sq->channel;
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2018-03-19 07:10:29 -06:00
|
|
|
struct mlx5_rate_limit rl = {0};
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2018-11-22 05:20:45 -07:00
|
|
|
cancel_work_sync(&sq->dim.work);
|
2019-02-07 02:36:40 -07:00
|
|
|
cancel_work_sync(&sq->recover_work);
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_sq(mdev, sq->sqn);
|
2018-03-19 07:10:29 -06:00
|
|
|
if (sq->rate_limit) {
|
|
|
|
rl.rate = sq->rate_limit;
|
|
|
|
mlx5_rl_remove_rate(mdev, &rl);
|
|
|
|
}
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_txqsq_descs(sq);
|
|
|
|
mlx5e_free_txqsq(sq);
|
|
|
|
}
|
|
|
|
|
2019-02-07 02:36:40 -07:00
|
|
|
static void mlx5e_tx_err_cqe_work(struct work_struct *recover_work)
|
2017-12-26 07:02:24 -07:00
|
|
|
{
|
2019-02-07 02:36:40 -07:00
|
|
|
struct mlx5e_txqsq *sq = container_of(recover_work, struct mlx5e_txqsq,
|
|
|
|
recover_work);
|
2019-01-25 11:53:23 -07:00
|
|
|
|
2019-07-01 06:51:51 -06:00
|
|
|
mlx5e_reporter_tx_err_cqe(sq);
|
2017-12-26 07:02:24 -07:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs run simultaneously: one for non-XSK traffic and the other for XSK
traffic. The regular and XSK RQs use a single ID namespace split into
two halves: the lower half is regular RQs, and the upper half is XSK
RQs (this split is sketched just after this description). When any
zero-copy AF_XDP socket is active, changing the number of channels is
not allowed, because it would break the mapping between XSK RQ IDs and
channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used, to avoid losing performance
to retpolines. Wherever it's certain that the regular (non-XSK) page
release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be re-enabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
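As a rough illustration of the split queue-ID namespace described above
(the helpers below are hypothetical, not the driver's actual ones): with
'count' channels, IDs 0..count-1 address the regular RQs and IDs
count..2*count-1 address the XSK RQs of the same channels.

#include <linux/types.h>

/* Illustrative sketch only (hypothetical helpers): the lower half of the
 * queue-ID namespace maps to regular RQs, the upper half to the XSK RQs
 * of the same channels, which is why the channel count cannot change
 * while zero-copy sockets are active.
 */
static inline bool sketch_qid_is_xsk(u16 count, u16 qid)
{
	return qid >= count;
}

static inline u16 sketch_qid_to_channel_ix(u16 count, u16 qid)
{
	return sketch_qid_is_xsk(count, qid) ? qid - count : qid;
}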
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_sq_param *param, struct mlx5e_icosq *sq)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_create_sq_param csp = {};
|
|
|
|
int err;
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_alloc_icosq(c, param, sq);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
csp.cqn = sq->cq.mcq.cqn;
|
|
|
|
csp.wq_ctrl = &sq->wq_ctrl;
|
2016-12-21 08:24:35 -07:00
|
|
|
csp.min_inline_mode = params->tx_min_inline_mode;
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_sq_rdy(c->mdev, param, &csp, &sq->sqn);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_free_icosq;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_free_icosq:
|
|
|
|
mlx5e_free_icosq(sq);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
void mlx5e_activate_icosq(struct mlx5e_icosq *icosq)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
2019-07-02 06:47:29 -06:00
|
|
|
set_bit(MLX5E_SQ_STATE_ENABLED, &icosq->state);
|
|
|
|
}
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2019-06-25 08:44:28 -06:00
|
|
|
void mlx5e_deactivate_icosq(struct mlx5e_icosq *icosq)
|
2019-07-02 06:47:29 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = icosq->channel;
|
|
|
|
|
|
|
|
clear_bit(MLX5E_SQ_STATE_ENABLED, &icosq->state);
|
2017-03-24 15:52:14 -06:00
|
|
|
napi_synchronize(&c->napi);
|
2019-07-02 06:47:29 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
void mlx5e_close_icosq(struct mlx5e_icosq *sq)
|
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = sq->channel;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_sq(c->mdev, sq->sqn);
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_icosq(sq);
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_sq_param *param, struct xdp_umem *umem,
|
|
|
|
struct mlx5e_xdpsq *sq, bool is_redirect)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_create_sq_param csp = {};
|
|
|
|
int err;
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_alloc_xdpsq(c, params, umem, param, sq, is_redirect);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
csp.tis_lst_sz = 1;
|
2019-08-07 08:46:15 -06:00
|
|
|
csp.tisn = c->priv->tisn[c->lag_port][0]; /* tc = 0 */
|
2017-03-24 15:52:14 -06:00
|
|
|
csp.cqn = sq->cq.mcq.cqn;
|
|
|
|
csp.wq_ctrl = &sq->wq_ctrl;
|
|
|
|
csp.min_inline_mode = sq->min_inline_mode;
|
|
|
|
set_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_sq_rdy(c->mdev, param, &csp, &sq->sqn);
|
2017-03-24 15:52:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_free_xdpsq;
|
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
mlx5e_set_xmit_fp(sq, param->is_mpw);
|
|
|
|
|
|
|
|
if (!param->is_mpw) {
|
|
|
|
unsigned int ds_cnt = MLX5E_XDP_TX_DS_COUNT;
|
|
|
|
unsigned int inline_hdr_sz = 0;
|
|
|
|
int i;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
|
|
|
|
inline_hdr_sz = MLX5E_XDP_MIN_INLINE;
|
|
|
|
ds_cnt++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Pre initialize fixed WQE fields */
|
|
|
|
for (i = 0; i < mlx5_wq_cyc_get_size(&sq->wq); i++) {
|
|
|
|
struct mlx5e_xdp_wqe_info *wi = &sq->db.wqe_info[i];
|
|
|
|
struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(&sq->wq, i);
|
|
|
|
struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
|
|
|
|
struct mlx5_wqe_eth_seg *eseg = &wqe->eth;
|
|
|
|
struct mlx5_wqe_data_seg *dseg;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | ds_cnt);
|
|
|
|
eseg->inline_hdr.sz = cpu_to_be16(inline_hdr_sz);
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
dseg = (struct mlx5_wqe_data_seg *)cseg + (ds_cnt - 1);
|
|
|
|
dseg->lkey = sq->mkey_be;
|
2018-10-14 05:46:57 -06:00
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
wi->num_wqebbs = 1;
|
net/mlx5e: XDP, Inline small packets into the TX MPWQE in XDP xmit flow
Under high packet rate with multi-CPU TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, which limits
transmission rates.
This patch mitigates the problem by moving some of the work to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).
When forwarding packets with XDP, a packet smaller than a certain size
(set to ~256 bytes) is sent inline within its WQE TX descriptor
(mem-copied) when the hardware tx queue is congested beyond a
pre-defined watermark (a rough sketch of this policy follows
mlx5e_open_xdpsq below).
This better utilizes the HW resources (one less packet data prefetch)
and allows better scalability, at the expense of CPU usage (which now
memcpy's the packet into the WQE).
To load-balance between HW and CPU and get the maximum packet rate, we
use watermarks to detect how congested the HW is and move the workload
back and forth between HW and CPU.
Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
* Tested with hyper-threading disabled
XDP_TX:
| | before | after | |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring | 12Mpps | 12Mpps | same |
XDP_REDIRECT:
** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.
| | before | after | |
| 32 rings | 64Mpps | 92Mpps | +43% |
| 1 ring | 6.4Mpps | 6.4Mpps | same |
As we can see, the feature significantly improves scaling without
hurting single-ring performance.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-14 06:54:07 -06:00
|
|
|
wi->num_pkts = 1;
|
2018-11-21 05:08:06 -07:00
|
|
|
}
|
2017-03-24 15:52:14 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_free_xdpsq:
|
|
|
|
clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
|
|
|
|
mlx5e_free_xdpsq(sq);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
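The inline-on-congestion policy described in the "XDP, Inline small
packets" change above can be sketched as follows; the helper, constant
and parameter names are illustrative assumptions, not the driver's
actual code.

#include <linux/types.h>

/* Illustrative sketch only: small packets are copied inline into the WQE
 * only while the SQ is congested beyond a watermark, trading CPU cycles
 * (the memcpy) for one less HW data prefetch per packet.
 */
#define SKETCH_XDP_INLINE_MAX_SZ	256	/* assumed ~256B threshold */

static inline bool sketch_xdp_should_inline(u32 sq_outstanding_descs,
					    u32 congestion_watermark,
					    u32 pkt_len)
{
	return pkt_len <= SKETCH_XDP_INLINE_MAX_SZ &&
	       sq_outstanding_descs >= congestion_watermark;
}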
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq)
|
2017-03-24 15:52:14 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_channel *c = sq->channel;
|
|
|
|
|
|
|
|
clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
|
|
|
|
napi_synchronize(&c->napi);
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_sq(c->mdev, sq->sqn);
|
2019-06-26 08:35:33 -06:00
|
|
|
mlx5e_free_xdpsq_descs(sq);
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_free_xdpsq(sq);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-03-28 02:23:55 -06:00
|
|
|
static int mlx5e_alloc_cq_common(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_cq_param *param,
|
|
|
|
struct mlx5e_cq *cq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
struct mlx5_core_cq *mcq = &cq->mcq;
|
|
|
|
int eqn_not_used;
|
2016-01-17 02:25:47 -07:00
|
|
|
unsigned int irqn;
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
|
|
|
u32 i;
|
|
|
|
|
2018-10-16 14:20:20 -06:00
|
|
|
err = mlx5_vector2eqn(mdev, param->eq_ix, &eqn_not_used, &irqn);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
err = mlx5_cqwq_create(mdev, &param->wq, param->cqc, &cq->wq,
|
|
|
|
&cq->wq_ctrl);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
mcq->cqe_sz = 64;
|
|
|
|
mcq->set_ci_db = cq->wq_ctrl.db.db;
|
|
|
|
mcq->arm_db = cq->wq_ctrl.db.db + 1;
|
|
|
|
*mcq->set_ci_db = 0;
|
|
|
|
*mcq->arm_db = 0;
|
|
|
|
mcq->vector = param->eq_ix;
|
|
|
|
mcq->comp = mlx5e_completion_event;
|
|
|
|
mcq->event = mlx5e_cq_error_event;
|
|
|
|
mcq->irqn = irqn;
|
|
|
|
|
|
|
|
for (i = 0; i < mlx5_cqwq_get_size(&cq->wq); i++) {
|
|
|
|
struct mlx5_cqe64 *cqe = mlx5_cqwq_get_wqe(&cq->wq, i);
|
|
|
|
|
|
|
|
cqe->op_own = 0xf1;
|
|
|
|
}
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
cq->mdev = mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-28 02:23:55 -06:00
|
|
|
static int mlx5e_alloc_cq(struct mlx5e_channel *c,
|
|
|
|
struct mlx5e_cq_param *param,
|
|
|
|
struct mlx5e_cq *cq)
|
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = c->priv->mdev;
|
|
|
|
int err;
|
|
|
|
|
2017-11-09 23:59:52 -07:00
|
|
|
param->wq.buf_numa_node = cpu_to_node(c->cpu);
|
|
|
|
param->wq.db_numa_node = cpu_to_node(c->cpu);
|
2017-03-28 02:23:55 -06:00
|
|
|
param->eq_ix = c->ix;
|
|
|
|
|
|
|
|
err = mlx5e_alloc_cq_common(mdev, param, cq);
|
|
|
|
|
|
|
|
cq->napi = &c->napi;
|
|
|
|
cq->channel = c;
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} are what actually create/destroy the SQ
in FW, so we rename them to align the function names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
static void mlx5e_free_cq(struct mlx5e_cq *cq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-04-04 03:54:23 -06:00
|
|
|
mlx5_wq_destroy(&cq->wq_ctrl);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
static int mlx5e_create_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2019-06-30 10:23:25 -06:00
|
|
|
u32 out[MLX5_ST_SZ_DW(create_cq_out)];
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = cq->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5_core_cq *mcq = &cq->mcq;
|
|
|
|
|
|
|
|
void *in;
|
|
|
|
void *cqc;
|
|
|
|
int inlen;
|
2016-01-17 02:25:47 -07:00
|
|
|
unsigned int irqn_not_used;
|
2015-05-28 13:28:48 -06:00
|
|
|
int eqn;
|
|
|
|
int err;
|
|
|
|
|
2018-10-16 14:20:20 -06:00
|
|
|
err = mlx5_vector2eqn(mdev, param->eq_ix, &eqn, &irqn_not_used);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_cq_in) +
|
2018-04-04 03:54:23 -06:00
|
|
|
sizeof(u64) * cq->wq_ctrl.buf.npages;
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context);
|
|
|
|
|
|
|
|
memcpy(cqc, param->cqc, sizeof(param->cqc));
|
|
|
|
|
2018-04-04 03:54:23 -06:00
|
|
|
mlx5_fill_page_frag_array(&cq->wq_ctrl.buf,
|
2016-11-30 08:59:37 -07:00
|
|
|
(__be64 *)MLX5_ADDR_OF(create_cq_in, in, pas));
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-06-23 08:02:40 -06:00
|
|
|
MLX5_SET(cqc, cqc, cq_period_mode, param->cq_period_mode);
|
2015-05-28 13:28:48 -06:00
|
|
|
MLX5_SET(cqc, cqc, c_eqn, eqn);
|
2017-01-03 14:55:27 -07:00
|
|
|
MLX5_SET(cqc, cqc, uar_page, mdev->priv.uar->index);
|
2018-04-04 03:54:23 -06:00
|
|
|
MLX5_SET(cqc, cqc, log_page_size, cq->wq_ctrl.buf.page_shift -
|
2015-07-29 06:05:40 -06:00
|
|
|
MLX5_ADAPTER_PAGE_SHIFT);
|
2015-05-28 13:28:48 -06:00
|
|
|
MLX5_SET64(cqc, cqc, dbr_addr, cq->wq_ctrl.db.dma);
|
|
|
|
|
2019-06-30 10:23:25 -06:00
|
|
|
err = mlx5_core_create_cq(mdev, mcq, in, inlen, out, sizeof(out));
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
mlx5e_cq_arm(cq);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
static void mlx5e_destroy_cq(struct mlx5e_cq *cq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5_core_destroy_cq(cq->mdev, &cq->mcq);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-07-03
The following pull-request contains BPF updates for your *net-next* tree.
There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:
** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
struct dim_cq_moder moder,
struct mlx5e_cq_param *param,
struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...
drivers/net/ethernet/mellanox/mlx5/core/en.h +977
... and in mlx5e_open_xsk() ...
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64
... needs the same rename from net_dim_cq_moder into dim_cq_moder.
** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.
Let me know if you run into any issues. Anyway, the main changes are:
1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.
2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed for every RTT to help tracking TCP statistics, both features
from Stanislav.
3) Follow-up fix for loops in precision tracking, which were not
propagating precision marks; as a result the verifier assumed that some
branches were not taken and therefore wrongly removed them as dead
code, from Alexei.
4) Fix a BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released and a new BPF
program is attached to one of the ancestor cgroups in parallel, from Roman.
5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.
6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.
7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.
8) Various cleanups and minor fixes all over the place from many others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 13:48:21 -06:00
|
|
|
int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder,
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev = c->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err = mlx5e_alloc_cq(c, param, cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err = mlx5e_create_cq(cq, param);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_free_cq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-03-01 15:13:37 -07:00
|
|
|
if (MLX5_CAP_GEN(mdev, cq_moderation))
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5_core_modify_cq_moderation(mdev, &cq->mcq, moder.usec, moder.pkts);
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err_free_cq:
|
|
|
|
mlx5e_free_cq(cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
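As a rough illustration of the RQ ID namespace split described in the commit
message above, the following sketch shows how a channel index could be
translated to and from an RQ ID. The example_* helpers are hypothetical and
are not the driver's actual API; they only show why changing the number of
channels would break the mapping while zero-copy sockets are active.

#include <linux/types.h>

/* Hypothetical helpers, for illustration only: the lower half of the RQ
 * ID space holds regular RQs, the upper half holds XSK RQs.
 */
static inline unsigned int example_rq_id(unsigned int ch_ix,
					 unsigned int num_channels,
					 bool is_xsk)
{
	return is_xsk ? ch_ix + num_channels : ch_ix;
}

static inline unsigned int example_rq_id_to_channel(unsigned int rq_id,
						    unsigned int num_channels)
{
	/* This reverse mapping is what changing num_channels would break. */
	return rq_id % num_channels;
}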
|
|
|
void mlx5e_close_cq(struct mlx5e_cq *cq)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
mlx5e_destroy_cq(cq);
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} used to actually create/destroy the SQ
in FW, so we rename them to align the functions names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
mlx5e_free_cq(cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
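To make the renaming scheme above concrete, here is a minimal sketch of how
an "open"/"close" pair composes the renamed helpers: "open" allocates the SW
object and then creates the FW object, while "close" destroys the FW object
and then frees the SW object. The example_* names are hypothetical and error
handling is trimmed; this is not the driver's exact code.

static int example_open_cq(struct example_cq *cq)
{
	int err;

	err = example_alloc_cq(cq);	/* SW resources (was "create") */
	if (err)
		return err;

	err = example_create_cq(cq);	/* FW object (was "enable") */
	if (err)
		example_free_cq(cq);

	return err;
}

static void example_close_cq(struct example_cq *cq)
{
	example_destroy_cq(cq);		/* FW object (was "disable") */
	example_free_cq(cq);		/* SW resources (was "destroy") */
}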
|
|
|
|
|
|
|
|
static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_channel_param *cparam)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
int tc;
|
|
|
|
|
|
|
|
for (tc = 0; tc < c->num_tc; tc++) {
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_open_cq(c, params->tx_cq_moderation,
|
|
|
|
&cparam->tx_cq, &c->sq[tc].cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_tx_cqs;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_close_tx_cqs:
|
|
|
|
for (tc--; tc >= 0; tc--)
|
|
|
|
mlx5e_close_cq(&c->sq[tc].cq);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_close_tx_cqs(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
int tc;
|
|
|
|
|
|
|
|
for (tc = 0; tc < c->num_tc; tc++)
|
|
|
|
mlx5e_close_cq(&c->sq[tc].cq);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_open_sqs(struct mlx5e_channel *c,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_channel_param *cparam)
|
|
|
|
{
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_priv *priv = c->priv;
|
2019-07-14 02:43:43 -06:00
|
|
|
int err, tc;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
for (tc = 0; tc < params->num_tc; tc++) {
|
2019-07-14 02:43:43 -06:00
|
|
|
int txq_ix = c->ix + tc * priv->max_nch;
|
2016-12-20 13:48:19 -07:00
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
err = mlx5e_open_txqsq(c, c->priv->tisn[c->lag_port][tc], txq_ix,
|
2018-04-12 07:03:37 -06:00
|
|
|
params, &cparam->sq, &c->sq[tc], tc);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_sqs;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_close_sqs:
|
|
|
|
for (tc--; tc >= 0; tc--)
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_close_txqsq(&c->sq[tc]);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_close_sqs(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
int tc;
|
|
|
|
|
|
|
|
for (tc = 0; tc < c->num_tc; tc++)
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_close_txqsq(&c->sq[tc]);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2016-06-23 08:02:38 -06:00
|
|
|
static int mlx5e_set_sq_maxrate(struct net_device *dev,
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_txqsq *sq, u32 rate)
|
2016-06-23 08:02:38 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2017-03-24 15:52:13 -06:00
|
|
|
struct mlx5e_modify_sq_param msp = {0};
|
2018-03-19 07:10:29 -06:00
|
|
|
struct mlx5_rate_limit rl = {0};
|
2016-06-23 08:02:38 -06:00
|
|
|
u16 rl_index = 0;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (rate == sq->rate_limit)
|
|
|
|
/* nothing to do */
|
|
|
|
return 0;
|
|
|
|
|
2018-03-19 07:10:29 -06:00
|
|
|
if (sq->rate_limit) {
|
|
|
|
rl.rate = sq->rate_limit;
|
2016-06-23 08:02:38 -06:00
|
|
|
/* remove current rl index to free space to next ones */
|
2018-03-19 07:10:29 -06:00
|
|
|
mlx5_rl_remove_rate(mdev, &rl);
|
|
|
|
}
|
2016-06-23 08:02:38 -06:00
|
|
|
|
|
|
|
sq->rate_limit = 0;
|
|
|
|
|
|
|
|
if (rate) {
|
2018-03-19 07:10:29 -06:00
|
|
|
rl.rate = rate;
|
|
|
|
err = mlx5_rl_add_rate(mdev, &rl_index, &rl);
|
2016-06-23 08:02:38 -06:00
|
|
|
if (err) {
|
|
|
|
netdev_err(dev, "Failed configuring rate %u: %d\n",
|
|
|
|
rate, err);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-03-24 15:52:13 -06:00
|
|
|
msp.curr_state = MLX5_SQC_STATE_RDY;
|
|
|
|
msp.next_state = MLX5_SQC_STATE_RDY;
|
|
|
|
msp.rl_index = rl_index;
|
|
|
|
msp.rl_update = true;
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_modify_sq(mdev, sq->sqn, &msp);
|
2016-06-23 08:02:38 -06:00
|
|
|
if (err) {
|
|
|
|
netdev_err(dev, "Failed configuring rate %u: %d\n",
|
|
|
|
rate, err);
|
|
|
|
/* remove the rate from the table */
|
|
|
|
if (rate)
|
2018-03-19 07:10:29 -06:00
|
|
|
mlx5_rl_remove_rate(mdev, &rl);
|
2016-06-23 08:02:38 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
sq->rate_limit = rate;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_set_tx_maxrate(struct net_device *dev, int index, u32 rate)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2016-12-20 13:48:19 -07:00
|
|
|
struct mlx5e_txqsq *sq = priv->txq2sq[index];
|
2016-06-23 08:02:38 -06:00
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
if (!mlx5_rl_is_supported(mdev)) {
|
|
|
|
netdev_err(dev, "Rate limiting is not supported on this device\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* rate is given in Mb/sec, HW config is in Kb/sec */
|
|
|
|
rate = rate << 10;
|
|
|
|
|
|
|
|
/* Check whether rate in valid range, 0 is always valid */
|
|
|
|
if (rate && !mlx5_rl_is_in_range(mdev, rate)) {
|
|
|
|
netdev_err(dev, "TX rate %u, is not in range\n", rate);
|
|
|
|
return -ERANGE;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
|
|
|
if (test_bit(MLX5E_STATE_OPENED, &priv->state))
|
|
|
|
err = mlx5e_set_sq_maxrate(dev, sq, rate);
|
|
|
|
if (!err)
|
|
|
|
priv->tx_rates[index] = rate;
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2018-10-25 06:03:38 -06:00
|
|
|
static int mlx5e_alloc_xps_cpumask(struct mlx5e_channel *c,
|
|
|
|
struct mlx5e_params *params)
|
|
|
|
{
|
|
|
|
int num_comp_vectors = mlx5_comp_vectors_count(c->mdev);
|
|
|
|
int irq;
|
|
|
|
|
|
|
|
if (!zalloc_cpumask_var(&c->xps_cpumask, GFP_KERNEL))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
for (irq = c->ix; irq < num_comp_vectors; irq += params->num_channels) {
|
|
|
|
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(c->mdev, irq));
|
|
|
|
|
|
|
|
cpumask_set_cpu(cpu, c->xps_cpumask);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_free_xps_cpumask(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
free_cpumask_var(c->xps_cpumask);
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:36 -06:00
|
|
|
static int mlx5e_open_queues(struct mlx5e_channel *c,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_channel_param *cparam)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2019-01-31 07:44:48 -07:00
|
|
|
struct dim_cq_moder icocq_moder = {0, 0};
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2019-06-26 08:35:36 -06:00
|
|
|
return err;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_open_tx_cqs(c, params, cparam);
|
2016-04-20 13:02:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_icosq_cq;
|
|
|
|
|
2018-05-22 07:48:48 -06:00
|
|
|
err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam->tx_cq, &c->xdpsq.cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_tx_cqs;
|
|
|
|
|
2018-05-22 07:48:48 -06:00
|
|
|
err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq, &c->rq.cq);
|
|
|
|
if (err)
|
|
|
|
goto err_close_xdp_tx_cqs;
|
|
|
|
|
2016-11-03 17:48:43 -06:00
|
|
|
/* XDP SQ CQ params are same as normal TXQ sq CQ params */
|
2016-12-21 08:24:35 -07:00
|
|
|
err = c->xdp ? mlx5e_open_cq(c, params->tx_cq_moderation,
|
2019-06-26 08:35:33 -06:00
|
|
|
&cparam->tx_cq, &c->rq_xdpsq.cq) : 0;
|
2016-11-03 17:48:43 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_rx_cq;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
napi_enable(&c->napi);
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->icosq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_disable_napi;
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
err = mlx5e_open_sqs(c, params, cparam);
|
2016-04-20 13:02:14 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_icosq;
|
|
|
|
|
2019-06-26 08:35:33 -06:00
|
|
|
if (c->xdp) {
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL,
|
2019-06-26 08:35:33 -06:00
|
|
|
&c->rq_xdpsq, false);
|
|
|
|
if (err)
|
|
|
|
goto err_close_sqs;
|
|
|
|
}
|
2016-09-21 03:19:48 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_open_rq(c, params, &cparam->rq, NULL, NULL, &c->rq);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
2016-09-21 03:19:48 -06:00
|
|
|
goto err_close_xdp_sq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, NULL, &c->xdpsq, true);
|
2018-05-22 07:48:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_rq;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
2018-05-22 07:48:48 -06:00
|
|
|
|
|
|
|
err_close_rq:
|
|
|
|
mlx5e_close_rq(&c->rq);
|
|
|
|
|
2016-09-21 03:19:48 -06:00
|
|
|
err_close_xdp_sq:
|
2016-11-03 17:48:43 -06:00
|
|
|
if (c->xdp)
|
2019-06-26 08:35:33 -06:00
|
|
|
mlx5e_close_xdpsq(&c->rq_xdpsq);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
err_close_sqs:
|
|
|
|
mlx5e_close_sqs(c);
|
|
|
|
|
2016-04-20 13:02:14 -06:00
|
|
|
err_close_icosq:
|
2017-03-24 15:52:14 -06:00
|
|
|
mlx5e_close_icosq(&c->icosq);
|
2016-04-20 13:02:14 -06:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
err_disable_napi:
|
|
|
|
napi_disable(&c->napi);
|
2019-06-26 08:35:36 -06:00
|
|
|
|
2016-11-03 17:48:43 -06:00
|
|
|
if (c->xdp)
|
2019-06-26 08:35:33 -06:00
|
|
|
mlx5e_close_cq(&c->rq_xdpsq.cq);
|
2016-11-03 17:48:43 -06:00
|
|
|
|
|
|
|
err_close_rx_cq:
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5e_close_cq(&c->rq.cq);
|
|
|
|
|
2018-05-22 07:48:48 -06:00
|
|
|
err_close_xdp_tx_cqs:
|
|
|
|
mlx5e_close_cq(&c->xdpsq.cq);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
err_close_tx_cqs:
|
|
|
|
mlx5e_close_tx_cqs(c);
|
|
|
|
|
2016-04-20 13:02:14 -06:00
|
|
|
err_close_icosq_cq:
|
|
|
|
mlx5e_close_cq(&c->icosq.cq);
|
|
|
|
|
2019-06-26 08:35:36 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_close_queues(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
mlx5e_close_xdpsq(&c->xdpsq);
|
|
|
|
mlx5e_close_rq(&c->rq);
|
|
|
|
if (c->xdp)
|
|
|
|
mlx5e_close_xdpsq(&c->rq_xdpsq);
|
|
|
|
mlx5e_close_sqs(c);
|
|
|
|
mlx5e_close_icosq(&c->icosq);
|
|
|
|
napi_disable(&c->napi);
|
|
|
|
if (c->xdp)
|
|
|
|
mlx5e_close_cq(&c->rq_xdpsq.cq);
|
|
|
|
mlx5e_close_cq(&c->rq.cq);
|
|
|
|
mlx5e_close_cq(&c->xdpsq.cq);
|
|
|
|
mlx5e_close_tx_cqs(c);
|
|
|
|
mlx5e_close_cq(&c->icosq.cq);
|
|
|
|
}
|
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
static u8 mlx5e_enumerate_lag_port(struct mlx5_core_dev *mdev, int ix)
|
|
|
|
{
|
|
|
|
u16 port_aff_bias = mlx5_core_is_pf(mdev) ? 0 : MLX5_CAP_GEN(mdev, vhca_id);
|
|
|
|
|
|
|
|
return (ix + port_aff_bias) % mlx5e_get_num_lag_ports(mdev);
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:36 -06:00
|
|
|
static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_channel_param *cparam,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
struct xdp_umem *umem,
|
2019-06-26 08:35:36 -06:00
|
|
|
struct mlx5e_channel **cp)
|
|
|
|
{
|
|
|
|
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
|
|
|
|
struct net_device *netdev = priv->netdev;
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk_param xsk;
|
2019-06-26 08:35:36 -06:00
|
|
|
struct mlx5e_channel *c;
|
|
|
|
unsigned int irq;
|
|
|
|
int err;
|
|
|
|
int eqn;
|
|
|
|
|
|
|
|
err = mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
c = kvzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
|
|
|
|
if (!c)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
c->priv = priv;
|
|
|
|
c->mdev = priv->mdev;
|
|
|
|
c->tstamp = &priv->tstamp;
|
|
|
|
c->ix = ix;
|
|
|
|
c->cpu = cpu;
|
|
|
|
c->pdev = priv->mdev->device;
|
|
|
|
c->netdev = priv->netdev;
|
|
|
|
c->mkey_be = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
|
|
|
|
c->num_tc = params->num_tc;
|
|
|
|
c->xdp = !!params->xdp_prog;
|
|
|
|
c->stats = &priv->channel_stats[ix].ch;
|
|
|
|
c->irq_desc = irq_to_desc(irq);
|
2019-08-07 08:46:15 -06:00
|
|
|
c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix);
|
2019-06-26 08:35:36 -06:00
|
|
|
|
|
|
|
err = mlx5e_alloc_xps_cpumask(c, params);
|
|
|
|
if (err)
|
|
|
|
goto err_free_channel;
|
|
|
|
|
|
|
|
netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
|
|
|
|
|
|
|
|
err = mlx5e_open_queues(c, params, cparam);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto err_napi_del;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
if (umem) {
|
|
|
|
mlx5e_build_xsk_param(umem, &xsk);
|
|
|
|
err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto err_close_queues;
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:36 -06:00
|
|
|
*cp = c;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err_close_queues:
|
|
|
|
mlx5e_close_queues(c);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
err_napi_del:
|
|
|
|
netif_napi_del(&c->napi);
|
2018-10-25 06:03:38 -06:00
|
|
|
mlx5e_free_xps_cpumask(c);
|
|
|
|
|
|
|
|
err_free_channel:
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(c);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
static void mlx5e_activate_channel(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
int tc;
|
|
|
|
|
|
|
|
for (tc = 0; tc < c->num_tc; tc++)
|
|
|
|
mlx5e_activate_txqsq(&c->sq[tc]);
|
2019-07-02 06:47:29 -06:00
|
|
|
mlx5e_activate_icosq(&c->icosq);
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_activate_rq(&c->rq);
|
2018-10-25 06:03:38 -06:00
|
|
|
netif_set_xps_queue(c->netdev, c->xps_cpumask, c->ix);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
|
|
|
|
if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
|
|
|
|
mlx5e_activate_xsk(c);
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
|
|
|
|
{
|
|
|
|
int tc;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
|
|
|
|
mlx5e_deactivate_xsk(c);
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_deactivate_rq(&c->rq);
|
2019-07-02 06:47:29 -06:00
|
|
|
mlx5e_deactivate_icosq(&c->icosq);
|
2016-12-20 13:48:19 -07:00
|
|
|
for (tc = 0; tc < c->num_tc; tc++)
|
|
|
|
mlx5e_deactivate_txqsq(&c->sq[tc]);
|
|
|
|
}
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
static void mlx5e_close_channel(struct mlx5e_channel *c)
|
|
|
|
{
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
|
|
|
|
mlx5e_close_xsk(c);
|
2019-06-26 08:35:36 -06:00
|
|
|
mlx5e_close_queues(c);
|
2015-05-28 13:28:48 -06:00
|
|
|
netif_napi_del(&c->napi);
|
2018-10-25 06:03:38 -06:00
|
|
|
mlx5e_free_xps_cpumask(c);
|
2015-11-18 07:30:55 -07:00
|
|
|
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(c);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This required removing support for HW LRO in the legacy RQ, as it
would require a large number of page allocations and scatter entries
per WQE on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping at RQ creation time, instead of handling it
in the datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good for performance reasons as
well, hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
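The sketch below (illustrative only; the struct and field names are invented)
shows the idea of precomputing the frag-to-page mapping once at init time,
including which frag is the last user of a shared page and therefore the one
that releases it on completion:

#include <stdbool.h>
#include <stdio.h>

#define DEMO_PAGE_SIZE 4096u

struct frag_map {
	unsigned int page_index;	/* which shared page this frag lives in */
	unsigned int offset;		/* offset of the frag within that page */
	bool last_in_page;		/* this frag releases the page when completed */
};

static void init_frag_map(struct frag_map *map, unsigned int nfrags,
			  unsigned int frag_stride)
{
	unsigned int off = 0, page = 0;
	unsigned int i;

	for (i = 0; i < nfrags; i++) {
		if (off + frag_stride > DEMO_PAGE_SIZE) {
			page++;
			off = 0;
		}
		map[i].page_index = page;
		map[i].offset = off;
		off += frag_stride;
		map[i].last_in_page = (i + 1 == nfrags) ||
				      (off + frag_stride > DEMO_PAGE_SIZE);
	}
}

int main(void)
{
	struct frag_map map[8];
	int i;

	init_frag_map(map, 8, 2048);	/* e.g. 8 frags of 2KB over 4KB pages */
	for (i = 0; i < 8; i++)
		printf("frag %d -> page %u, offset %u, last_in_page=%d\n", i,
		       map[i].page_index, map[i].offset, map[i].last_in_page);
	return 0;
}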
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
#define DEFAULT_FRAG_SIZE (2048)
|
|
|
|
|
|
|
|
static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk_param *xsk,
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_rq_frags_info *info)
|
|
|
|
{
|
|
|
|
u32 byte_count = MLX5E_SW2HW_MTU(params, params->sw_mtu);
|
|
|
|
int frag_size_max = DEFAULT_FRAG_SIZE;
|
|
|
|
u32 buf_size = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
if (MLX5_IPSEC_DEV(mdev))
|
|
|
|
byte_count += MLX5E_METADATA_ETHER_LEN;
|
|
|
|
#endif
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
if (mlx5e_rx_is_linear_skb(params, xsk)) {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
int frag_stride;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
frag_stride = roundup_pow_of_two(frag_stride);
|
|
|
|
|
|
|
|
info->arr[0].frag_size = byte_count;
|
|
|
|
info->arr[0].frag_stride = frag_stride;
|
|
|
|
info->num_frags = 1;
|
|
|
|
info->wqe_bulk = PAGE_SIZE / frag_stride;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (byte_count > PAGE_SIZE +
|
|
|
|
(MLX5E_MAX_RX_FRAGS - 1) * frag_size_max)
|
|
|
|
frag_size_max = PAGE_SIZE;
|
|
|
|
|
|
|
|
i = 0;
|
|
|
|
while (buf_size < byte_count) {
|
|
|
|
int frag_size = byte_count - buf_size;
|
|
|
|
|
|
|
|
if (i < MLX5E_MAX_RX_FRAGS - 1)
|
|
|
|
frag_size = min(frag_size, frag_size_max);
|
|
|
|
|
|
|
|
info->arr[i].frag_size = frag_size;
|
|
|
|
info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
|
|
|
|
|
|
|
|
buf_size += frag_size;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
info->num_frags = i;
|
|
|
|
/* number of different wqes sharing a page */
|
|
|
|
info->wqe_bulk = 1 + (info->num_frags % 2);
|
|
|
|
|
|
|
|
out:
|
|
|
|
info->wqe_bulk = max_t(u8, info->wqe_bulk, 8);
|
|
|
|
info->log_num_frags = order_base_2(info->num_frags);
|
|
|
|
}
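For experimenting with this layout outside the kernel, here is a rough
standalone re-implementation of the non-linear branch above; the 4KB page size
and the 9600-byte count are arbitrary assumptions made for the demo:

#include <stdio.h>

#define DEMO_PAGE_SIZE    4096u
#define MAX_RX_FRAGS      4u
#define DEFAULT_FRAG_SZ   2048u

static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned int byte_count = 9600;	/* arbitrary jumbo-frame byte count */
	unsigned int frag_size_max = DEFAULT_FRAG_SZ;
	unsigned int buf_size = 0, i = 0, wqe_bulk;

	/* Same escape hatch as above: if the frame cannot fit in one page plus
	 * the remaining smaller frags, switch to page-sized frags. */
	if (byte_count > DEMO_PAGE_SIZE + (MAX_RX_FRAGS - 1) * frag_size_max)
		frag_size_max = DEMO_PAGE_SIZE;

	while (buf_size < byte_count && i < MAX_RX_FRAGS) {
		unsigned int frag_size = byte_count - buf_size;

		if (i < MAX_RX_FRAGS - 1 && frag_size > frag_size_max)
			frag_size = frag_size_max;

		printf("frag[%u]: size=%u stride=%u\n",
		       i, frag_size, roundup_pow2(frag_size));
		buf_size += frag_size;
		i++;
	}

	wqe_bulk = 1 + (i % 2);		/* number of WQEs sharing a page */
	if (wqe_bulk < 8)
		wqe_bulk = 8;
	printf("num_frags=%u wqe_bulk=%u\n", i, wqe_bulk);
	return 0;
}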
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
static inline u8 mlx5e_get_rqwq_log_stride(u8 wq_type, int ndsegs)
|
|
|
|
{
|
|
|
|
int sz = sizeof(struct mlx5_wqe_data_seg) * ndsegs;
|
|
|
|
|
|
|
|
switch (wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
|
|
|
sz += sizeof(struct mlx5e_rx_wqe_ll);
|
|
|
|
break;
|
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
|
|
|
sz += sizeof(struct mlx5e_rx_wqe_cyc);
|
|
|
|
}
|
|
|
|
|
|
|
|
return order_base_2(sz);
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffer mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve the packet rate at high CPU scale.
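A toy model of the bulked re-posting (names invented, with no relation to the
actual driver structures): consumed MPWQEs are re-posted in bulks of at least
two, with any remainder carried over to the next NAPI cycle:

#include <stdio.h>

#define UMR_BULK 2u

/* Round down to a multiple of the bulk size; the leftover MPWQEs wait for
 * the next NAPI cycle, keeping the bulked UMR posts together. */
static unsigned int umr_posts_this_cycle(unsigned int consumed_mpwqes)
{
	return (consumed_mpwqes / UMR_BULK) * UMR_BULK;
}

int main(void)
{
	unsigned int consumed;

	for (consumed = 0; consumed <= 5; consumed++)
		printf("consumed=%u -> post %u UMR WQEs now\n",
		       consumed, umr_posts_this_cycle(consumed));
	return 0;
}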
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
static u8 mlx5e_get_rq_log_wq_sz(void *rqc)
|
|
|
|
{
|
|
|
|
void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
|
|
|
|
|
|
|
return MLX5_GET(wq, wq, log_wq_sz);
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_rq_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_xsk_param *xsk,
|
|
|
|
struct mlx5e_rq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-02-07 04:21:30 -07:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
void *rqc = param->rqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
2018-04-02 08:31:31 -06:00
|
|
|
int ndsegs = 1;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
switch (params->rq_wq_type) {
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve a single
received packet of any size up to MTU (or 64K when device LRO is
enabled), which is very wasteful when dealing with small packets or
when device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQE memory footprint went from 1024 * MTU (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
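The quoted footprints can be reproduced with a couple of lines of arithmetic,
assuming a 1500-byte MTU for the legacy figure:

#include <stdio.h>

int main(void)
{
	unsigned long legacy = 1024UL * 1500;		/* 1024 WQEs * MTU      */
	unsigned long mpwqe  = 16UL * 2048 * 64;	/* WQEs * strides * 64B */

	printf("legacy RQ buffers: %lu bytes (~%.1f MB)\n",
	       legacy, legacy / (1024.0 * 1024.0));
	printf("MPWQE  RQ buffers: %lu bytes (%.1f MB)\n",
	       mpwqe, mpwqe / (1024.0 * 1024.0));
	return 0;
}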
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
This is expected, as the driver can now receive as many small packets
(<=64B) as the total number of strides in the ring (default = 2048 * 16),
vs. 1024 (the default ring size, regardless of packet size) before this
feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
2018-02-07 04:21:30 -07:00
|
|
|
MLX5_SET(wq, wq, log_wqe_num_of_strides,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk) -
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and requires memcpy'ing the packet headers into
the SKB linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailroom.
This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data resides contiguously in the linear part, and is not split into
a small linear part and a large fragment.
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
MLX5_MPWQE_LOG_NUM_STRIDES_BASE);
|
2018-02-07 04:21:30 -07:00
|
|
|
MLX5_SET(wq, wq, log_wqe_stride_size,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk) -
|
net/mlx5e: Use linear SKB in Striding RQ
2018-02-07 05:41:25 -07:00
|
|
|
MLX5_MPWQE_LOG_STRIDE_SZ_BASE);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
MLX5_SET(wq, wq, log_wq_sz, mlx5e_mpwqe_get_log_rq_size(params, xsk));
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
2016-04-20 13:02:13 -06:00
|
|
|
break;
|
2018-04-02 08:31:31 -06:00
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
2018-02-11 06:21:33 -07:00
|
|
|
MLX5_SET(wq, wq, log_wq_sz, params->log_rq_mtu_frames);
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
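To illustrate the RQ ID namespace split described above, the following is a
minimal, self-contained C sketch; the helper names and the channel count are
hypothetical and do not come from the driver, they only mimic the
lower-half/upper-half mapping:

#include <stdio.h>

/* Hypothetical illustration: the lower half of the ID space maps to the
 * regular RQs, the upper half to the XSK RQs of the same channels. */
static unsigned int regular_rq_id(unsigned int channel_ix)
{
	return channel_ix;                 /* lower half */
}

static unsigned int xsk_rq_id(unsigned int channel_ix, unsigned int max_channels)
{
	return max_channels + channel_ix;  /* upper half */
}

int main(void)
{
	unsigned int max_channels = 8;     /* assumed channel count */

	for (unsigned int c = 0; c < max_channels; c++)
		printf("channel %u: regular RQ id %u, XSK RQ id %u\n",
		       c, regular_rq_id(c), xsk_rq_id(c, max_channels));

	/* Changing max_channels would shift the upper half, which is why the
	 * channel count is frozen while any zero-copy socket is active. */
	return 0;
}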
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_build_rq_frags_info(mdev, params, xsk, ¶m->frags_info);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing support of HW LRO in legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
the mapping between a "struct mlx5e_dma_info" and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it is constant
across different cycles of a WQ. This allows initializing
the mapping at RQ creation time, instead of handling it
in the datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI cycle.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
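As a rough illustration of the constant fragment-to-page mapping computed once
at RQ creation, here is a self-contained sketch; the struct and the constants
are hypothetical and only mimic the idea, not the actual mlx5e_wqe_frag_info
layout:

#include <stdio.h>

/* Hypothetical model: two 2KB fragments share one order-0 page; the
 * fragment that is completed last releases the page. */
struct frag_info {
	int page_index;    /* which shared page backs this fragment       */
	int offset;        /* byte offset of the fragment inside the page */
	int last_in_page;  /* completing this fragment releases the page  */
};

int main(void)
{
	enum { FRAG_CNT = 8, FRAGS_PER_PAGE = 2, FRAG_SZ = 2048 };
	struct frag_info map[FRAG_CNT];

	/* Computed once, valid for every cycle of the cyclic WQ. */
	for (int i = 0; i < FRAG_CNT; i++) {
		map[i].page_index   = i / FRAGS_PER_PAGE;
		map[i].offset       = (i % FRAGS_PER_PAGE) * FRAG_SZ;
		map[i].last_in_page = (i % FRAGS_PER_PAGE) == FRAGS_PER_PAGE - 1;
	}

	for (int i = 0; i < FRAG_CNT; i++)
		printf("frag %d -> page %d, offset %d%s\n", i,
		       map[i].page_index, map[i].offset,
		       map[i].last_in_page ? " (frees page)" : "");
	return 0;
}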
2018-05-02 09:23:58 -06:00
|
|
|
ndsegs = param->frags_info.num_frags;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as MPWQE or Striding RQ, in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size; every received
packet is aligned to the beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve one received
packet of any size up to MTU, or 64K when device LRO is enabled, which
makes it very wasteful when dealing with small packets or when device
LRO is enabled.
Thanks to its flexibility, MPWQE allows better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
  | num packets | packet loss before | packet loss after
  | 2K          | ~1K                | 0
  | 8K          | ~6K                | 0
  | 16K         | ~13K               | 0
  | 32K         | ~28K               | 0
  | 64K         | ~57K               | ~24K
This is expected, as the driver can receive as many small packets (<=64B) as
the total number of strides in the ring (default = 2048 * 16), vs. 1024
(the default ring size regardless of packet size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
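The footprint arithmetic quoted above can be checked with a trivial standalone
program; the legacy MTU value of 1500 below is an assumption used only for the
comparison:

#include <stdio.h>

int main(void)
{
	/* Default MPWQE configuration quoted in the text. */
	unsigned long num_wqes = 16, strides_per_wqe = 2048, stride_size = 64;
	unsigned long legacy_wqes = 1024, mtu = 1500;

	unsigned long mpwqe_bytes  = num_wqes * strides_per_wqe * stride_size;
	unsigned long legacy_bytes = legacy_wqes * mtu;

	printf("MPWQE ring footprint : %lu bytes (%lu MB)\n",
	       mpwqe_bytes, mpwqe_bytes >> 20);
	printf("legacy ring footprint: %lu bytes (~%.1f MB)\n",
	       legacy_bytes, legacy_bytes / 1048576.0);
	printf("max small packets per ring: %lu (MPWQE) vs. %lu (legacy)\n",
	       num_wqes * strides_per_wqe, legacy_wqes);
	return 0;
}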
2016-04-20 13:02:13 -06:00
|
|
|
}
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
MLX5_SET(wq, wq, wq_type, params->rq_wq_type);
|
2015-05-28 13:28:48 -06:00
|
|
|
MLX5_SET(wq, wq, end_padding_mode, MLX5_WQ_END_PAD_MODE_ALIGN);
|
2018-04-02 08:31:31 -06:00
|
|
|
MLX5_SET(wq, wq, log_wq_stride,
|
|
|
|
mlx5e_get_rqwq_log_stride(params->rq_wq_type, ndsegs));
|
2018-02-07 04:21:30 -07:00
|
|
|
MLX5_SET(wq, wq, pd, mdev->mlx5e_res.pdn);
|
2016-04-20 13:02:10 -06:00
|
|
|
MLX5_SET(rqc, rqc, counter_set_id, priv->q_counter);
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_SET(rqc, rqc, vsd, params->vlan_strip_disable);
|
2017-02-20 07:18:17 -07:00
|
|
|
MLX5_SET(rqc, rqc, scatter_fcs, params->scatter_fcs_en);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-04-29 12:14:05 -06:00
|
|
|
param->wq.buf_numa_node = dev_to_node(mdev->device);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2018-02-08 06:09:57 -07:00
|
|
|
static void mlx5e_build_drop_rq_param(struct mlx5e_priv *priv,
|
2018-01-25 09:00:41 -07:00
|
|
|
struct mlx5e_rq_param *param)
|
2016-03-01 15:13:36 -07:00
|
|
|
{
|
2018-02-08 06:09:57 -07:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2016-03-01 15:13:36 -07:00
|
|
|
void *rqc = param->rqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_CYCLIC);
|
|
|
|
MLX5_SET(wq, wq, log_wq_stride,
|
|
|
|
mlx5e_get_rqwq_log_stride(MLX5_WQ_TYPE_CYCLIC, 1));
|
2018-02-08 06:09:57 -07:00
|
|
|
MLX5_SET(rqc, rqc, counter_set_id, priv->drop_rq_q_counter);
|
2018-01-25 09:00:41 -07:00
|
|
|
|
2019-04-29 12:14:05 -06:00
|
|
|
param->wq.buf_numa_node = dev_to_node(mdev->device);
|
2016-03-01 15:13:36 -07:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_sq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
void *sqc = param->sqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
|
|
|
|
|
|
|
|
MLX5_SET(wq, wq, log_wq_stride, ilog2(MLX5_SEND_WQE_BB));
|
2016-07-01 05:51:04 -06:00
|
|
|
MLX5_SET(wq, wq, pd, priv->mdev->mlx5e_res.pdn);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-04-29 12:14:05 -06:00
|
|
|
param->wq.buf_numa_node = dev_to_node(priv->mdev->device);
|
2016-04-20 13:02:14 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params *params,
|
2016-04-20 13:02:14 -06:00
|
|
|
struct mlx5e_sq_param *param)
|
|
|
|
{
|
|
|
|
void *sqc = param->sqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
|
2019-03-21 16:51:38 -06:00
|
|
|
bool allow_swp;
|
2016-04-20 13:02:14 -06:00
|
|
|
|
2019-03-21 16:51:38 -06:00
|
|
|
allow_swp = mlx5_geneve_tx_allowed(priv->mdev) ||
|
|
|
|
!!MLX5_IPSEC_DEV(priv->mdev);
|
2016-04-20 13:02:14 -06:00
|
|
|
mlx5e_build_sq_param_common(priv, param);
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_SET(wq, wq, log_wq_sz, params->log_sq_size);
|
2019-03-21 16:51:38 -06:00
|
|
|
MLX5_SET(sqc, sqc, allow_swp, allow_swp);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_cq_param *param)
|
|
|
|
{
|
|
|
|
void *cqc = param->cqc;
|
|
|
|
|
2017-01-03 14:55:27 -07:00
|
|
|
MLX5_SET(cqc, cqc, uar_page, priv->mdev->priv.uar->index);
|
2018-11-05 15:05:37 -07:00
|
|
|
if (MLX5_CAP_GEN(priv->mdev, cqe_128_always) && cache_line_size() >= 128)
|
|
|
|
MLX5_SET(cqc, cqc, cqe_sz, CQE_STRIDE_128_PAD);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_xsk_param *xsk,
|
|
|
|
struct mlx5e_cq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-02-11 06:21:33 -07:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2015-05-28 13:28:48 -06:00
|
|
|
void *cqc = param->cqc;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
2016-04-20 13:02:13 -06:00
|
|
|
u8 log_cq_size;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
switch (params->rq_wq_type) {
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
2016-04-20 13:02:13 -06:00
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
log_cq_size = mlx5e_mpwqe_get_log_rq_size(params, xsk) +
|
|
|
|
mlx5e_mpwqe_get_log_num_strides(mdev, params, xsk);
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
2016-04-20 13:02:13 -06:00
|
|
|
break;
|
2018-04-02 08:31:31 -06:00
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
2018-02-11 06:21:33 -07:00
|
|
|
log_cq_size = params->log_rq_mtu_frames;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
2016-04-20 13:02:13 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
MLX5_SET(cqc, cqc, log_cq_size, log_cq_size);
|
2016-12-21 08:24:35 -07:00
|
|
|
if (MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS)) {
|
2016-05-10 15:29:14 -06:00
|
|
|
MLX5_SET(cqc, cqc, mini_cqe_res_format, MLX5_CQE_FORMAT_CSUM);
|
|
|
|
MLX5_SET(cqc, cqc, cqe_comp_en, 1);
|
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
mlx5e_build_common_cq_param(priv, param);
|
2017-09-26 07:20:43 -06:00
|
|
|
param->cq_period_mode = params->rx_cq_moderation.cq_period_mode;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_cq_param *param)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
void *cqc = param->cqc;
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_SET(cqc, cqc, log_cq_size, params->log_sq_size);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
mlx5e_build_common_cq_param(priv, param);
|
2017-09-26 07:20:43 -06:00
|
|
|
param->cq_period_mode = params->tx_cq_moderation.cq_period_mode;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
|
|
|
|
u8 log_wq_size,
|
|
|
|
struct mlx5e_cq_param *param)
|
2016-04-20 13:02:14 -06:00
|
|
|
{
|
|
|
|
void *cqc = param->cqc;
|
|
|
|
|
|
|
|
MLX5_SET(cqc, cqc, log_cq_size, log_wq_size);
|
|
|
|
|
|
|
|
mlx5e_build_common_cq_param(priv, param);
|
2016-06-23 08:02:40 -06:00
|
|
|
|
2018-11-05 03:07:52 -07:00
|
|
|
param->cq_period_mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
|
2016-04-20 13:02:14 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
|
|
|
|
u8 log_wq_size,
|
|
|
|
struct mlx5e_sq_param *param)
|
2016-04-20 13:02:14 -06:00
|
|
|
{
|
|
|
|
void *sqc = param->sqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
|
|
|
|
|
|
|
|
mlx5e_build_sq_param_common(priv, param);
|
|
|
|
|
|
|
|
MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
|
2016-04-20 13:02:15 -06:00
|
|
|
MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
|
2016-04-20 13:02:14 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_sq_param *param)
|
2016-09-21 03:19:48 -06:00
|
|
|
{
|
|
|
|
void *sqc = param->sqc;
|
|
|
|
void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
|
|
|
|
|
|
|
|
mlx5e_build_sq_param_common(priv, param);
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_SET(wq, wq, log_wq_sz, params->log_sq_size);
|
2018-11-20 02:50:30 -07:00
|
|
|
param->is_mpw = MLX5E_GET_PFLAG(params, MLX5E_PFLAG_XDP_TX_MPWQE);
|
2016-09-21 03:19:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffer mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually more than the budget of a single NAPI cycle.
Hence, posting a single UMR WQE per NAPI cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets, more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), leaving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve the packet rate at high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
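A simplified, self-contained sketch of the bulking idea follows; the helper
name post_umr and the bulk threshold are hypothetical and only illustrate
deferring reposts until at least two consumed MPWQEs are ready:

#include <stdio.h>

#define UMR_BULK_MIN 2  /* assumed minimal bulk size, per the text above */

static void post_umr(int mpwqe_index)
{
	printf("  posting UMR for MPWQE %d\n", mpwqe_index);
}

int main(void)
{
	int consumed[] = { 3, 4, 5, 6, 7 };  /* MPWQEs consumed this NAPI cycle */
	int n = sizeof(consumed) / sizeof(consumed[0]);
	int start = 0, pending = 0;

	for (int i = 0; i < n; i++) {
		pending++;
		if (pending >= UMR_BULK_MIN) {
			printf("bulk of %d:\n", pending);
			while (start <= i)
				post_umr(consumed[start++]);
			pending = 0;
		}
	}
	if (pending)  /* leftovers below the threshold wait for the next cycle */
		printf("%d MPWQE(s) deferred to the next NAPI cycle\n", pending);
	return 0;
}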
2019-02-27 03:06:08 -07:00
|
|
|
static u8 mlx5e_build_icosq_log_wq_sz(struct mlx5e_params *params,
|
|
|
|
struct mlx5e_rq_param *rqp)
|
|
|
|
{
|
|
|
|
switch (params->rq_wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
|
|
|
return order_base_2(MLX5E_UMR_WQEBBS) +
|
|
|
|
mlx5e_get_rq_log_wq_sz(rqp->rqc);
|
|
|
|
default: /* MLX5_WQ_TYPE_CYCLIC */
|
|
|
|
return MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_params *params,
|
|
|
|
struct mlx5e_channel_param *cparam)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
net/mlx5e: RX, Support multiple outstanding UMR posts
2019-02-27 03:06:08 -07:00
|
|
|
u8 icosq_log_wq_sz;
|
2016-04-20 13:02:14 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
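As a rough illustration of the ID-namespace split described above, a
hypothetical helper could map a user-visible queue ID onto a channel index
and an RQ flavor; the names below are illustrative, not the driver's actual
API:

#include <stdbool.h>

/* Sketch only: IDs [0, num_channels) address regular RQs, IDs
 * [num_channels, 2 * num_channels) address the XSK RQs of the same channels.
 */
static bool qid_to_channel(unsigned int qid, unsigned int num_channels,
			   unsigned int *ch_ix, bool *is_xsk)
{
	if (qid >= 2 * num_channels)
		return false;			/* out of range */
	*is_xsk = qid >= num_channels;
	*ch_ix = *is_xsk ? qid - num_channels : qid;
	return true;
}

This also makes it clear why the channel count cannot change while zero-copy
sockets are active: resizing would move the boundary between the two halves
and silently remap existing XSK queue IDs.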
|
|
|
mlx5e_build_rq_param(priv, params, NULL, &cparam->rq);
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually more than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), leaving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve the packet rate at high CPU scale.
Performance test:
As expected, a huge improvement at large scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
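A minimal sketch of the bulking idea under the assumptions above (made-up
names, fixed bulk size of two; not the driver's actual refill path):

#define UMR_BULK	2

/* Stub standing in for posting one UMR WQE that maps one MPWQE's pages. */
static void post_umr_wqe(void) { }

/* Repost consumed MPWQEs in bulks of (at least) two, keeping several UMR
 * posts outstanding instead of a single one per NAPI cycle. Returns the
 * updated number of outstanding UMR posts.
 */
static unsigned int repost_mpwqes(unsigned int missing, unsigned int outstanding,
				  unsigned int max_outstanding)
{
	while (missing >= UMR_BULK && outstanding + UMR_BULK <= max_outstanding) {
		post_umr_wqe();
		post_umr_wqe();
		missing -= UMR_BULK;
		outstanding += UMR_BULK;
	}
	return outstanding;
}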
|
|
|
|
|
|
|
icosq_log_wq_sz = mlx5e_build_icosq_log_wq_sz(params, &cparam->rq);
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_sq_param(priv, params, &cparam->sq);
|
|
|
|
mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
|
|
|
|
mlx5e_build_icosq_param(priv, icosq_log_wq_sz, &cparam->icosq);
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_build_rx_cq_param(priv, params, NULL, &cparam->rx_cq);
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
|
|
|
|
mlx5e_build_ico_cq_param(priv, icosq_log_wq_sz, &cparam->icosq_cq);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2016-12-27 05:57:03 -07:00
|
|
|
int mlx5e_open_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *chs)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2016-04-26 09:52:33 -06:00
|
|
|
struct mlx5e_channel_param *cparam;
|
2015-06-23 08:14:14 -06:00
|
|
|
int err = -ENOMEM;
|
2015-05-28 13:28:48 -06:00
|
|
|
int i;
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
chs->num = chs->params.num_channels;
|
2015-06-23 08:14:14 -06:00
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
chs->c = kcalloc(chs->num, sizeof(struct mlx5e_channel *), GFP_KERNEL);
|
2018-06-05 02:47:04 -06:00
|
|
|
cparam = kvzalloc(sizeof(struct mlx5e_channel_param), GFP_KERNEL);
|
2016-12-20 13:48:19 -07:00
|
|
|
if (!chs->c || !cparam)
|
|
|
|
goto err_free;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_channel_param(priv, &chs->params, cparam);
|
2017-02-06 04:14:34 -07:00
|
|
|
for (i = 0; i < chs->num; i++) {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
struct xdp_umem *umem = NULL;
|
|
|
|
|
|
|
|
if (chs->params.xdp_prog)
|
|
|
|
umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
|
|
|
|
|
|
|
|
err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (err)
|
|
|
|
goto err_close_channels;
|
|
|
|
}
|
|
|
|
|
2019-07-11 08:17:36 -06:00
|
|
|
mlx5e_health_channels_update(priv);
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(cparam);
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_close_channels:
|
|
|
|
for (i--; i >= 0; i--)
|
2017-02-06 04:14:34 -07:00
|
|
|
mlx5e_close_channel(chs->c[i]);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
err_free:
|
2017-02-06 04:14:34 -07:00
|
|
|
kfree(chs->c);
|
2018-06-05 02:47:04 -06:00
|
|
|
kvfree(cparam);
|
2017-02-06 04:14:34 -07:00
|
|
|
chs->num = 0;
|
2015-05-28 13:28:48 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
static void mlx5e_activate_channels(struct mlx5e_channels *chs)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
for (i = 0; i < chs->num; i++)
|
|
|
|
mlx5e_activate_channel(chs->c[i]);
|
|
|
|
}
|
|
|
|
|
2019-03-05 05:46:16 -07:00
|
|
|
#define MLX5E_RQ_WQES_TIMEOUT 20000 /* msecs */
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
static int mlx5e_wait_channels_min_rx_wqes(struct mlx5e_channels *chs)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
int i;
|
|
|
|
|
2019-03-05 05:46:16 -07:00
|
|
|
for (i = 0; i < chs->num; i++) {
|
|
|
|
int timeout = err ? 0 : MLX5E_RQ_WQES_TIMEOUT;
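/* If a previous RQ already timed out (err != 0), poll the remaining RQs
 * without waiting so one stuck RQ does not multiply the total delay.
 */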
|
|
|
|
|
|
|
|
err |= mlx5e_wait_for_min_rx_wqes(&chs->c[i]->rq, timeout);
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
|
|
|
|
/* Don't wait on the XSK RQ, because the newer xdpsock sample
|
|
|
|
* doesn't provide any Fill Ring entries at the setup stage.
|
|
|
|
*/
|
2019-03-05 05:46:16 -07:00
|
|
|
}
|
2016-12-20 13:48:19 -07:00
|
|
|
|
2018-03-28 04:26:50 -06:00
|
|
|
return err ? -ETIMEDOUT : 0;
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_deactivate_channels(struct mlx5e_channels *chs)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < chs->num; i++)
|
|
|
|
mlx5e_deactivate_channel(chs->c[i]);
|
|
|
|
}
|
|
|
|
|
2016-12-27 05:57:03 -07:00
|
|
|
void mlx5e_close_channels(struct mlx5e_channels *chs)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
|
|
|
int i;
|
2016-07-12 15:07:00 -06:00
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
for (i = 0; i < chs->num; i++)
|
|
|
|
mlx5e_close_channel(chs->c[i]);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
kfree(chs->c);
|
|
|
|
chs->num = 0;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
static int
|
|
|
|
mlx5e_create_rqt(struct mlx5e_priv *priv, int sz, struct mlx5e_rqt *rqt)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
void *rqtc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
2016-04-28 16:36:32 -06:00
|
|
|
u32 *in;
|
2016-12-19 14:20:17 -07:00
|
|
|
int i;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_rqt_in) + sizeof(u32) * sz;
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqtc = MLX5_ADDR_OF(create_rqt_in, in, rqt_context);
|
|
|
|
|
|
|
|
MLX5_SET(rqtc, rqtc, rqt_actual_size, sz);
|
|
|
|
MLX5_SET(rqtc, rqtc, rqt_max_size, sz);
|
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
for (i = 0; i < sz; i++)
|
|
|
|
MLX5_SET(rqtc, rqtc, rq_num[i], priv->drop_rq.rqn);
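/* Every entry initially points at the drop RQ; mlx5e_redirect_rqt() later
 * retargets the table at real channel RQs.
 */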
|
2015-07-23 14:35:56 -06:00
|
|
|
|
2016-07-01 05:51:06 -06:00
|
|
|
err = mlx5_core_create_rqt(mdev, in, inlen, &rqt->rqtn);
|
|
|
|
if (!err)
|
|
|
|
rqt->enabled = true;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
2016-04-28 16:36:32 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Introduce SRIOV VF representors
Implement the relevant profile functions to create an mlx5e driver instance
serving as a VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.
To do that, we also export a set of shared service functions from en_main.c,
such that they can be used by both the NIC and representor netdevs.
The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of its
own for phys port name.
The profiling infrastructure allows sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.
The VF reps and the PF which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supported is attr get for the
port parent ID which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g.
using tc, bridge or ovs.
Port phys name (ndo_get_phys_port_name) is implemented to export the VF
vport number to user-space; along with the switchdev port parent id
(phys_switch_id), this enables a udev-based consistent naming scheme:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:51:09 -06:00
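For example (the PF name and switch id below are made up for illustration),
with a PF named ens2f0 whose phys_switch_id reads e4397f872a34, the rule is
instantiated as:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e4397f872a34", \
	ATTR{phys_port_name}!="", NAME="ens2f0$attr{phys_port_name}"

so every VF representor belonging to that switch is renamed to the PF name
followed by its phys_port_name.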
|
|
|
void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt)
|
2016-04-28 16:36:32 -06:00
|
|
|
{
|
2016-07-01 05:51:06 -06:00
|
|
|
rqt->enabled = false;
|
|
|
|
mlx5_core_destroy_rqt(priv->mdev, rqt->rqtn);
|
2016-04-28 16:36:32 -06:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:56 -06:00
|
|
|
int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv)
|
2016-07-01 05:51:07 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_rqt *rqt = &priv->indir_rqt;
|
2017-04-12 21:36:56 -06:00
|
|
|
int err;
|
2016-07-01 05:51:07 -06:00
|
|
|
|
2017-04-12 21:36:56 -06:00
|
|
|
err = mlx5e_create_rqt(priv, MLX5E_INDIR_RQT_SIZE, rqt);
|
|
|
|
if (err)
|
|
|
|
mlx5_core_warn(priv->mdev, "create indirect rqts failed, %d\n", err);
|
|
|
|
return err;
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
|
2016-04-28 16:36:32 -06:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
int ix;
|
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (ix = 0; ix < priv->max_nch; ix++) {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_create_rqt(priv, 1 /*size */, &tirs[ix].rqt);
|
|
|
|
if (unlikely(err))
|
2016-04-28 16:36:32 -06:00
|
|
|
goto err_destroy_rqts;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_destroy_rqts:
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
mlx5_core_warn(priv->mdev, "create rqts failed, %d\n", err);
|
2016-04-28 16:36:32 -06:00
|
|
|
for (ix--; ix >= 0; ix--)
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_rqt(priv, &tirs[ix].rqt);
|
2016-04-28 16:36:32 -06:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
|
2017-04-12 21:36:56 -06:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (i = 0; i < priv->max_nch; i++)
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_rqt(priv, &tirs[i].rqt);
|
2017-04-12 21:36:56 -06:00
|
|
|
}
|
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
static int mlx5e_rx_hash_fn(int hfunc)
|
|
|
|
{
|
|
|
|
return (hfunc == ETH_RSS_HASH_TOP) ?
|
|
|
|
MLX5_RX_HASH_FN_TOEPLITZ :
|
|
|
|
MLX5_RX_HASH_FN_INVERTED_XOR8;
|
|
|
|
}
|
|
|
|
|
2017-11-26 11:39:12 -07:00
|
|
|
int mlx5e_bits_invert(unsigned long a, int size)
|
2016-12-19 14:20:17 -07:00
|
|
|
{
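	/* Reverse the low 'size' bits of 'a', e.g. mlx5e_bits_invert(1, 3) == 4
	 * (0b001 -> 0b100). In this file it is used to bit-reverse indirection
	 * table indices when the XOR8 RX hash function is selected.
	 */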
|
|
|
|
int inv = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < size; i++)
|
|
|
|
inv |= (test_bit(size - i - 1, &a) ? 1 : 0) << i;
|
|
|
|
|
|
|
|
return inv;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_fill_rqt_rqns(struct mlx5e_priv *priv, int sz,
|
|
|
|
struct mlx5e_redirect_rqt_param rrp, void *rqtc)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < sz; i++) {
|
|
|
|
u32 rqn;
|
|
|
|
|
|
|
|
if (rrp.is_rss) {
|
|
|
|
int ix = i;
|
|
|
|
|
|
|
|
if (rrp.rss.hfunc == ETH_RSS_HASH_XOR)
|
|
|
|
ix = mlx5e_bits_invert(i, ilog2(sz));
|
|
|
|
|
2018-11-06 12:05:29 -07:00
|
|
|
ix = priv->rss_params.indirection_rqt[ix];
|
2016-12-19 14:20:17 -07:00
|
|
|
rqn = rrp.rss.channels->c[ix]->rq.rqn;
|
|
|
|
} else {
|
|
|
|
rqn = rrp.rqn;
|
|
|
|
}
|
|
|
|
MLX5_SET(rqtc, rqtc, rq_num[i], rqn);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
|
|
|
|
struct mlx5e_redirect_rqt_param rrp)
|
2015-08-04 05:05:43 -06:00
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
void *rqtc;
|
|
|
|
int inlen;
|
2016-04-28 16:36:32 -06:00
|
|
|
u32 *in;
|
2015-08-04 05:05:43 -06:00
|
|
|
int err;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_rqt_in) + sizeof(u32) * sz;
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-08-04 05:05:43 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rqtc = MLX5_ADDR_OF(modify_rqt_in, in, ctx);
|
|
|
|
|
|
|
|
MLX5_SET(rqtc, rqtc, rqt_actual_size, sz);
|
|
|
|
MLX5_SET(modify_rqt_in, in, bitmask.rqn_list, 1);
|
2016-12-19 14:20:17 -07:00
|
|
|
mlx5e_fill_rqt_rqns(priv, sz, rrp, rqtc);
|
2016-04-28 16:36:32 -06:00
|
|
|
err = mlx5_core_modify_rqt(mdev, rqtn, in, inlen);
|
2015-08-04 05:05:43 -06:00
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
static u32 mlx5e_get_direct_rqn(struct mlx5e_priv *priv, int ix,
|
|
|
|
struct mlx5e_redirect_rqt_param rrp)
|
|
|
|
{
|
|
|
|
if (!rrp.is_rss)
|
|
|
|
return rrp.rqn;
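	/* When redirecting to channels, direct RQT entries past the active
	 * channel count are parked on the drop RQ.
	 */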
|
|
|
|
|
|
|
|
if (ix >= rrp.rss.channels->num)
|
|
|
|
return priv->drop_rq.rqn;
|
|
|
|
|
|
|
|
return rrp.rss.channels->c[ix]->rq.rqn;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_redirect_rqts(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_redirect_rqt_param rrp)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2016-04-28 16:36:32 -06:00
|
|
|
u32 rqtn;
|
|
|
|
int ix;
|
|
|
|
|
2016-07-01 05:51:06 -06:00
|
|
|
if (priv->indir_rqt.enabled) {
|
2016-12-19 14:20:17 -07:00
|
|
|
/* RSS RQ table */
|
2016-07-01 05:51:06 -06:00
|
|
|
rqtn = priv->indir_rqt.rqtn;
|
2016-12-19 14:20:17 -07:00
|
|
|
mlx5e_redirect_rqt(priv, rqtn, MLX5E_INDIR_RQT_SIZE, rrp);
|
2016-07-01 05:51:06 -06:00
|
|
|
}
|
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (ix = 0; ix < priv->max_nch; ix++) {
|
2016-12-19 14:20:17 -07:00
|
|
|
struct mlx5e_redirect_rqt_param direct_rrp = {
|
|
|
|
.is_rss = false,
|
2017-03-31 14:09:38 -06:00
|
|
|
{
|
|
|
|
.rqn = mlx5e_get_direct_rqn(priv, ix, rrp)
|
|
|
|
},
|
2016-12-19 14:20:17 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
/* Direct RQ Tables */
|
2016-07-01 05:51:06 -06:00
|
|
|
if (!priv->direct_tir[ix].rqt.enabled)
|
|
|
|
continue;
|
2016-12-19 14:20:17 -07:00
|
|
|
|
2016-07-01 05:51:06 -06:00
|
|
|
rqtn = priv->direct_tir[ix].rqt.rqtn;
|
2016-12-19 14:20:17 -07:00
|
|
|
mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
|
2016-04-28 16:36:32 -06:00
|
|
|
}
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
static void mlx5e_redirect_rqts_to_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *chs)
|
|
|
|
{
|
|
|
|
struct mlx5e_redirect_rqt_param rrp = {
|
|
|
|
.is_rss = true,
|
2017-03-31 14:09:38 -06:00
|
|
|
{
|
|
|
|
.rss = {
|
|
|
|
.channels = chs,
|
2018-11-06 12:05:29 -07:00
|
|
|
.hfunc = priv->rss_params.hfunc,
|
2017-03-31 14:09:38 -06:00
|
|
|
}
|
|
|
|
},
|
2016-12-19 14:20:17 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
mlx5e_redirect_rqts(priv, rrp);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_redirect_rqts_to_drop(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct mlx5e_redirect_rqt_param drop_rrp = {
|
|
|
|
.is_rss = false,
|
2017-03-31 14:09:38 -06:00
|
|
|
{
|
|
|
|
.rqn = priv->drop_rq.rqn,
|
|
|
|
},
|
2016-12-19 14:20:17 -07:00
|
|
|
};
|
|
|
|
|
|
|
|
mlx5e_redirect_rqts(priv, drop_rrp);
|
|
|
|
}
|
|
|
|
|
2018-10-28 08:22:57 -06:00
|
|
|
static const struct mlx5e_tirc_config tirc_default_config[MLX5E_NUM_INDIR_TIRS] = {
|
|
|
|
[MLX5E_TT_IPV4_TCP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
|
|
|
|
.l4_prot_type = MLX5_L4_PROT_TYPE_TCP,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_L4PORTS,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV6_TCP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
|
|
|
|
.l4_prot_type = MLX5_L4_PROT_TYPE_TCP,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_L4PORTS,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV4_UDP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
|
|
|
|
.l4_prot_type = MLX5_L4_PROT_TYPE_UDP,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_L4PORTS,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV6_UDP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
|
|
|
|
.l4_prot_type = MLX5_L4_PROT_TYPE_UDP,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_L4PORTS,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV4_IPSEC_AH] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV6_IPSEC_AH] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV4_IPSEC_ESP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV6_IPSEC_ESP] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP_IPSEC_SPI,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV4] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV4,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP,
|
|
|
|
},
|
|
|
|
[MLX5E_TT_IPV6] = { .l3_prot_type = MLX5_L3_PROT_TYPE_IPV6,
|
|
|
|
.l4_prot_type = 0,
|
|
|
|
.rx_hash_fields = MLX5_HASH_IP,
|
|
|
|
},
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mlx5e_tirc_config mlx5e_tirc_get_default_config(enum mlx5e_traffic_types tt)
|
|
|
|
{
|
|
|
|
return tirc_default_config[tt];
|
|
|
|
}
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
static void mlx5e_build_tir_ctx_lro(struct mlx5e_params *params, void *tirc)
|
2015-08-04 05:05:43 -06:00
|
|
|
{
|
2016-12-21 08:24:35 -07:00
|
|
|
if (!params->lro_en)
|
2015-08-04 05:05:43 -06:00
|
|
|
return;
|
|
|
|
|
|
|
|
#define ROUGH_MAX_L2_L3_HDR_SZ 256
|
|
|
|
|
|
|
|
MLX5_SET(tirc, tirc, lro_enable_mask,
|
|
|
|
MLX5_TIRC_LRO_ENABLE_MASK_IPV4_LRO |
|
|
|
|
MLX5_TIRC_LRO_ENABLE_MASK_IPV6_LRO);
|
|
|
|
MLX5_SET(tirc, tirc, lro_max_ip_payload_size,
|
2019-01-17 06:58:09 -07:00
|
|
|
(MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ - ROUGH_MAX_L2_L3_HDR_SZ) >> 8);
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_SET(tirc, tirc, lro_timeout_period_usecs, params->lro_timeout);
|
2015-08-04 05:05:43 -06:00
|
|
|
}
|
|
|
|
|
2018-11-06 12:05:29 -07:00
|
|
|
void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_rss_params *rss_params,
|
2018-10-28 08:22:57 -06:00
|
|
|
const struct mlx5e_tirc_config *ttconfig,
|
2017-08-13 07:22:38 -06:00
|
|
|
void *tirc, bool inner)
|
2016-02-29 12:17:12 -07:00
|
|
|
{
|
2017-08-13 07:22:38 -06:00
|
|
|
void *hfso = inner ? MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_inner) :
|
|
|
|
MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer);
|
2017-01-12 07:25:46 -07:00
|
|
|
|
2018-11-06 12:05:29 -07:00
|
|
|
MLX5_SET(tirc, tirc, rx_hash_fn, mlx5e_rx_hash_fn(rss_params->hfunc));
|
|
|
|
if (rss_params->hfunc == ETH_RSS_HASH_TOP) {
|
2016-02-29 12:17:12 -07:00
|
|
|
void *rss_key = MLX5_ADDR_OF(tirc, tirc,
|
|
|
|
rx_hash_toeplitz_key);
|
|
|
|
size_t len = MLX5_FLD_SZ_BYTES(tirc,
|
|
|
|
rx_hash_toeplitz_key);
|
|
|
|
|
|
|
|
MLX5_SET(tirc, tirc, rx_hash_symmetric, 1);
|
2018-11-06 12:05:29 -07:00
|
|
|
memcpy(rss_key, rss_params->toeplitz_hash_key, len);
|
2016-02-29 12:17:12 -07:00
|
|
|
}
|
2018-10-28 08:22:57 -06:00
|
|
|
MLX5_SET(rx_hash_field_select, hfso, l3_prot_type,
|
|
|
|
ttconfig->l3_prot_type);
|
|
|
|
MLX5_SET(rx_hash_field_select, hfso, l4_prot_type,
|
|
|
|
ttconfig->l4_prot_type);
|
|
|
|
MLX5_SET(rx_hash_field_select, hfso, selected_fields,
|
|
|
|
ttconfig->rx_hash_fields);
|
2016-02-29 12:17:12 -07:00
|
|
|
}
|
|
|
|
|
2018-10-23 07:03:33 -06:00
|
|
|
static void mlx5e_update_rx_hash_fields(struct mlx5e_tirc_config *ttconfig,
|
|
|
|
enum mlx5e_traffic_types tt,
|
|
|
|
u32 rx_hash_fields)
|
|
|
|
{
|
|
|
|
*ttconfig = tirc_default_config[tt];
|
|
|
|
ttconfig->rx_hash_fields = rx_hash_fields;
|
|
|
|
}
|
|
|
|
|
2018-10-23 01:02:08 -06:00
|
|
|
void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen)
|
|
|
|
{
|
|
|
|
void *tirc = MLX5_ADDR_OF(modify_tir_in, in, ctx);
|
2018-10-23 07:03:33 -06:00
|
|
|
struct mlx5e_rss_params *rss = &priv->rss_params;
|
2018-10-23 01:02:08 -06:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
int ctxlen = MLX5_ST_SZ_BYTES(tirc);
|
2018-10-23 07:03:33 -06:00
|
|
|
struct mlx5e_tirc_config ttconfig;
|
2018-10-23 01:02:08 -06:00
|
|
|
int tt;
|
|
|
|
|
|
|
|
MLX5_SET(modify_tir_in, in, bitmask.hash, 1);
|
|
|
|
|
|
|
|
for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
|
|
|
|
memset(tirc, 0, ctxlen);
|
2018-10-23 07:03:33 -06:00
|
|
|
mlx5e_update_rx_hash_fields(&ttconfig, tt,
|
|
|
|
rss->rx_hash_fields[tt]);
|
|
|
|
mlx5e_build_indir_tir_ctx_hash(rss, &ttconfig, tirc, false);
|
2018-10-23 01:02:08 -06:00
|
|
|
mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in, inlen);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!mlx5e_tunnel_inner_ft_supported(priv->mdev))
|
|
|
|
return;
|
|
|
|
|
|
|
|
for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
|
|
|
|
memset(tirc, 0, ctxlen);
|
2018-10-23 07:03:33 -06:00
|
|
|
mlx5e_update_rx_hash_fields(&ttconfig, tt,
|
|
|
|
rss->rx_hash_fields[tt]);
|
|
|
|
mlx5e_build_indir_tir_ctx_hash(rss, &ttconfig, tirc, true);
|
2018-10-23 01:02:08 -06:00
|
|
|
mlx5_core_modify_tir(mdev, priv->inner_indir_tir[tt].tirn, in,
|
|
|
|
inlen);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-02-29 12:17:10 -07:00
|
|
|
static int mlx5e_modify_tirs_lro(struct mlx5e_priv *priv)
|
2015-08-04 05:05:43 -06:00
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
void *in;
|
|
|
|
void *tirc;
|
|
|
|
int inlen;
|
|
|
|
int err;
|
2016-02-29 12:17:10 -07:00
|
|
|
int tt;
|
2016-04-28 16:36:32 -06:00
|
|
|
int ix;
|
2015-08-04 05:05:43 -06:00
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(modify_tir_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-08-04 05:05:43 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
MLX5_SET(modify_tir_in, in, bitmask.lro, 1);
|
|
|
|
tirc = MLX5_ADDR_OF(modify_tir_in, in, ctx);
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_tir_ctx_lro(&priv->channels.params, tirc);
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2016-04-28 16:36:32 -06:00
|
|
|
for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
|
2016-07-01 05:51:05 -06:00
|
|
|
err = mlx5_core_modify_tir(mdev, priv->indir_tir[tt].tirn, in,
|
2016-04-28 16:36:32 -06:00
|
|
|
inlen);
|
2016-02-29 12:17:10 -07:00
|
|
|
if (err)
|
2016-04-28 16:36:32 -06:00
|
|
|
goto free_in;
|
2016-02-29 12:17:10 -07:00
|
|
|
}
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (ix = 0; ix < priv->max_nch; ix++) {
|
2016-04-28 16:36:32 -06:00
|
|
|
err = mlx5_core_modify_tir(mdev, priv->direct_tir[ix].tirn,
|
|
|
|
in, inlen);
|
|
|
|
if (err)
|
|
|
|
goto free_in;
|
|
|
|
}
|
|
|
|
|
|
|
|
free_in:
|
2015-08-04 05:05:43 -06:00
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
static int mlx5e_set_mtu(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params, u16 mtu)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2018-03-12 06:24:41 -06:00
|
|
|
u16 hw_mtu = MLX5E_SW2HW_MTU(params, mtu);
|
2015-08-04 05:05:44 -06:00
|
|
|
int err;
|
|
|
|
|
2016-04-21 15:33:05 -06:00
|
|
|
err = mlx5_set_port_mtu(mdev, hw_mtu, 1);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2016-04-21 15:33:05 -06:00
|
|
|
/* Update vport context MTU */
|
|
|
|
mlx5_modify_nic_vport_mtu(mdev, hw_mtu);
|
|
|
|
return 0;
|
|
|
|
}
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
static void mlx5e_query_mtu(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params, u16 *mtu)
|
2016-04-21 15:33:05 -06:00
|
|
|
{
|
|
|
|
u16 hw_mtu = 0;
|
|
|
|
int err;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2016-04-21 15:33:05 -06:00
|
|
|
err = mlx5_query_nic_vport_mtu(mdev, &hw_mtu);
|
|
|
|
if (err || !hw_mtu) /* fallback to port oper mtu */
|
|
|
|
mlx5_query_port_oper_mtu(mdev, &hw_mtu, 1);
|
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
*mtu = MLX5E_HW2SW_MTU(params, hw_mtu);
|
2016-04-21 15:33:05 -06:00
|
|
|
}
|
|
|
|
|
2018-02-13 06:48:30 -07:00
|
|
|
int mlx5e_set_dev_port_mtu(struct mlx5e_priv *priv)
|
2016-04-21 15:33:05 -06:00
|
|
|
{
|
2018-03-12 06:24:41 -06:00
|
|
|
struct mlx5e_params *params = &priv->channels.params;
|
2017-02-12 16:19:14 -07:00
|
|
|
struct net_device *netdev = priv->netdev;
|
2018-03-12 06:24:41 -06:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2016-04-21 15:33:05 -06:00
|
|
|
u16 mtu;
|
|
|
|
int err;
|
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
err = mlx5e_set_mtu(mdev, params, params->sw_mtu);
|
2016-04-21 15:33:05 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
mlx5e_query_mtu(mdev, params, &mtu);
|
|
|
|
if (mtu != params->sw_mtu)
|
2016-04-21 15:33:05 -06:00
|
|
|
netdev_warn(netdev, "%s: VPort MTU %d is different than netdev mtu %d\n",
|
2018-03-12 06:24:41 -06:00
|
|
|
__func__, mtu, params->sw_mtu);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
params->sw_mtu = mtu;
|
2015-08-04 05:05:44 -06:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-01-22 04:42:10 -07:00
|
|
|
void mlx5e_set_netdev_mtu_boundaries(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct mlx5e_params *params = &priv->channels.params;
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
u16 max_mtu;
|
|
|
|
|
|
|
|
/* MTU range: 68 - hw-specific max */
|
|
|
|
netdev->min_mtu = ETH_MIN_MTU;
|
|
|
|
|
|
|
|
mlx5_query_port_max_mtu(mdev, &max_mtu, 1);
|
|
|
|
netdev->max_mtu = min_t(unsigned int, MLX5E_HW2SW_MTU(params, max_mtu),
|
|
|
|
ETH_MAX_MTU);
|
|
|
|
}
|
|
|
|
|
2016-02-22 09:17:26 -07:00
|
|
|
static void mlx5e_netdev_set_tcs(struct net_device *netdev)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2016-12-21 08:24:35 -07:00
|
|
|
int nch = priv->channels.params.num_channels;
|
|
|
|
int ntc = priv->channels.params.num_tc;
|
2016-02-22 09:17:26 -07:00
|
|
|
int tc;
|
|
|
|
|
|
|
|
netdev_reset_tc(netdev);
|
|
|
|
|
|
|
|
if (ntc == 1)
|
|
|
|
return;
|
|
|
|
|
|
|
|
netdev_set_num_tc(netdev, ntc);
|
|
|
|
|
2016-06-30 08:34:48 -06:00
|
|
|
/* Map netdev TCs to offset 0
|
|
|
|
* We have our own UP to TXQ mapping for QoS
|
|
|
|
*/
|
2016-02-22 09:17:26 -07:00
|
|
|
for (tc = 0; tc < ntc; tc++)
|
2016-06-30 08:34:48 -06:00
|
|
|
netdev_set_tc_queue(netdev, tc, nch, 0);
|
2016-02-22 09:17:26 -07:00
|
|
|
}
|
|
|
|
|
2018-05-29 01:54:47 -06:00
|
|
|
static void mlx5e_build_tc2txq_maps(struct mlx5e_priv *priv)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
|
|
|
int i, tc;
|
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (i = 0; i < priv->max_nch; i++)
|
2016-12-20 13:48:19 -07:00
|
|
|
for (tc = 0; tc < priv->profile->max_tc; tc++)
|
2019-07-14 02:43:43 -06:00
|
|
|
priv->channel_tc2txq[i][tc] = i + tc * priv->max_nch;
|
2018-05-29 01:54:47 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_build_tx2sq_maps(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct mlx5e_channel *c;
|
|
|
|
struct mlx5e_txqsq *sq;
|
|
|
|
int i, tc;
|
2016-12-20 13:48:19 -07:00
|
|
|
|
|
|
|
for (i = 0; i < priv->channels.num; i++) {
|
|
|
|
c = priv->channels.c[i];
|
|
|
|
for (tc = 0; tc < c->num_tc; tc++) {
|
|
|
|
sq = &c->sq[tc];
|
|
|
|
priv->txq2sq[sq->txq_ix] = sq;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:59 -06:00
|
|
|
void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
2017-02-07 07:35:49 -07:00
|
|
|
int num_txqs = priv->channels.num * priv->channels.params.num_tc;
|
2019-07-14 02:43:43 -06:00
|
|
|
int num_rxqs = priv->channels.num * priv->profile->rq_groups;
|
2017-02-07 07:35:49 -07:00
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
|
|
|
|
mlx5e_netdev_set_tcs(netdev);
|
2017-04-05 03:11:10 -06:00
|
|
|
netif_set_real_num_tx_queues(netdev, num_txqs);
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs run simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
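A minimal sketch of the split RQ ID namespace described above, under the assumption of simplified stand-in types and helper names (everything prefixed example_ is invented for illustration, not the driver's real API); it only shows the lower-half/upper-half convention:
/* Illustrative sketch only: models a queue ID namespace whose lower half
 * [0, num_channels) addresses regular RQs and whose upper half
 * [num_channels, 2 * num_channels) addresses XSK RQs. All example_* names
 * are assumptions made for this sketch, not the driver's real definitions.
 */
#include <stddef.h>

struct example_rq { int id; };

struct example_channel {
	struct example_rq rq;    /* regular receive queue */
	struct example_rq xskrq; /* AF_XDP zero-copy receive queue */
};

struct example_priv {
	unsigned int num_channels;
	struct example_channel *channels;
};

static struct example_rq *example_resolve_rq_id(struct example_priv *priv,
						unsigned int qid)
{
	unsigned int nch = priv->num_channels;

	if (qid < nch)		/* lower half: regular RQ of channel qid */
		return &priv->channels[qid].rq;
	if (qid < 2 * nch)	/* upper half: XSK RQ of channel qid - nch */
		return &priv->channels[qid - nch].xskrq;
	return NULL;		/* outside both halves */
}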
2019-06-26 08:35:38 -06:00
|
|
|
netif_set_real_num_rx_queues(netdev, num_rxqs);
|
2017-02-07 07:35:49 -07:00
|
|
|
|
2018-05-29 01:54:47 -06:00
|
|
|
mlx5e_build_tx2sq_maps(priv);
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_activate_channels(&priv->channels);
|
2019-02-11 17:27:02 -07:00
|
|
|
mlx5e_xdp_tx_enable(priv);
|
2016-12-20 13:48:19 -07:00
|
|
|
netif_tx_start_all_queues(priv->netdev);
|
2017-02-07 07:35:49 -07:00
|
|
|
|
2018-02-13 06:48:30 -07:00
|
|
|
if (mlx5e_is_vport_rep(priv))
|
2017-02-07 07:35:49 -07:00
|
|
|
mlx5e_add_sqs_fwd_rules(priv);
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_wait_channels_min_rx_wqes(&priv->channels);
|
2017-02-07 07:35:49 -07:00
|
|
|
mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
|
2019-06-26 08:35:38 -06:00
|
|
|
|
|
|
|
mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
|
2016-12-20 13:48:19 -07:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:59 -06:00
|
|
|
void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
|
2016-12-20 13:48:19 -07:00
|
|
|
{
|
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
|
|
|
|
|
2017-02-07 07:35:49 -07:00
|
|
|
mlx5e_redirect_rqts_to_drop(priv);
|
|
|
|
|
2018-02-13 06:48:30 -07:00
|
|
|
if (mlx5e_is_vport_rep(priv))
|
2017-02-07 07:35:49 -07:00
|
|
|
mlx5e_remove_sqs_fwd_rules(priv);
|
|
|
|
|
2016-12-20 13:48:19 -07:00
|
|
|
/* FIXME: This is a W/A only for tx timeout watch dog false alarm when
|
|
|
|
* polling for inactive tx queues.
|
|
|
|
*/
|
|
|
|
netif_tx_stop_all_queues(priv->netdev);
|
|
|
|
netif_tx_disable(priv->netdev);
|
2019-02-11 17:27:02 -07:00
|
|
|
mlx5e_xdp_tx_disable(priv);
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_deactivate_channels(&priv->channels);
|
|
|
|
}
|
|
|
|
|
2018-11-26 08:22:16 -07:00
|
|
|
static void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *new_chs,
|
|
|
|
mlx5e_fp_hw_modify hw_modify)
|
2016-12-27 05:57:03 -07:00
|
|
|
{
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
int new_num_txqs;
|
2017-05-18 05:32:11 -06:00
|
|
|
int carrier_ok;
|
2018-11-26 08:22:16 -07:00
|
|
|
|
2016-12-27 05:57:03 -07:00
|
|
|
new_num_txqs = new_chs->num * new_chs->params.num_tc;
|
|
|
|
|
2017-05-18 05:32:11 -06:00
|
|
|
carrier_ok = netif_carrier_ok(netdev);
|
2016-12-27 05:57:03 -07:00
|
|
|
netif_carrier_off(netdev);
|
|
|
|
|
|
|
|
if (new_num_txqs < netdev->real_num_tx_queues)
|
|
|
|
netif_set_real_num_tx_queues(netdev, new_num_txqs);
|
|
|
|
|
|
|
|
mlx5e_deactivate_priv_channels(priv);
|
|
|
|
mlx5e_close_channels(&priv->channels);
|
|
|
|
|
|
|
|
priv->channels = *new_chs;
|
|
|
|
|
2017-02-12 16:19:14 -07:00
|
|
|
/* New channels are ready to roll, modify HW settings if needed */
|
|
|
|
if (hw_modify)
|
|
|
|
hw_modify(priv);
|
|
|
|
|
net/mlx5e: Don't refresh TIRs when updating representor SQs
Refreshing TIRs is done in order to update the TIRs with the current
state of SQs in the transport domain, so that the TIRs can filter out
undesired self-loopback packets based on the source SQ of the packet.
Representor TIRs will only receive packets that originate from their
associated vport, due to dedicated steering, and therefore will never
receive self-loopback packets, whose source vport will be the vport of
the E-Switch manager, and therefore not the vport associated with the
representor. As such, it is not necessary to refresh the representors'
TIRs, since self-loopback packets can't reach them.
Since representors only exist in switchdev mode, and there is no
scenario in which a representor will exist in the transport domain
alongside a non-representor, it is not necessary to refresh the
transport domain's TIRs upon changing the state of a representor's
queues. Therefore, do not refresh TIRs upon such a change. Achieve
this by adding an update_rx callback to the mlx5e_profile, which
refreshes TIRs for non-representors and does nothing for representors,
and replace instances of mlx5e_refresh_tirs() upon changing the state
of the queues with update_rx().
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
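As a rough sketch of that design choice, assuming simplified stand-in types (the example_* names are invented for this illustration); only the idea of a per-profile update_rx callback comes from the description above, and the bodies are placeholders:
/* Illustrative sketch only: each profile supplies its own update_rx
 * callback, so refreshing TIRs on queue-state changes happens for the
 * NIC profile but is skipped for representors.
 */
struct example_priv { int dummy; };

struct example_profile {
	int (*update_rx)(struct example_priv *priv);
};

static int example_refresh_tirs(struct example_priv *priv)
{
	(void)priv;	/* placeholder for the real TIR refresh */
	return 0;
}

static int example_nic_update_rx(struct example_priv *priv)
{
	/* NIC: refresh TIRs so self-loopback filtering tracks current SQs */
	return example_refresh_tirs(priv);
}

static int example_rep_update_rx(struct example_priv *priv)
{
	/* Representor: dedicated steering, no self-loopback packets to filter */
	(void)priv;
	return 0;
}

static const struct example_profile example_nic_profile = {
	.update_rx = example_nic_update_rx,
};

static const struct example_profile example_rep_profile = {
	.update_rx = example_rep_update_rx,
};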
2019-05-23 00:58:56 -06:00
|
|
|
priv->profile->update_rx(priv);
|
2016-12-27 05:57:03 -07:00
|
|
|
mlx5e_activate_priv_channels(priv);
|
|
|
|
|
2017-05-18 05:32:11 -06:00
|
|
|
/* return carrier back if needed */
|
|
|
|
if (carrier_ok)
|
|
|
|
netif_carrier_on(netdev);
|
2016-12-27 05:57:03 -07:00
|
|
|
}
|
|
|
|
|
2018-11-26 08:22:16 -07:00
|
|
|
int mlx5e_safe_switch_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *new_chs,
|
|
|
|
mlx5e_fp_hw_modify hw_modify)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mlx5e_open_channels(priv, new_chs);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
|
|
|
mlx5e_switch_priv_channels(priv, new_chs, hw_modify);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-03-28 06:26:47 -06:00
|
|
|
int mlx5e_safe_reopen_channels(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct mlx5e_channels new_channels = {};
|
|
|
|
|
|
|
|
new_channels.params = priv->channels.params;
|
|
|
|
return mlx5e_safe_switch_channels(priv, &new_channels, NULL);
|
|
|
|
}
|
|
|
|
|
2018-01-08 01:01:04 -07:00
|
|
|
void mlx5e_timestamp_init(struct mlx5e_priv *priv)
|
2017-08-15 04:46:04 -06:00
|
|
|
{
|
|
|
|
priv->tstamp.tx_type = HWTSTAMP_TX_OFF;
|
|
|
|
priv->tstamp.rx_filter = HWTSTAMP_FILTER_NONE;
|
|
|
|
}
|
|
|
|
|
2015-08-04 05:05:44 -06:00
|
|
|
int mlx5e_open_locked(struct net_device *netdev)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2019-06-26 08:35:38 -06:00
|
|
|
bool is_xdp = priv->channels.params.xdp_prog;
|
2015-08-04 05:05:44 -06:00
|
|
|
int err;
|
|
|
|
|
|
|
|
set_bit(MLX5E_STATE_OPENED, &priv->state);
|
2019-06-26 08:35:38 -06:00
|
|
|
if (is_xdp)
|
|
|
|
mlx5e_xdp_set_open(priv);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
err = mlx5e_open_channels(priv, &priv->channels);
|
2016-12-20 13:48:19 -07:00
|
|
|
if (err)
|
2015-09-25 01:49:09 -06:00
|
|
|
goto err_clear_state_opened_flag;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2019-05-23 00:58:56 -06:00
|
|
|
priv->profile->update_rx(priv);
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_activate_priv_channels(priv);
|
2017-05-18 05:32:11 -06:00
|
|
|
if (priv->profile->update_carrier)
|
|
|
|
priv->profile->update_carrier(priv);
|
2017-02-07 07:30:52 -07:00
|
|
|
|
2018-09-12 00:45:33 -06:00
|
|
|
mlx5e_queue_update_stats(priv);
|
2015-08-04 05:05:46 -06:00
|
|
|
return 0;
|
2015-09-25 01:49:09 -06:00
|
|
|
|
|
|
|
err_clear_state_opened_flag:
|
2019-06-26 08:35:38 -06:00
|
|
|
if (is_xdp)
|
|
|
|
mlx5e_xdp_set_closed(priv);
|
2015-09-25 01:49:09 -06:00
|
|
|
clear_bit(MLX5E_STATE_OPENED, &priv->state);
|
|
|
|
return err;
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Introduce SRIOV VF representors
Implement the relevant profile functions to create an mlx5e driver instance
serving as VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.
To do that, we also export a set of shared service functions from en_main.c,
such that they can be used by both NIC and representor netdevs.
The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of its
own for phys port name.
The profiling infrastructure allows sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.
The VF reps and the PF, which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supported is attr get for the
port parent ID, which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g.
using tc, bridge or ovs.
Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
to user-space the VF vport number and, along with the switchdev port parent
id (phys_switch_id), enable a udev-based consistent naming scheme:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
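A hedged sketch of the naming idea: the format string and the vport field below are assumptions made for illustration, not the representor's actual phys port name scheme; udev would combine such a name with phys_switch_id as in the rule above.
/* Illustrative sketch only: a representor's phys port name is derived from
 * the VF vport it represents. The "pf0vf%u" format is an assumed example,
 * and all example_* names are invented for this sketch.
 */
#include <stdio.h>

struct example_rep {
	unsigned int vport;	/* VF vport represented by this netdev */
};

static int example_get_phys_port_name(const struct example_rep *rep,
				      char *buf, size_t len)
{
	int n = snprintf(buf, len, "pf0vf%u", rep->vport);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}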
2016-07-01 05:51:09 -06:00
|
|
|
int mlx5e_open(struct net_device *netdev)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
|
|
|
err = mlx5e_open_locked(netdev);
|
2017-02-05 08:57:40 -07:00
|
|
|
if (!err)
|
|
|
|
mlx5_set_port_admin_status(priv->mdev, MLX5_PORT_UP);
|
2015-08-04 05:05:44 -06:00
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
2018-05-09 14:28:00 -06:00
|
|
|
if (mlx5_vxlan_allowed(priv->mdev->vxlan))
|
2018-03-20 06:44:40 -06:00
|
|
|
udp_tunnel_get_rx_info(netdev);
|
|
|
|
|
2015-08-04 05:05:44 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
int mlx5e_close_locked(struct net_device *netdev)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
|
2015-11-02 23:07:18 -07:00
|
|
|
/* May already be CLOSED in case a previous configuration operation
|
|
|
|
* (e.g RX/TX queue size change) that involves close&open failed.
|
|
|
|
*/
|
|
|
|
if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
|
|
|
|
return 0;
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
if (priv->channels.params.xdp_prog)
|
|
|
|
mlx5e_xdp_set_closed(priv);
|
2015-08-04 05:05:44 -06:00
|
|
|
clear_bit(MLX5E_STATE_OPENED, &priv->state);
|
|
|
|
|
|
|
|
netif_carrier_off(priv->netdev);
|
2016-12-20 13:48:19 -07:00
|
|
|
mlx5e_deactivate_priv_channels(priv);
|
|
|
|
mlx5e_close_channels(&priv->channels);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-07-01 05:51:09 -06:00
|
|
|
int mlx5e_close(struct net_device *netdev)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
int err;
|
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
if (!netif_device_present(netdev))
|
|
|
|
return -ENODEV;
|
|
|
|
|
2015-08-04 05:05:44 -06:00
|
|
|
mutex_lock(&priv->state_lock);
|
2017-02-05 08:57:40 -07:00
|
|
|
mlx5_set_port_admin_status(priv->mdev, MLX5_PORT_DOWN);
|
2015-08-04 05:05:44 -06:00
|
|
|
err = mlx5e_close_locked(netdev);
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
static int mlx5e_alloc_drop_rq(struct mlx5_core_dev *mdev,
|
net/mlx5e: Proper names for SQ/RQ/CQ functions
Rename mlx5e_{create,destroy}_{sq,rq,cq} to
mlx5e_{alloc,free}_{sq,rq,cq}.
Rename mlx5e_{enable,disable}_{sq,rq,cq} to
mlx5e_{create,destroy}_{sq,rq,cq}.
mlx5e_{enable,disable}_{sq,rq,cq} used to actually create/destroy the SQ
in FW, so we rename them to align the functions names with FW semantics.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 15:52:12 -06:00
|
|
|
struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_rq_param *param)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
|
|
|
void *rqc = param->rqc;
|
|
|
|
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
param->wq.db_numa_node = param->wq.buf_numa_node;
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
err = mlx5_wq_cyc_create(mdev, ¶m->wq, rqc_wq, &rq->wqe.wq,
|
|
|
|
&rq->wq_ctrl);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2018-01-03 03:25:18 -07:00
|
|
|
/* Mark as unused given "Drop-RQ" packets never reach XDP */
|
|
|
|
xdp_rxq_info_unused(&rq->xdp_rxq);
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
rq->mdev = mdev;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
static int mlx5e_alloc_drop_cq(struct mlx5_core_dev *mdev,
|
2017-03-24 15:52:12 -06:00
|
|
|
struct mlx5e_cq *cq,
|
|
|
|
struct mlx5e_cq_param *param)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2019-04-29 12:14:05 -06:00
|
|
|
param->wq.buf_numa_node = dev_to_node(mdev->device);
|
|
|
|
param->wq.db_numa_node = dev_to_node(mdev->device);
|
2018-01-25 09:00:41 -07:00
|
|
|
|
2017-03-28 02:23:55 -06:00
|
|
|
return mlx5e_alloc_cq_common(mdev, param, cq);
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_rq *drop_rq)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2018-02-08 06:09:57 -07:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5e_cq_param cq_param = {};
|
|
|
|
struct mlx5e_rq_param rq_param = {};
|
|
|
|
struct mlx5e_cq *cq = &drop_rq->cq;
|
2015-08-04 05:05:44 -06:00
|
|
|
int err;
|
|
|
|
|
2018-02-08 06:09:57 -07:00
|
|
|
mlx5e_build_drop_rq_param(priv, &rq_param);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_alloc_drop_cq(mdev, cq, &cq_param);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err = mlx5e_create_cq(cq, &cq_param);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_free_cq;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_alloc_drop_rq(mdev, drop_rq, &rq_param);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_destroy_cq;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
err = mlx5e_create_rq(drop_rq, &rq_param);
|
2015-08-04 05:05:44 -06:00
|
|
|
if (err)
|
2017-03-24 15:52:12 -06:00
|
|
|
goto err_free_rq;
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2018-02-08 06:09:57 -07:00
|
|
|
err = mlx5e_modify_rq_state(drop_rq, MLX5_RQC_STATE_RST, MLX5_RQC_STATE_RDY);
|
|
|
|
if (err)
|
|
|
|
mlx5_core_warn(priv->mdev, "modify_rq_state failed, rx_if_down_packets won't be counted %d\n", err);
|
|
|
|
|
2015-08-04 05:05:44 -06:00
|
|
|
return 0;
|
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err_free_rq:
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_free_rq(drop_rq);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
|
|
|
err_destroy_cq:
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_cq(cq);
|
2015-08-04 05:05:44 -06:00
|
|
|
|
2017-03-24 15:52:12 -06:00
|
|
|
err_free_cq:
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_free_cq(cq);
|
2017-03-24 15:52:12 -06:00
|
|
|
|
2015-08-04 05:05:44 -06:00
|
|
|
return err;
|
|
|
|
}
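Editor's note: mlx5e_open_drop_rq() above is a compact example of the driver's split between alloc/free (software state only) and create/destroy (objects in FW), combined with a goto ladder that releases only what was already set up, in reverse order. Below is a minimal, self-contained sketch of that pattern; the alloc_*/create_* stubs are hypothetical stand-ins for the mlx5e helpers, not driver code.

static int alloc_cq(void)    { return 0; }	/* hypothetical stubs */
static int create_cq(void)   { return 0; }
static int alloc_rq(void)    { return 0; }
static int create_rq(void)   { return 0; }
static void free_rq(void)    { }
static void destroy_cq(void) { }
static void free_cq(void)    { }

static int open_drop_rq_sketch(void)
{
	int err;

	err = alloc_cq();		/* software state only */
	if (err)
		return err;

	err = create_cq();		/* object created in FW */
	if (err)
		goto err_free_cq;

	err = alloc_rq();
	if (err)
		goto err_destroy_cq;

	err = create_rq();
	if (err)
		goto err_free_rq;

	return 0;			/* caller later tears down in reverse order */

err_free_rq:
	free_rq();
err_destroy_cq:
	destroy_cq();
err_free_cq:
	free_cq();
	return err;
}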
|
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2017-03-14 11:43:52 -06:00
|
|
|
mlx5e_destroy_rq(drop_rq);
|
|
|
|
mlx5e_free_rq(drop_rq);
|
|
|
|
mlx5e_destroy_cq(&drop_rq->cq);
|
|
|
|
mlx5e_free_cq(&drop_rq->cq);
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2019-07-05 09:30:20 -06:00
|
|
|
int mlx5e_create_tis(struct mlx5_core_dev *mdev, void *in, u32 *tisn)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
|
|
|
void *tisc = MLX5_ADDR_OF(create_tis_in, in, ctx);
|
|
|
|
|
2016-07-01 05:51:04 -06:00
|
|
|
MLX5_SET(tisc, tisc, transport_domain, mdev->mlx5e_res.td.tdn);
|
2016-05-30 09:31:13 -06:00
|
|
|
|
2019-07-05 09:30:22 -06:00
|
|
|
if (MLX5_GET(tisc, tisc, tls_en))
|
|
|
|
MLX5_SET(tisc, tisc, pd, mdev->mlx5e_res.pdn);
|
|
|
|
|
2016-05-30 09:31:13 -06:00
|
|
|
if (mlx5_lag_is_lacp_owner(mdev))
|
|
|
|
MLX5_SET(tisc, tisc, strict_lag_tx_port_affinity, 1);
|
|
|
|
|
2019-07-05 09:30:20 -06:00
|
|
|
return mlx5_core_create_tis(mdev, in, MLX5_ST_SZ_BYTES(create_tis_in), tisn);
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:58 -06:00
|
|
|
void mlx5e_destroy_tis(struct mlx5_core_dev *mdev, u32 tisn)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2017-04-12 21:36:58 -06:00
|
|
|
mlx5_core_destroy_tis(mdev, tisn);
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2019-06-24 03:03:02 -06:00
|
|
|
void mlx5e_destroy_tises(struct mlx5e_priv *priv)
|
|
|
|
{
|
2019-08-07 08:46:15 -06:00
|
|
|
int tc, i;
|
2019-06-24 03:03:02 -06:00
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
for (i = 0; i < mlx5e_get_num_lag_ports(priv->mdev); i++)
|
|
|
|
for (tc = 0; tc < priv->profile->max_tc; tc++)
|
|
|
|
mlx5e_destroy_tis(priv->mdev, priv->tisn[i][tc]);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool mlx5e_lag_should_assign_affinity(struct mlx5_core_dev *mdev)
|
|
|
|
{
|
|
|
|
return MLX5_CAP_GEN(mdev, lag_tx_port_affinity) && mlx5e_get_num_lag_ports(mdev) > 1;
|
2019-06-24 03:03:02 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Introduce SRIOV VF representors
Implement the relevant profile functions to create an mlx5e driver instance
serving as VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.
To do that, we also export a set of shared service functions from en_main.c,
such that they can be used by both NIC and representor netdevs.
The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of its
own for phys port name.
The profiling infrastructure allows sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.
The VF reps and the PF, which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supported is attr get for the
port parent ID, which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g.
using tc, bridge or ovs.
Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
to user-space the VF vport number and, along with the switchdev port parent
id (phys_switch_id), enables a udev-based consistent naming scheme:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:51:09 -06:00
|
|
|
int mlx5e_create_tises(struct mlx5e_priv *priv)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2019-08-07 08:46:15 -06:00
|
|
|
int tc, i;
|
2015-08-04 05:05:44 -06:00
|
|
|
int err;
|
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
for (i = 0; i < mlx5e_get_num_lag_ports(priv->mdev); i++) {
|
|
|
|
for (tc = 0; tc < priv->profile->max_tc; tc++) {
|
|
|
|
u32 in[MLX5_ST_SZ_DW(create_tis_in)] = {};
|
|
|
|
void *tisc;
|
2019-07-05 09:30:20 -06:00
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
tisc = MLX5_ADDR_OF(create_tis_in, in, ctx);
|
2019-07-05 09:30:20 -06:00
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
MLX5_SET(tisc, tisc, prio, tc << 1);
|
2019-07-05 09:30:20 -06:00
|
|
|
|
2019-08-07 08:46:15 -06:00
|
|
|
if (mlx5e_lag_should_assign_affinity(priv->mdev))
|
|
|
|
MLX5_SET(tisc, tisc, lag_tx_port_affinity, i + 1);
|
|
|
|
|
|
|
|
err = mlx5e_create_tis(priv->mdev, in, &priv->tisn[i][tc]);
|
|
|
|
if (err)
|
|
|
|
goto err_close_tises;
|
|
|
|
}
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_close_tises:
|
2019-08-07 08:46:15 -06:00
|
|
|
for (; i >= 0; i--) {
|
|
|
|
for (tc--; tc >= 0; tc--)
|
|
|
|
mlx5e_destroy_tis(priv->mdev, priv->tisn[i][tc]);
|
|
|
|
tc = priv->profile->max_tc;
|
|
|
|
}
|
2015-08-04 05:05:44 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
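Editor's note: the error path of mlx5e_create_tises() above unwinds a two-dimensional creation loop: on a failure at (i, tc) it first destroys the TCs already created for port i, then the full row of every earlier port. A standalone sketch of that unwind logic follows, with hypothetical create_one()/destroy_one() stubs standing in for mlx5e_create_tis()/mlx5e_destroy_tis().

#define N_PORTS 2
#define N_TCS   8

static int create_one(int i, int tc)   { (void)i; (void)tc; return 0; }
static void destroy_one(int i, int tc) { (void)i; (void)tc; }

static int create_grid(void)
{
	int i, tc, err;

	for (i = 0; i < N_PORTS; i++) {
		for (tc = 0; tc < N_TCS; tc++) {
			err = create_one(i, tc);
			if (err)
				goto err_unwind;
		}
	}
	return 0;

err_unwind:
	for (; i >= 0; i--) {
		/* current port: only entries [0, tc) were created */
		for (tc--; tc >= 0; tc--)
			destroy_one(i, tc);
		/* earlier ports: destroy the full row */
		tc = N_TCS;
	}
	return err;
}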
|
|
|
|
|
2018-02-13 06:48:30 -07:00
|
|
|
static void mlx5e_cleanup_nic_tx(struct mlx5e_priv *priv)
|
2015-08-04 05:05:44 -06:00
|
|
|
{
|
2019-06-24 03:03:02 -06:00
|
|
|
mlx5e_destroy_tises(priv);
|
2015-08-04 05:05:44 -06:00
|
|
|
}
|
|
|
|
|
2019-01-16 05:31:22 -07:00
|
|
|
static void mlx5e_build_indir_tir_ctx_common(struct mlx5e_priv *priv,
|
|
|
|
u32 rqtn, u32 *tirc)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2016-07-01 05:51:04 -06:00
|
|
|
MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn);
|
2019-01-16 05:31:22 -07:00
|
|
|
MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT);
|
|
|
|
MLX5_SET(tirc, tirc, indirect_table, rqtn);
|
2019-01-20 02:04:34 -07:00
|
|
|
MLX5_SET(tirc, tirc, tunneled_offload_en,
|
|
|
|
priv->channels.params.tunneled_offload_en);
|
2015-06-11 05:47:33 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_tir_ctx_lro(&priv->channels.params, tirc);
|
2019-01-16 05:31:22 -07:00
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-01-16 05:31:22 -07:00
|
|
|
static void mlx5e_build_indir_tir_ctx(struct mlx5e_priv *priv,
|
|
|
|
enum mlx5e_traffic_types tt,
|
|
|
|
u32 *tirc)
|
|
|
|
{
|
|
|
|
mlx5e_build_indir_tir_ctx_common(priv, priv->indir_rqt.rqtn, tirc);
|
2018-11-06 12:05:29 -07:00
|
|
|
mlx5e_build_indir_tir_ctx_hash(&priv->rss_params,
|
2018-10-28 08:22:57 -06:00
|
|
|
&tirc_default_config[tt], tirc, false);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
static void mlx5e_build_direct_tir_ctx(struct mlx5e_priv *priv, u32 rqtn, u32 *tirc)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2019-01-16 05:31:22 -07:00
|
|
|
mlx5e_build_indir_tir_ctx_common(priv, rqtn, tirc);
|
2016-04-28 16:36:32 -06:00
|
|
|
MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8);
|
|
|
|
}
|
|
|
|
|
2019-01-16 05:31:22 -07:00
|
|
|
static void mlx5e_build_inner_indir_tir_ctx(struct mlx5e_priv *priv,
|
|
|
|
enum mlx5e_traffic_types tt,
|
|
|
|
u32 *tirc)
|
|
|
|
{
|
|
|
|
mlx5e_build_indir_tir_ctx_common(priv, priv->indir_rqt.rqtn, tirc);
|
|
|
|
mlx5e_build_indir_tir_ctx_hash(&priv->rss_params,
|
|
|
|
&tirc_default_config[tt], tirc, true);
|
|
|
|
}
|
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc)
|
2016-04-28 16:36:32 -06:00
|
|
|
{
|
2016-07-01 05:51:05 -06:00
|
|
|
struct mlx5e_tir *tir;
|
2015-05-28 13:28:48 -06:00
|
|
|
void *tirc;
|
|
|
|
int inlen;
|
2017-08-13 07:22:38 -06:00
|
|
|
int i = 0;
|
2015-05-28 13:28:48 -06:00
|
|
|
int err;
|
2016-04-28 16:36:32 -06:00
|
|
|
u32 *in;
|
|
|
|
int tt;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_tir_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2016-04-28 16:36:32 -06:00
|
|
|
for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++) {
|
|
|
|
memset(in, 0, inlen);
|
2016-07-01 05:51:05 -06:00
|
|
|
tir = &priv->indir_tir[tt];
|
2016-04-28 16:36:32 -06:00
|
|
|
tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
|
2016-12-21 08:24:35 -07:00
|
|
|
mlx5e_build_indir_tir_ctx(priv, tt, tirc);
|
2016-07-01 05:51:05 -06:00
|
|
|
err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
|
2017-08-13 07:22:38 -06:00
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(priv->mdev, "create indirect tirs failed, %d\n", err);
|
|
|
|
goto err_destroy_inner_tirs;
|
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
if (!inner_ttc || !mlx5e_tunnel_inner_ft_supported(priv->mdev))
|
2017-08-13 07:22:38 -06:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++) {
|
|
|
|
memset(in, 0, inlen);
|
|
|
|
tir = &priv->inner_indir_tir[i];
|
|
|
|
tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
|
|
|
|
mlx5e_build_inner_indir_tir_ctx(priv, i, tirc);
|
|
|
|
err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(priv->mdev, "create inner indirect tirs failed, %d\n", err);
|
|
|
|
goto err_destroy_inner_tirs;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2016-07-01 05:51:07 -06:00
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
2017-08-13 07:22:38 -06:00
|
|
|
err_destroy_inner_tirs:
|
|
|
|
for (i--; i >= 0; i--)
|
|
|
|
mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
for (tt--; tt >= 0; tt--)
|
|
|
|
mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[tt]);
|
|
|
|
|
|
|
|
kvfree(in);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
|
2016-07-01 05:51:07 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_tir *tir;
|
|
|
|
void *tirc;
|
|
|
|
int inlen;
|
2019-06-26 08:35:38 -06:00
|
|
|
int err = 0;
|
2016-07-01 05:51:07 -06:00
|
|
|
u32 *in;
|
|
|
|
int ix;
|
|
|
|
|
|
|
|
inlen = MLX5_ST_SZ_BYTES(create_tir_in);
|
2017-05-10 12:32:18 -06:00
|
|
|
in = kvzalloc(inlen, GFP_KERNEL);
|
2016-07-01 05:51:07 -06:00
|
|
|
if (!in)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2019-07-14 02:43:43 -06:00
|
|
|
for (ix = 0; ix < priv->max_nch; ix++) {
|
2016-04-28 16:36:32 -06:00
|
|
|
memset(in, 0, inlen);
|
2019-06-26 08:35:38 -06:00
|
|
|
tir = &tirs[ix];
|
2016-04-28 16:36:32 -06:00
|
|
|
tirc = MLX5_ADDR_OF(create_tir_in, in, ctx);
|
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_build_direct_tir_ctx(priv, tir->rqt.rqtn, tirc);
|
2016-07-01 05:51:05 -06:00
|
|
|
err = mlx5e_create_tir(priv->mdev, tir, in, inlen);
|
2019-06-26 08:35:38 -06:00
|
|
|
if (unlikely(err))
|
2016-04-28 16:36:32 -06:00
|
|
|
goto err_destroy_ch_tirs;
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
goto out;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-04-28 16:36:32 -06:00
|
|
|
err_destroy_ch_tirs:
|
2019-06-26 08:35:38 -06:00
|
|
|
mlx5_core_warn(priv->mdev, "create tirs failed, %d\n", err);
|
2016-04-28 16:36:32 -06:00
|
|
|
for (ix--; ix >= 0; ix--)
|
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_tir(priv->mdev, &tirs[ix]);
|
2016-04-28 16:36:32 -06:00
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
out:
|
2016-04-28 16:36:32 -06:00
|
|
|
kvfree(in);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
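Editor's note: the XSK commit message above describes regular and XSK RQs sharing a single ID namespace split into two halves, the lower half for regular RQs and the upper half for XSK RQs. The helpers below are a hypothetical illustration of that split only; the names and exact layout are assumptions, not the driver's actual mapping code.

#include <stdbool.h>

/* Hypothetical split namespace: regular RQ IDs in [0, num_channels),
 * XSK RQ IDs in [num_channels, 2 * num_channels).
 */
static inline unsigned int rq_id_from_channel(unsigned int ch_ix,
					      unsigned int num_channels,
					      bool is_xsk)
{
	return is_xsk ? num_channels + ch_ix : ch_ix;
}

static inline bool rq_id_is_xsk(unsigned int rq_id, unsigned int num_channels)
{
	return rq_id >= num_channels;
}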
|
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2016-04-28 16:36:32 -06:00
|
|
|
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
|
2016-07-01 05:51:05 -06:00
|
|
|
mlx5e_destroy_tir(priv->mdev, &priv->indir_tir[i]);
|
2017-08-13 07:22:38 -06:00
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
if (!inner_ttc || !mlx5e_tunnel_inner_ft_supported(priv->mdev))
|
2017-08-13 07:22:38 -06:00
|
|
|
return;
|
|
|
|
|
|
|
|
for (i = 0; i < MLX5E_NUM_INDIR_TIRS; i++)
|
|
|
|
mlx5e_destroy_tir(priv->mdev, &priv->inner_indir_tir[i]);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
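As a rough illustration of two points above -- the split RQ ID namespace
(lower half regular RQs, upper half XSK RQs) and the UMEM pointers kept in
the netdev private context -- a minimal sketch follows. The helper and
struct names are hypothetical, not the driver's actual symbols.

/* Hypothetical sketch, not the driver's real code: with one queue-id
 * namespace split into two halves, ids below the channel count refer to
 * regular RQs and ids in the upper half refer to the XSK RQ of channel
 * (qid - num_channels).
 */
static inline bool sketch_qid_is_xsk(u16 qid, u16 num_channels)
{
	return qid >= num_channels;
}

static inline u16 sketch_qid_to_channel_ix(u16 qid, u16 num_channels)
{
	return sketch_qid_is_xsk(qid, num_channels) ? qid - num_channels : qid;
}

/* Hypothetical layout for the zero-copy UMEM pointers stored in the
 * netdev private context: one slot per channel, NULL when no zero-copy
 * UMEM is bound, plus a refcount used to refuse incompatible settings
 * (such as LRO) while any XSK is active. Access is protected by state_lock.
 */
struct sketch_xsk_state {
	struct xdp_umem **umems;	/* indexed by channel, NULL = no XSK */
	u16 refcnt;			/* number of active zero-copy UMEMs */
};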
void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs)
{
	int i;

	for (i = 0; i < priv->max_nch; i++)
		mlx5e_destroy_tir(priv->mdev, &tirs[i]);
}
static int mlx5e_modify_channels_scatter_fcs(struct mlx5e_channels *chs, bool enable)
{
	int err = 0;
	int i;

	for (i = 0; i < chs->num; i++) {
		err = mlx5e_modify_rq_scatter_fcs(&chs->c[i]->rq, enable);
		if (err)
			return err;
	}

	return 0;
}
static int mlx5e_modify_channels_vsd(struct mlx5e_channels *chs, bool vsd)
{
	int err = 0;
	int i;

	for (i = 0; i < chs->num; i++) {
		err = mlx5e_modify_rq_vsd(&chs->c[i]->rq, vsd);
		if (err)
			return err;
	}

	return 0;
}
static int mlx5e_setup_tc_mqprio(struct mlx5e_priv *priv,
				 struct tc_mqprio_qopt *mqprio)
{
	struct mlx5e_channels new_channels = {};
	u8 tc = mqprio->num_tc;
	int err = 0;

	mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;

	if (tc && tc != MLX5E_MAX_NUM_TC)
		return -EINVAL;

	mutex_lock(&priv->state_lock);

	new_channels.params = priv->channels.params;
	new_channels.params.num_tc = tc ? tc : 1;

	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
		priv->channels.params = new_channels.params;
		goto out;
	}

	err = mlx5e_safe_switch_channels(priv, &new_channels, NULL);
	if (err)
		goto out;

	priv->max_opened_tc = max_t(u8, priv->max_opened_tc,
				    new_channels.params.num_tc);
out:
	mutex_unlock(&priv->state_lock);
	return err;
}
#ifdef CONFIG_MLX5_ESWITCH
static int mlx5e_setup_tc_cls_flower(struct mlx5e_priv *priv,
				     struct flow_cls_offload *cls_flower,
				     unsigned long flags)
{
	switch (cls_flower->command) {
	case FLOW_CLS_REPLACE:
		return mlx5e_configure_flower(priv->netdev, priv, cls_flower,
					      flags);
	case FLOW_CLS_DESTROY:
		return mlx5e_delete_flower(priv->netdev, priv, cls_flower,
					   flags);
	case FLOW_CLS_STATS:
		return mlx5e_stats_flower(priv->netdev, priv, cls_flower,
					  flags);
	default:
		return -EOPNOTSUPP;
	}
}
static int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
				   void *cb_priv)
{
	unsigned long flags = MLX5_TC_FLAG(INGRESS) | MLX5_TC_FLAG(NIC_OFFLOAD);
	struct mlx5e_priv *priv = cb_priv;

	switch (type) {
	case TC_SETUP_CLSFLOWER:
		return mlx5e_setup_tc_cls_flower(priv, type_data, flags);
	default:
		return -EOPNOTSUPP;
	}
}
#endif
static LIST_HEAD(mlx5e_block_cb_list);

static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
			  void *type_data)
{
	struct mlx5e_priv *priv = netdev_priv(dev);

	switch (type) {
#ifdef CONFIG_MLX5_ESWITCH
	case TC_SETUP_BLOCK: {
		struct flow_block_offload *f = type_data;

		f->unlocked_driver_cb = true;
		return flow_block_cb_setup_simple(type_data,
						  &mlx5e_block_cb_list,
						  mlx5e_setup_tc_block_cb,
						  priv, priv, true);
	}
#endif
	case TC_SETUP_QDISC_MQPRIO:
		return mlx5e_setup_tc_mqprio(priv, type_data);
	default:
		return -EOPNOTSUPP;
	}
}
void mlx5e_fold_sw_stats64(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s)
{
	int i;

	for (i = 0; i < priv->max_nch; i++) {
		struct mlx5e_channel_stats *channel_stats = &priv->channel_stats[i];
		struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq;
		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
		int j;

		s->rx_packets += rq_stats->packets + xskrq_stats->packets;
		s->rx_bytes += rq_stats->bytes + xskrq_stats->bytes;

		for (j = 0; j < priv->max_opened_tc; j++) {
			struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];

			s->tx_packets += sq_stats->packets;
			s->tx_bytes += sq_stats->bytes;
			s->tx_dropped += sq_stats->dropped;
		}
	}
}
void
mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
{
	struct mlx5e_priv *priv = netdev_priv(dev);
	struct mlx5e_vport_stats *vstats = &priv->stats.vport;
	struct mlx5e_pport_stats *pstats = &priv->stats.pport;

	if (!mlx5e_monitor_counter_supported(priv)) {
		/* update HW stats in background for next time */
		mlx5e_queue_update_stats(priv);
	}

	if (mlx5e_is_uplink_rep(priv)) {
		stats->rx_packets = PPORT_802_3_GET(pstats, a_frames_received_ok);
		stats->rx_bytes = PPORT_802_3_GET(pstats, a_octets_received_ok);
		stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
		stats->tx_bytes = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
	} else {
		mlx5e_fold_sw_stats64(priv, stats);
	}

	stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;

	stats->rx_length_errors =
		PPORT_802_3_GET(pstats, a_in_range_length_errors) +
		PPORT_802_3_GET(pstats, a_out_of_range_length_field) +
		PPORT_802_3_GET(pstats, a_frame_too_long_errors);
	stats->rx_crc_errors =
		PPORT_802_3_GET(pstats, a_frame_check_sequence_errors);
	stats->rx_frame_errors = PPORT_802_3_GET(pstats, a_alignment_errors);
	stats->tx_aborted_errors = PPORT_2863_GET(pstats, if_out_discards);
	stats->rx_errors = stats->rx_length_errors + stats->rx_crc_errors +
			   stats->rx_frame_errors;
	stats->tx_errors = stats->tx_aborted_errors + stats->tx_carrier_errors;

	/* vport multicast also counts packets that are dropped due to steering
	 * or rx out of buffer
	 */
	stats->multicast =
		VPORT_COUNTER_GET(vstats, received_eth_multicast.packets);
}
static void mlx5e_set_rx_mode(struct net_device *dev)
{
	struct mlx5e_priv *priv = netdev_priv(dev);

	queue_work(priv->wq, &priv->set_rx_mode_work);
}
static int mlx5e_set_mac(struct net_device *netdev, void *addr)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	struct sockaddr *saddr = addr;

	if (!is_valid_ether_addr(saddr->sa_data))
		return -EADDRNOTAVAIL;

	netif_addr_lock_bh(netdev);
	ether_addr_copy(netdev->dev_addr, saddr->sa_data);
	netif_addr_unlock_bh(netdev);

	queue_work(priv->wq, &priv->set_rx_mode_work);

	return 0;
}
#define MLX5E_SET_FEATURE(features, feature, enable)	\
	do {						\
		if (enable)				\
			*features |= feature;		\
		else					\
			*features &= ~feature;		\
	} while (0)

typedef int (*mlx5e_feature_handler)(struct net_device *netdev, bool enable);
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packet headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailroom.
This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before    | after     | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte  | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte   | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

static int set_feature_lro(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	struct mlx5_core_dev *mdev = priv->mdev;
	struct mlx5e_channels new_channels = {};
	struct mlx5e_params *old_params;
	int err = 0;
	bool reset;

	mutex_lock(&priv->state_lock);

	if (enable && priv->xsk.refcnt) {
		netdev_warn(netdev, "LRO is incompatible with AF_XDP (%hu XSKs are active)\n",
			    priv->xsk.refcnt);
		err = -EINVAL;
		goto out;
	}

	old_params = &priv->channels.params;
	if (enable && !MLX5E_GET_PFLAG(old_params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
		netdev_warn(netdev, "can't set LRO with legacy RQ\n");
		err = -EINVAL;
		goto out;
	}

	reset = test_bit(MLX5E_STATE_OPENED, &priv->state);

	new_channels.params = *old_params;
	new_channels.params.lro_en = enable;

	if (old_params->rq_wq_type != MLX5_WQ_TYPE_CYCLIC) {
		if (mlx5e_rx_mpwqe_is_linear_skb(mdev, old_params, NULL) ==
		    mlx5e_rx_mpwqe_is_linear_skb(mdev, &new_channels.params, NULL))
			reset = false;
	}

	if (!reset) {
		*old_params = new_channels.params;
		err = mlx5e_modify_tirs_lro(priv);
		goto out;
	}

	err = mlx5e_safe_switch_channels(priv, &new_channels, mlx5e_modify_tirs_lro);
out:
	mutex_unlock(&priv->state_lock);
	return err;
}
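As a companion to the Striding RQ commit message quoted above set_feature_lro(),
here is a minimal sketch of the linear-SKB eligibility rule it describes
(the whole stride fits in a page and HW LRO is off). The helper name and
parameters are illustrative only; in the driver the actual decision is made
by mlx5e_rx_mpwqe_is_linear_skb(), called above.

/* Illustrative only: a linear SKB on a Striding RQ requires the whole
 * stride (headroom + MTU-sized frame + tailroom) to fit in one page, so
 * the packet never crosses a page boundary, and HW LRO must be off.
 */
static inline bool sketch_striding_rq_linear_skb_ok(u32 mtu, u32 headroom,
						    u32 tailroom, bool lro_en)
{
	u32 stride_size = headroom + mtu + tailroom;

	return !lro_en && stride_size <= PAGE_SIZE;
}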
static int set_feature_cvlan_filter(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);

	if (enable)
		mlx5e_enable_cvlan_filter(priv);
	else
		mlx5e_disable_cvlan_filter(priv);

	return 0;
}
#ifdef CONFIG_MLX5_ESWITCH
static int set_feature_tc_num_filters(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);

	if (!enable && mlx5e_tc_num_filters(priv, MLX5_TC_FLAG(NIC_OFFLOAD))) {
		netdev_err(netdev,
			   "Active offloaded tc filters, can't turn hw_tc_offload off\n");
		return -EINVAL;
	}

	return 0;
}
#endif
static int set_feature_rx_all(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	struct mlx5_core_dev *mdev = priv->mdev;

	return mlx5_set_port_fcs(mdev, !enable);
}
static int set_feature_rx_fcs(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	int err;

	mutex_lock(&priv->state_lock);

	priv->channels.params.scatter_fcs_en = enable;
	err = mlx5e_modify_channels_scatter_fcs(&priv->channels, enable);
	if (err)
		priv->channels.params.scatter_fcs_en = !enable;

	mutex_unlock(&priv->state_lock);

	return err;
}
static int set_feature_rx_vlan(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	int err = 0;

	mutex_lock(&priv->state_lock);

	priv->channels.params.vlan_strip_disable = !enable;
	if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
		goto unlock;

	err = mlx5e_modify_channels_vsd(&priv->channels, !enable);
	if (err)
		priv->channels.params.vlan_strip_disable = enable;

unlock:
	mutex_unlock(&priv->state_lock);

	return err;
}
#ifdef CONFIG_MLX5_EN_ARFS
static int set_feature_arfs(struct net_device *netdev, bool enable)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	int err;

	if (enable)
		err = mlx5e_arfs_enable(priv);
	else
		err = mlx5e_arfs_disable(priv);

	return err;
}
#endif
2016-04-24 13:51:51 -06:00
|
|
|
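/* Editorial sketch (not from the original source) of the helper below:
 * mlx5e_handle_feature() acts only on feature bits that actually toggle
 * relative to netdev->features (the XOR test), invokes the per-feature
 * handler, and on success mirrors the new state into *features so that
 * mlx5e_set_features() can roll back to the last consistent feature set
 * if a later handler fails.
 */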
static int mlx5e_handle_feature(struct net_device *netdev,
|
2018-01-10 08:11:11 -07:00
|
|
|
netdev_features_t *features,
|
2016-04-24 13:51:51 -06:00
|
|
|
netdev_features_t wanted_features,
|
|
|
|
netdev_features_t feature,
|
|
|
|
mlx5e_feature_handler feature_handler)
|
|
|
|
{
|
|
|
|
netdev_features_t changes = wanted_features ^ netdev->features;
|
|
|
|
bool enable = !!(wanted_features & feature);
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (!(changes & feature))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err = feature_handler(netdev, enable);
|
|
|
|
if (err) {
|
2017-09-12 08:51:12 -06:00
|
|
|
netdev_err(netdev, "%s feature %pNF failed, err %d\n",
|
|
|
|
enable ? "Enable" : "Disable", &feature, err);
|
2016-04-24 13:51:51 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2018-01-10 08:11:11 -07:00
|
|
|
MLX5E_SET_FEATURE(features, feature, enable);
|
2016-04-24 13:51:51 -06:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-05-16 03:36:43 -06:00
|
|
|
int mlx5e_set_features(struct net_device *netdev, netdev_features_t features)
|
2016-04-24 13:51:51 -06:00
|
|
|
{
|
2018-01-10 08:11:11 -07:00
|
|
|
netdev_features_t oper_features = netdev->features;
|
2018-01-11 09:46:20 -07:00
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
#define MLX5E_HANDLE_FEATURE(feature, handler) \
|
|
|
|
mlx5e_handle_feature(netdev, &oper_features, features, feature, handler)
|
2016-04-24 13:51:51 -06:00
|
|
|
|
2018-01-11 09:46:20 -07:00
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_LRO, set_feature_lro);
|
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_VLAN_CTAG_FILTER,
|
2017-09-10 08:51:10 -06:00
|
|
|
set_feature_cvlan_filter);
|
2018-10-18 04:31:27 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
2018-01-11 09:46:20 -07:00
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_TC, set_feature_tc_num_filters);
|
2018-10-18 04:31:27 -06:00
|
|
|
#endif
|
2018-01-11 09:46:20 -07:00
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_RXALL, set_feature_rx_all);
|
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_RXFCS, set_feature_rx_fcs);
|
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_HW_VLAN_CTAG_RX, set_feature_rx_vlan);
|
2018-07-12 04:01:26 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_ARFS
|
2018-01-11 09:46:20 -07:00
|
|
|
err |= MLX5E_HANDLE_FEATURE(NETIF_F_NTUPLE, set_feature_arfs);
|
2016-04-28 16:36:42 -06:00
|
|
|
#endif
|
2016-04-24 13:51:51 -06:00
|
|
|
|
2018-01-10 08:11:11 -07:00
|
|
|
if (err) {
|
|
|
|
netdev->features = oper_features;
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-09-10 01:36:43 -06:00
|
|
|
static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
|
|
|
|
netdev_features_t features)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2018-04-02 07:28:10 -06:00
|
|
|
struct mlx5e_params *params;
|
2017-09-10 01:36:43 -06:00
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
2018-04-02 07:28:10 -06:00
|
|
|
params = &priv->channels.params;
|
2017-09-10 01:36:43 -06:00
|
|
|
if (!bitmap_empty(priv->fs.vlan.active_svlans, VLAN_N_VID)) {
|
|
|
|
/* HW strips the outer C-tag header; this is a problem
|
|
|
|
* for S-tag traffic.
|
|
|
|
*/
|
|
|
|
features &= ~NETIF_F_HW_VLAN_CTAG_RX;
|
2018-04-02 07:28:10 -06:00
|
|
|
if (!params->vlan_strip_disable)
|
2017-09-10 01:36:43 -06:00
|
|
|
netdev_warn(netdev, "Dropping C-tag vlan stripping offload due to S-tag vlan\n");
|
|
|
|
}
|
2018-04-02 07:28:10 -06:00
|
|
|
if (!MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
|
2019-06-27 08:24:57 -06:00
|
|
|
if (features & NETIF_F_LRO) {
|
2018-04-02 07:28:10 -06:00
|
|
|
netdev_warn(netdev, "Disabling LRO, not supported in legacy RQ\n");
|
2019-06-27 08:24:57 -06:00
|
|
|
features &= ~NETIF_F_LRO;
|
|
|
|
}
|
2018-04-02 07:28:10 -06:00
|
|
|
}
|
|
|
|
|
2019-05-23 13:55:10 -06:00
|
|
|
if (MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS)) {
|
|
|
|
features &= ~NETIF_F_RXHASH;
|
|
|
|
if (netdev->features & NETIF_F_RXHASH)
|
|
|
|
netdev_warn(netdev, "Disabling rxhash, not supported when CQE compress is active\n");
|
|
|
|
}
|
|
|
|
|
2017-09-10 01:36:43 -06:00
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs run simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
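/* Worked example (editorial, with assumed numbers): for a hypothetical XSK
 * chunk_size of 4096 bytes and a total headroom of 256 bytes, criterion 1
 * in mlx5e_xsk_validate_mtu() caps the HW frame at 4096 - 256 = 3840 bytes
 * before the HW-to-SW MTU conversion, while criterion 2 independently caps
 * the SKB built on XDP_PASS at PAGE_SIZE; the effective MTU limit is the
 * minimum of the two values.
 */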
static bool mlx5e_xsk_validate_mtu(struct net_device *netdev,
|
|
|
|
struct mlx5e_channels *chs,
|
|
|
|
struct mlx5e_params *new_params,
|
|
|
|
struct mlx5_core_dev *mdev)
|
|
|
|
{
|
|
|
|
u16 ix;
|
|
|
|
|
|
|
|
for (ix = 0; ix < chs->params.num_channels; ix++) {
|
|
|
|
struct xdp_umem *umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, ix);
|
|
|
|
struct mlx5e_xsk_param xsk;
|
|
|
|
|
|
|
|
if (!umem)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
mlx5e_build_xsk_param(umem, &xsk);
|
|
|
|
|
|
|
|
if (!mlx5e_validate_xsk_param(new_params, &xsk, mdev)) {
|
|
|
|
u32 hr = mlx5e_get_linear_rq_headroom(new_params, &xsk);
|
|
|
|
int max_mtu_frame, max_mtu_page, max_mtu;
|
|
|
|
|
|
|
|
/* Two criteria must be met:
|
|
|
|
* 1. HW MTU + all headrooms <= XSK frame size.
|
|
|
|
* 2. Size of SKBs allocated on XDP_PASS <= PAGE_SIZE.
|
|
|
|
*/
|
|
|
|
max_mtu_frame = MLX5E_HW2SW_MTU(new_params, xsk.chunk_size - hr);
|
|
|
|
max_mtu_page = mlx5e_xdp_max_mtu(new_params, &xsk);
|
|
|
|
max_mtu = min(max_mtu_frame, max_mtu_page);
|
|
|
|
|
|
|
|
netdev_err(netdev, "MTU %d is too big for an XSK running on channel %hu. Try MTU <= %d\n",
|
|
|
|
new_params->sw_mtu, ix, max_mtu);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2018-04-01 07:54:27 -06:00
|
|
|
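/* Editorial summary (not from the original source): mlx5e_change_mtu()
 * below performs a full channel reset only when needed. The reset is
 * skipped when LRO is enabled or the interface is down; in striding RQ
 * mode it is additionally required only if the RQ is linear (hw_mtu is
 * used directly in the data path) or the packets-per-WQE count changes.
 */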
int mlx5e_change_mtu(struct net_device *netdev, int new_mtu,
|
|
|
|
change_hw_mtu_cb set_mtu_cb)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2017-02-12 16:19:14 -07:00
|
|
|
struct mlx5e_channels new_channels = {};
|
2018-03-12 06:24:41 -06:00
|
|
|
struct mlx5e_params *params;
|
2015-07-29 06:05:46 -06:00
|
|
|
int err = 0;
|
2016-08-18 12:09:03 -06:00
|
|
|
bool reset;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
2015-07-29 06:05:46 -06:00
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
params = &priv->channels.params;
|
2016-08-18 12:09:03 -06:00
|
|
|
|
2018-02-11 06:21:33 -07:00
|
|
|
reset = !params->lro_en;
|
2017-02-12 16:19:14 -07:00
|
|
|
reset = reset && test_bit(MLX5E_STATE_OPENED, &priv->state);
|
2015-07-29 06:05:46 -06:00
|
|
|
|
2018-02-11 06:21:33 -07:00
|
|
|
new_channels.params = *params;
|
|
|
|
new_channels.params.sw_mtu = new_mtu;
|
|
|
|
|
2017-12-31 06:50:13 -07:00
|
|
|
if (params->xdp_prog &&
|
2019-06-26 08:35:38 -06:00
|
|
|
!mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
|
2017-12-31 06:50:13 -07:00
|
|
|
netdev_err(netdev, "MTU(%d) > %d is not allowed while XDP enabled\n",
|
2019-06-26 08:35:35 -06:00
|
|
|
new_mtu, mlx5e_xdp_max_mtu(params, NULL));
|
2017-12-31 06:50:13 -07:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
if (priv->xsk.refcnt &&
|
|
|
|
!mlx5e_xsk_validate_mtu(netdev, &priv->channels,
|
|
|
|
&new_channels.params, priv->mdev)) {
|
2017-12-31 06:50:13 -07:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
|
2019-06-26 08:35:38 -06:00
|
|
|
bool is_linear = mlx5e_rx_mpwqe_is_linear_skb(priv->mdev,
|
|
|
|
&new_channels.params,
|
|
|
|
NULL);
|
|
|
|
u8 ppw_old = mlx5e_mpwqe_log_pkts_per_wqe(params, NULL);
|
|
|
|
u8 ppw_new = mlx5e_mpwqe_log_pkts_per_wqe(&new_channels.params, NULL);
|
|
|
|
|
|
|
|
/* If XSK is active, XSK RQs are linear. */
|
|
|
|
is_linear |= priv->xsk.refcnt;
|
2018-02-11 06:21:33 -07:00
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
/* Always reset in linear mode - hw_mtu is used in data path. */
|
net/mlx5e: RX, verify received packet size in Linear Striding RQ
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that WQE (RX descriptor) can be used for many packets and so the WQE is
much bigger than the MTU. In virtualization setups where the port MTU can
be larger than the VF MTU, a received packet bigger than the MTU won't be
dropped by HW as too big for the receive WQE. If we use a linear SKB in
striding RQ, since each stride has room for an MTU-sized payload plus skb
info, an oversized packet can lead to a crash by crossing the allocated
page boundary upon the call to build_skb, so the driver needs to check the
packet size and drop oversized packets.
Introduce new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.
As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 22:31:10 -06:00
|
|
|
reset = reset && (is_linear || (ppw_old != ppw_new));
|
2018-02-11 06:21:33 -07:00
|
|
|
}
|
|
|
|
|
2017-02-12 16:19:14 -07:00
|
|
|
if (!reset) {
|
2018-03-12 06:24:41 -06:00
|
|
|
params->sw_mtu = new_mtu;
|
2018-06-05 02:32:11 -06:00
|
|
|
if (set_mtu_cb)
|
|
|
|
set_mtu_cb(priv);
|
2018-03-12 06:24:41 -06:00
|
|
|
netdev->mtu = params->sw_mtu;
|
2017-02-12 16:19:14 -07:00
|
|
|
goto out;
|
|
|
|
}
|
2015-07-29 06:05:46 -06:00
|
|
|
|
2018-11-26 08:22:16 -07:00
|
|
|
err = mlx5e_safe_switch_channels(priv, &new_channels, set_mtu_cb);
|
2018-03-12 06:24:41 -06:00
|
|
|
if (err)
|
2017-02-12 16:19:14 -07:00
|
|
|
goto out;
|
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
netdev->mtu = new_channels.params.sw_mtu;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-02-12 16:19:14 -07:00
|
|
|
out:
|
|
|
|
mutex_unlock(&priv->state_lock);
|
2015-05-28 13:28:48 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2018-04-01 07:54:27 -06:00
|
|
|
static int mlx5e_change_nic_mtu(struct net_device *netdev, int new_mtu)
|
|
|
|
{
|
|
|
|
return mlx5e_change_mtu(netdev, new_mtu, mlx5e_set_dev_port_mtu);
|
|
|
|
}
|
|
|
|
|
2017-08-15 04:46:04 -06:00
|
|
|
int mlx5e_hwstamp_set(struct mlx5e_priv *priv, struct ifreq *ifr)
|
|
|
|
{
|
|
|
|
struct hwtstamp_config config;
|
|
|
|
int err;
|
|
|
|
|
2018-07-29 04:29:45 -06:00
|
|
|
if (!MLX5_CAP_GEN(priv->mdev, device_frequency_khz) ||
|
|
|
|
(mlx5_clock_get_ptp_index(priv->mdev) == -1))
|
2017-08-15 04:46:04 -06:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
/* TX HW timestamp */
|
|
|
|
switch (config.tx_type) {
|
|
|
|
case HWTSTAMP_TX_OFF:
|
|
|
|
case HWTSTAMP_TX_ON:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -ERANGE;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
|
|
|
/* RX HW timestamp */
|
|
|
|
switch (config.rx_filter) {
|
|
|
|
case HWTSTAMP_FILTER_NONE:
|
|
|
|
/* Reset CQE compression to Admin default */
|
|
|
|
mlx5e_modify_rx_cqe_compression_locked(priv, priv->channels.params.rx_cqe_compress_def);
|
|
|
|
break;
|
|
|
|
case HWTSTAMP_FILTER_ALL:
|
|
|
|
case HWTSTAMP_FILTER_SOME:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_EVENT:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_SYNC:
|
|
|
|
case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
|
|
|
|
case HWTSTAMP_FILTER_NTP_ALL:
|
|
|
|
/* Disable CQE compression */
|
2019-05-24 13:24:43 -06:00
|
|
|
if (MLX5E_GET_PFLAG(&priv->channels.params, MLX5E_PFLAG_RX_CQE_COMPRESS))
|
|
|
|
netdev_warn(priv->netdev, "Disabling RX cqe compression\n");
|
2017-08-15 04:46:04 -06:00
|
|
|
err = mlx5e_modify_rx_cqe_compression_locked(priv, false);
|
|
|
|
if (err) {
|
|
|
|
netdev_err(priv->netdev, "Failed disabling cqe compression err=%d\n", err);
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
config.rx_filter = HWTSTAMP_FILTER_ALL;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
return -ERANGE;
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(&priv->tstamp, &config, sizeof(config));
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
2019-05-23 13:55:10 -06:00
|
|
|
/* might need to fix some features */
|
|
|
|
netdev_update_features(priv->netdev);
|
|
|
|
|
2017-08-15 04:46:04 -06:00
|
|
|
return copy_to_user(ifr->ifr_data, &config,
|
|
|
|
sizeof(config)) ? -EFAULT : 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int mlx5e_hwstamp_get(struct mlx5e_priv *priv, struct ifreq *ifr)
|
|
|
|
{
|
|
|
|
struct hwtstamp_config *cfg = &priv->tstamp;
|
|
|
|
|
|
|
|
if (!MLX5_CAP_GEN(priv->mdev, device_frequency_khz))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return copy_to_user(ifr->ifr_data, cfg, sizeof(*cfg)) ? -EFAULT : 0;
|
|
|
|
}
|
|
|
|
|
2015-12-29 05:58:31 -07:00
|
|
|
static int mlx5e_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
|
|
|
|
{
|
2017-06-01 05:56:17 -06:00
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
|
2015-12-29 05:58:31 -07:00
|
|
|
switch (cmd) {
|
|
|
|
case SIOCSHWTSTAMP:
|
2017-06-01 05:56:17 -06:00
|
|
|
return mlx5e_hwstamp_set(priv, ifr);
|
2015-12-29 05:58:31 -07:00
|
|
|
case SIOCGHWTSTAMP:
|
2017-06-01 05:56:17 -06:00
|
|
|
return mlx5e_hwstamp_get(priv, ifr);
|
2015-12-29 05:58:31 -07:00
|
|
|
default:
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-06-05 06:17:12 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
2018-11-01 11:14:21 -06:00
|
|
|
int mlx5e_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
|
2015-12-01 09:03:25 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_set_vport_mac(mdev->priv.eswitch, vf + 1, mac);
|
|
|
|
}
|
|
|
|
|
2016-09-22 03:11:15 -06:00
|
|
|
static int mlx5e_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos,
|
|
|
|
__be16 vlan_proto)
|
2015-12-01 09:03:25 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
2016-09-22 03:11:15 -06:00
|
|
|
if (vlan_proto != htons(ETH_P_8021Q))
|
|
|
|
return -EPROTONOSUPPORT;
|
|
|
|
|
2015-12-01 09:03:25 -07:00
|
|
|
return mlx5_eswitch_set_vport_vlan(mdev->priv.eswitch, vf + 1,
|
|
|
|
vlan, qos);
|
|
|
|
}
|
|
|
|
|
2016-05-03 08:13:59 -06:00
|
|
|
static int mlx5e_set_vf_spoofchk(struct net_device *dev, int vf, bool setting)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_set_vport_spoofchk(mdev->priv.eswitch, vf + 1, setting);
|
|
|
|
}
|
|
|
|
|
2016-05-03 08:14:04 -06:00
|
|
|
static int mlx5e_set_vf_trust(struct net_device *dev, int vf, bool setting)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_set_vport_trust(mdev->priv.eswitch, vf + 1, setting);
|
|
|
|
}
|
2016-08-11 02:28:21 -06:00
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
int mlx5e_set_vf_rate(struct net_device *dev, int vf, int min_tx_rate,
|
|
|
|
int max_tx_rate)
|
2016-08-11 02:28:21 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_set_vport_rate(mdev->priv.eswitch, vf + 1,
|
2016-12-15 05:02:53 -07:00
|
|
|
max_tx_rate, min_tx_rate);
|
2016-08-11 02:28:21 -06:00
|
|
|
}
|
|
|
|
|
2015-12-01 09:03:25 -07:00
|
|
|
static int mlx5_vport_link2ifla(u8 esw_link)
|
|
|
|
{
|
|
|
|
switch (esw_link) {
|
2018-08-08 17:23:49 -06:00
|
|
|
case MLX5_VPORT_ADMIN_STATE_DOWN:
|
2015-12-01 09:03:25 -07:00
|
|
|
return IFLA_VF_LINK_STATE_DISABLE;
|
2018-08-08 17:23:49 -06:00
|
|
|
case MLX5_VPORT_ADMIN_STATE_UP:
|
2015-12-01 09:03:25 -07:00
|
|
|
return IFLA_VF_LINK_STATE_ENABLE;
|
|
|
|
}
|
|
|
|
return IFLA_VF_LINK_STATE_AUTO;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5_ifla_link2vport(u8 ifla_link)
|
|
|
|
{
|
|
|
|
switch (ifla_link) {
|
|
|
|
case IFLA_VF_LINK_STATE_DISABLE:
|
2018-08-08 17:23:49 -06:00
|
|
|
return MLX5_VPORT_ADMIN_STATE_DOWN;
|
2015-12-01 09:03:25 -07:00
|
|
|
case IFLA_VF_LINK_STATE_ENABLE:
|
2018-08-08 17:23:49 -06:00
|
|
|
return MLX5_VPORT_ADMIN_STATE_UP;
|
2015-12-01 09:03:25 -07:00
|
|
|
}
|
2018-08-08 17:23:49 -06:00
|
|
|
return MLX5_VPORT_ADMIN_STATE_AUTO;
|
2015-12-01 09:03:25 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_set_vf_link_state(struct net_device *dev, int vf,
|
|
|
|
int link_state)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_set_vport_state(mdev->priv.eswitch, vf + 1,
|
|
|
|
mlx5_ifla_link2vport(link_state));
|
|
|
|
}
|
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
int mlx5e_get_vf_config(struct net_device *dev,
|
|
|
|
int vf, struct ifla_vf_info *ivi)
|
2015-12-01 09:03:25 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mlx5_eswitch_get_vport_config(mdev->priv.eswitch, vf + 1, ivi);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
ivi->linkstate = mlx5_vport_link2ifla(ivi->linkstate);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
int mlx5e_get_vf_stats(struct net_device *dev,
|
|
|
|
int vf, struct ifla_vf_stats *vf_stats)
|
2015-12-01 09:03:25 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
|
|
|
return mlx5_eswitch_get_vport_stats(mdev->priv.eswitch, vf + 1,
|
|
|
|
vf_stats);
|
|
|
|
}
|
2017-06-05 06:17:12 -06:00
|
|
|
#endif
|
2015-12-01 09:03:25 -07:00
|
|
|
|
2018-05-08 02:49:43 -06:00
|
|
|
struct mlx5e_vxlan_work {
|
|
|
|
struct work_struct work;
|
|
|
|
struct mlx5e_priv *priv;
|
|
|
|
u16 port;
|
|
|
|
};
|
|
|
|
|
|
|
|
static void mlx5e_vxlan_add_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct mlx5e_vxlan_work *vxlan_work =
|
|
|
|
container_of(work, struct mlx5e_vxlan_work, work);
|
|
|
|
struct mlx5e_priv *priv = vxlan_work->priv;
|
|
|
|
u16 port = vxlan_work->port;
|
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
2018-05-09 14:28:00 -06:00
|
|
|
mlx5_vxlan_add_port(priv->mdev->vxlan, port);
|
2018-05-08 02:49:43 -06:00
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
|
|
|
kfree(vxlan_work);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_vxlan_del_work(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct mlx5e_vxlan_work *vxlan_work =
|
|
|
|
container_of(work, struct mlx5e_vxlan_work, work);
|
|
|
|
struct mlx5e_priv *priv = vxlan_work->priv;
|
|
|
|
u16 port = vxlan_work->port;
|
|
|
|
|
|
|
|
mutex_lock(&priv->state_lock);
|
2018-05-09 14:28:00 -06:00
|
|
|
mlx5_vxlan_del_port(priv->mdev->vxlan, port);
|
2018-05-08 02:49:43 -06:00
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
kfree(vxlan_work);
|
|
|
|
}
|
|
|
|
|
|
|
|
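/* Editorial note (assumption, not from the original source): the UDP
 * tunnel add/del callbacks may run in atomic context, which is presumably
 * why the work item is allocated with GFP_ATOMIC and the actual VXLAN
 * port add/del is deferred to the driver workqueue, where state_lock can
 * be taken safely.
 */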
static void mlx5e_vxlan_queue_work(struct mlx5e_priv *priv, u16 port, int add)
|
|
|
|
{
|
|
|
|
struct mlx5e_vxlan_work *vxlan_work;
|
|
|
|
|
|
|
|
vxlan_work = kmalloc(sizeof(*vxlan_work), GFP_ATOMIC);
|
|
|
|
if (!vxlan_work)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (add)
|
|
|
|
INIT_WORK(&vxlan_work->work, mlx5e_vxlan_add_work);
|
|
|
|
else
|
|
|
|
INIT_WORK(&vxlan_work->work, mlx5e_vxlan_del_work);
|
|
|
|
|
|
|
|
vxlan_work->priv = priv;
|
|
|
|
vxlan_work->port = port;
|
|
|
|
queue_work(priv->wq, &vxlan_work->work);
|
|
|
|
}
|
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
void mlx5e_add_vxlan_port(struct net_device *netdev, struct udp_tunnel_info *ti)
|
2016-02-22 09:17:32 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
|
2016-06-16 13:22:38 -06:00
|
|
|
if (ti->type != UDP_TUNNEL_TYPE_VXLAN)
|
|
|
|
return;
|
|
|
|
|
2018-05-09 14:28:00 -06:00
|
|
|
if (!mlx5_vxlan_allowed(priv->mdev->vxlan))
|
2016-02-22 09:17:32 -07:00
|
|
|
return;
|
|
|
|
|
2018-02-13 01:31:26 -07:00
|
|
|
mlx5e_vxlan_queue_work(priv, be16_to_cpu(ti->port), 1);
|
2016-02-22 09:17:32 -07:00
|
|
|
}
|
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
void mlx5e_del_vxlan_port(struct net_device *netdev, struct udp_tunnel_info *ti)
|
2016-02-22 09:17:32 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
|
2016-06-16 13:22:38 -06:00
|
|
|
if (ti->type != UDP_TUNNEL_TYPE_VXLAN)
|
|
|
|
return;
|
|
|
|
|
2018-05-09 14:28:00 -06:00
|
|
|
if (!mlx5_vxlan_allowed(priv->mdev->vxlan))
|
2016-02-22 09:17:32 -07:00
|
|
|
return;
|
|
|
|
|
2018-02-13 01:31:26 -07:00
|
|
|
mlx5e_vxlan_queue_work(priv, be16_to_cpu(ti->port), 0);
|
2016-02-22 09:17:32 -07:00
|
|
|
}
|
|
|
|
|
2017-08-13 04:34:42 -06:00
|
|
|
static netdev_features_t mlx5e_tunnel_features_check(struct mlx5e_priv *priv,
|
|
|
|
struct sk_buff *skb,
|
|
|
|
netdev_features_t features)
|
2016-02-22 09:17:32 -07:00
|
|
|
{
|
2017-11-21 08:49:36 -07:00
|
|
|
unsigned int offset = 0;
|
2016-02-22 09:17:32 -07:00
|
|
|
struct udphdr *udph;
|
2017-08-13 04:34:42 -06:00
|
|
|
u8 proto;
|
|
|
|
u16 port;
|
2016-02-22 09:17:32 -07:00
|
|
|
|
|
|
|
switch (vlan_get_protocol(skb)) {
|
|
|
|
case htons(ETH_P_IP):
|
|
|
|
proto = ip_hdr(skb)->protocol;
|
|
|
|
break;
|
|
|
|
case htons(ETH_P_IPV6):
|
2017-11-21 08:49:36 -07:00
|
|
|
proto = ipv6_find_hdr(skb, &offset, -1, NULL, NULL);
|
2016-02-22 09:17:32 -07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2017-08-13 04:34:42 -06:00
|
|
|
switch (proto) {
|
|
|
|
case IPPROTO_GRE:
|
2019-08-19 19:59:11 -06:00
|
|
|
case IPPROTO_IPIP:
|
|
|
|
case IPPROTO_IPV6:
|
2017-08-13 04:34:42 -06:00
|
|
|
return features;
|
|
|
|
case IPPROTO_UDP:
|
2016-02-22 09:17:32 -07:00
|
|
|
udph = udp_hdr(skb);
|
|
|
|
port = be16_to_cpu(udph->dest);
|
|
|
|
|
2017-08-13 04:34:42 -06:00
|
|
|
/* Verify if UDP port is being offloaded by HW */
|
2018-05-09 14:28:00 -06:00
|
|
|
if (mlx5_vxlan_lookup_port(priv->mdev->vxlan, port))
|
2017-08-13 04:34:42 -06:00
|
|
|
return features;
|
2019-03-21 16:51:38 -06:00
|
|
|
|
|
|
|
#if IS_ENABLED(CONFIG_GENEVE)
|
|
|
|
/* Support Geneve offload for default UDP port */
|
|
|
|
if (port == GENEVE_UDP_PORT && mlx5_geneve_tx_allowed(priv->mdev))
|
|
|
|
return features;
|
|
|
|
#endif
|
2017-08-13 04:34:42 -06:00
|
|
|
}
|
2016-02-22 09:17:32 -07:00
|
|
|
|
|
|
|
out:
|
|
|
|
/* Disable CSUM and GSO if the udp dport is not offloaded by HW */
|
|
|
|
return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
|
|
|
|
}
|
|
|
|
|
2018-11-01 11:14:21 -06:00
|
|
|
netdev_features_t mlx5e_features_check(struct sk_buff *skb,
|
|
|
|
struct net_device *netdev,
|
|
|
|
netdev_features_t features)
|
2016-02-22 09:17:32 -07:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
|
|
|
|
features = vlan_features_check(skb, features);
|
|
|
|
features = vxlan_features_check(skb, features);
|
|
|
|
|
2017-04-18 07:08:23 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
if (mlx5e_ipsec_feature_check(skb, netdev, features))
|
|
|
|
return features;
|
|
|
|
#endif
|
|
|
|
|
2016-02-22 09:17:32 -07:00
|
|
|
/* Validate if the tunneled packet is being offloaded by HW */
|
|
|
|
if (skb->encapsulation &&
|
|
|
|
(features & NETIF_F_CSUM_MASK || features & NETIF_F_GSO_MASK))
|
2017-08-13 04:34:42 -06:00
|
|
|
return mlx5e_tunnel_features_check(priv, skb, features);
|
2016-02-22 09:17:32 -07:00
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
2018-01-16 08:25:06 -07:00
|
|
|
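/* Editorial summary (not from the original source): the work item below
 * runs under rtnl_lock and state_lock, walks every TX queue the stack
 * still reports as stopped, and hands each stalled SQ to the tx health
 * reporter; only if reporting fails does it fall back to a full safe
 * reopen of the channels.
 */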
static void mlx5e_tx_timeout_work(struct work_struct *work)
|
2016-06-30 08:34:45 -06:00
|
|
|
{
|
2018-01-16 08:25:06 -07:00
|
|
|
struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv,
|
|
|
|
tx_timeout_work);
|
net/mlx5e: Add tx timeout support for mlx5e tx reporter
With this patch, ndo_tx_timeout callback will be redirected to the tx
reporter in order to detect a tx timeout error and report it to the
devlink health. (The watchdog detects tx timeouts, but the driver verifies
that the issue still exists before launching any recovery method).
In addition, recovery from a tx timeout caused by a lost interrupt was added
to the tx reporter recover method. Recovering from a tx timeout due to a lost
interrupt is not a new feature in the driver; this patch reorganizes the
functionality and moves it into the tx reporter recovery flow.
tx timeout example:
(with auto_recover set to false, if set to true, the manual recover and
diagnose sections are irrelevant)
$cat /sys/kernel/debug/tracing/trace
...
devlink_health_report: bus_name=pci dev_name=0000:00:09.0
driver_name=mlx5_core reporter_name=tx: TX timeout on queue: 0, SQ: 0x8a,
CQ: 0x35, SQ Cons: 0x2 SQ Prod: 0x2, usecs since last trans: 14912000
$devlink health show
pci/0000:00:09.0:
name tx
state healthy #err 1 #recover 0 last_dump_ts N/A
parameters:
grace_period 500 auto_recover false
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": true
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: true
sqn: 142 HW state: 1 stopped: false
$devlink health recover pci/0000:00:09 reporter tx
$devlink health show
pci/0000:00:09.0:
name tx
state healthy #err 1 #recover 1 last_dump_ts N/A
parameters:
grace_period 500 auto_recover false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 02:36:41 -07:00
|
|
|
bool report_failed = false;
|
|
|
|
int err;
|
|
|
|
int i;
|
2016-06-30 08:34:45 -06:00
|
|
|
|
2018-01-16 08:25:06 -07:00
|
|
|
rtnl_lock();
|
|
|
|
mutex_lock(&priv->state_lock);
|
|
|
|
|
|
|
|
if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
|
|
|
|
goto unlock;
|
2016-06-30 08:34:45 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
for (i = 0; i < priv->channels.num * priv->channels.params.num_tc; i++) {
|
2019-02-07 02:36:41 -07:00
|
|
|
struct netdev_queue *dev_queue =
|
|
|
|
netdev_get_tx_queue(priv->netdev, i);
|
2016-12-20 13:48:19 -07:00
|
|
|
struct mlx5e_txqsq *sq = priv->txq2sq[i];
|
2016-06-30 08:34:45 -06:00
|
|
|
|
2017-12-20 02:31:28 -07:00
|
|
|
if (!netif_xmit_stopped(dev_queue))
|
2016-06-30 08:34:45 -06:00
|
|
|
continue;
|
2018-01-16 08:25:06 -07:00
|
|
|
|
2019-07-01 06:51:51 -06:00
|
|
|
if (mlx5e_reporter_tx_timeout(sq))
|
2019-02-07 02:36:41 -07:00
|
|
|
report_failed = true;
|
2016-06-30 08:34:45 -06:00
|
|
|
}
|
|
|
|
|
2019-02-07 02:36:41 -07:00
|
|
|
if (!report_failed)
|
2019-01-25 11:53:23 -07:00
|
|
|
goto unlock;
|
|
|
|
|
2019-03-28 06:26:47 -06:00
|
|
|
err = mlx5e_safe_reopen_channels(priv);
|
2019-01-25 11:53:23 -07:00
|
|
|
if (err)
|
|
|
|
netdev_err(priv->netdev,
|
2019-03-28 06:26:47 -06:00
|
|
|
"mlx5e_safe_reopen_channels failed recovering from a tx_timeout, err(%d).\n",
|
2019-01-25 11:53:23 -07:00
|
|
|
err);
|
|
|
|
|
2018-01-16 08:25:06 -07:00
|
|
|
unlock:
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
rtnl_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_tx_timeout(struct net_device *dev)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
|
|
|
|
netdev_err(dev, "TX timeout detected\n");
|
|
|
|
queue_work(priv->wq, &priv->tx_timeout_work);
|
2016-06-30 08:34:45 -06:00
|
|
|
}
|
|
|
|
|
2017-12-31 06:50:13 -07:00
|
|
|
static int mlx5e_xdp_allowed(struct mlx5e_priv *priv, struct bpf_prog *prog)
|
2018-03-12 10:26:51 -06:00
|
|
|
{
|
|
|
|
struct net_device *netdev = priv->netdev;
|
2017-12-31 06:50:13 -07:00
|
|
|
struct mlx5e_channels new_channels = {};
|
2018-03-12 10:26:51 -06:00
|
|
|
|
|
|
|
if (priv->channels.params.lro_en) {
|
|
|
|
netdev_warn(netdev, "can't set XDP while LRO is on, disable LRO first\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (MLX5_IPSEC_DEV(priv->mdev)) {
|
|
|
|
netdev_warn(netdev, "can't set XDP with IPSec offload\n");
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2017-12-31 06:50:13 -07:00
|
|
|
new_channels.params = priv->channels.params;
|
|
|
|
new_channels.params.xdp_prog = prog;
|
|
|
|
|
	/* No XSK params: AF_XDP can't be enabled yet at the point of setting
	 * the XDP program.
	 */
	if (!mlx5e_rx_is_linear_skb(&new_channels.params, NULL)) {
		netdev_warn(netdev, "XDP is not allowed with MTU(%d) > %d\n",
			    new_channels.params.sw_mtu,
			    mlx5e_xdp_max_mtu(&new_channels.params, NULL));
		return -EINVAL;
	}

	return 0;
}
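
The XSK commit message above describes how regular and XSK RQs share one ID namespace split into two halves. A rough, illustrative mapping under that assumption (the helper below is not the driver's actual code; its name and exact layout are assumptions made for clarity):

/* Illustrative only: split a queue id from the shared namespace into a
 * channel index and an "is XSK" flag, assuming the lower half of the
 * namespace holds regular RQs and the upper half holds XSK RQs.
 */
static bool mlx5e_qid_to_channel_sketch(u32 qid, u32 num_channels,
					u32 *ch_ix, bool *is_xsk)
{
	if (qid >= 2 * num_channels)
		return false;		/* outside both halves */

	*is_xsk = qid >= num_channels;	/* upper half -> XSK RQs */
	*ch_ix = *is_xsk ? qid - num_channels : qid;
	return true;
}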

static int mlx5e_xdp_update_state(struct mlx5e_priv *priv)
{
	if (priv->channels.params.xdp_prog)
		mlx5e_xdp_set_open(priv);
	else
		mlx5e_xdp_set_closed(priv);

	return 0;
}
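
mlx5e_xdp_update_state() is used below as the preactivate callback of mlx5e_safe_switch_channels(), so the XDP open/closed state is refreshed whenever the channels are rebuilt. The mlx5e_xdp_set_open()/mlx5e_xdp_set_closed() helpers are defined elsewhere in the driver; a minimal sketch of their intent, assuming they only track an "XDP is active" flag in the private state (the flag name below is an assumption):

/* Sketch, not the real helpers: remember whether an XDP program is
 * attached so other parts of the driver can test a single state bit.
 */
static void mlx5e_xdp_set_open_sketch(struct mlx5e_priv *priv)
{
	set_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
}

static void mlx5e_xdp_set_closed_sketch(struct mlx5e_priv *priv)
{
	clear_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
}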

static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	struct bpf_prog *old_prog;
	bool reset, was_opened;
	int err = 0;
	int i;

	mutex_lock(&priv->state_lock);

	if (prog) {
		err = mlx5e_xdp_allowed(priv, prog);
		if (err)
			goto unlock;
	}

	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
	/* no need for full reset when exchanging programs */
	reset = (!priv->channels.params.xdp_prog || !prog);

	if (was_opened && !reset) {
		/* num_channels is invariant here, so we can take the
		 * batched reference right upfront.
		 */
		prog = bpf_prog_add(prog, priv->channels.num);
		if (IS_ERR(prog)) {
			err = PTR_ERR(prog);
			goto unlock;
		}
	}

	if (was_opened && reset) {
		struct mlx5e_channels new_channels = {};

		new_channels.params = priv->channels.params;
		new_channels.params.xdp_prog = prog;
		mlx5e_set_rq_type(priv->mdev, &new_channels.params);
		old_prog = priv->channels.params.xdp_prog;

		err = mlx5e_safe_switch_channels(priv, &new_channels, mlx5e_xdp_update_state);
		if (err)
			goto unlock;
	} else {
		/* exchange programs; the extra prog reference we got from the
		 * caller holds as long as we don't fail from this point onwards.
		 */
		old_prog = xchg(&priv->channels.params.xdp_prog, prog);
	}

	if (old_prog)
		bpf_prog_put(old_prog);

	if (!was_opened && reset) /* change RQ type according to priv->xdp_prog */
		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);

	if (!was_opened || reset)
		goto unlock;

	/* exchanging programs w/o reset, we update ref counts on behalf
	 * of the channels RQs here.
	 */
	for (i = 0; i < priv->channels.num; i++) {
		struct mlx5e_channel *c = priv->channels.c[i];
		bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);

		clear_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
		if (xsk_open)
			clear_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
		napi_synchronize(&c->napi);
		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */

		old_prog = xchg(&c->rq.xdp_prog, prog);
		if (old_prog)
			bpf_prog_put(old_prog);

		if (xsk_open) {
			old_prog = xchg(&c->xskrq.xdp_prog, prog);
			if (old_prog)
				bpf_prog_put(old_prog);
		}

		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
		if (xsk_open)
			set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
		/* napi_schedule in case we have missed anything */
		napi_schedule(&c->napi);
	}

unlock:
	mutex_unlock(&priv->state_lock);
	return err;
}

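mlx5e_xdp_set() is reached from the driver's ndo_bpf hook when user space attaches or detaches a program on the netdev. A hedged user-space sketch of the attach side, assuming libbpf's bpf_set_link_xdp_fd() and a prog_fd obtained from an already loaded XDP program (this is not driver code):

#include <net/if.h>
#include <bpf/libbpf.h>

/* Attach an already loaded XDP program to an interface. The kernel routes
 * this through the device's ndo_bpf hook, which for mlx5e ends up calling
 * mlx5e_xdp_set().
 */
static int attach_xdp_sketch(const char *ifname, int prog_fd)
{
	int ifindex = if_nametoindex(ifname);

	if (!ifindex)
		return -1;
	/* flags == 0: default attach mode */
	return bpf_set_link_xdp_fd(ifindex, prog_fd, 0);
}
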
static u32 mlx5e_xdp_query(struct net_device *dev)
{
	struct mlx5e_priv *priv = netdev_priv(dev);
	const struct bpf_prog *xdp_prog;
	u32 prog_id = 0;

|
|
|
|
2017-06-15 18:29:11 -06:00
|
|
|
mutex_lock(&priv->state_lock);
|
|
|
|
xdp_prog = priv->channels.params.xdp_prog;
|
|
|
|
if (xdp_prog)
|
|
|
|
prog_id = xdp_prog->aux->id;
|
|
|
|
mutex_unlock(&priv->state_lock);
|
|
|
|
|
|
|
|
return prog_id;
|
2016-09-21 03:19:46 -06:00
|
|
|
}
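The prog_id returned by mlx5e_xdp_query() is what userspace sees when it asks which XDP program is attached. As a rough sketch of that consumer side (not part of this driver; it assumes the libbpf helper bpf_get_link_xdp_id(), which newer libbpf releases rename to bpf_xdp_query_id(), and a made-up interface name):

#include <stdio.h>
#include <net/if.h>
#include <linux/types.h>
#include <bpf/libbpf.h>

int main(void)
{
        __u32 prog_id = 0;
        int ifindex = if_nametoindex("eth0");   /* hypothetical interface */

        if (!ifindex || bpf_get_link_xdp_id(ifindex, &prog_id, 0))
                return 1;
        /* 0 means no program attached; otherwise it matches xdp_prog->aux->id */
        printf("attached XDP prog id: %u\n", prog_id);
        return 0;
}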
|
|
|
|
|
2017-11-03 14:56:16 -06:00
|
|
|
static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
|
2016-09-21 03:19:46 -06:00
|
|
|
{
|
|
|
|
switch (xdp->command) {
|
|
|
|
case XDP_SETUP_PROG:
|
|
|
|
return mlx5e_xdp_set(dev, xdp->prog);
|
|
|
|
case XDP_QUERY_PROG:
|
2017-06-15 18:29:11 -06:00
|
|
|
xdp->prog_id = mlx5e_xdp_query(dev);
|
2016-09-21 03:19:46 -06:00
|
|
|
return 0;
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
case XDP_SETUP_XSK_UMEM:
|
|
|
|
return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
|
|
|
|
xdp->xsk.queue_id);
|
2016-09-21 03:19:46 -06:00
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
}
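The AF_XDP commit message above describes a single RQ-ID namespace split into two halves, with regular RQs in the lower half and XSK RQs in the upper half. The helpers below are only an illustration of that convention (the names and the exact mapping are hypothetical, not the driver's real functions):

#include <linux/types.h>

/* Illustrative split: IDs [0, n) address regular RQs, IDs [n, 2n) address
 * XSK RQs, where n is the current number of channels. */
static inline bool example_rq_id_is_xsk(u32 rq_id, u32 num_channels)
{
        return rq_id >= num_channels;
}

static inline u32 example_rq_id_to_channel(u32 rq_id, u32 num_channels)
{
        return example_rq_id_is_xsk(rq_id, num_channels) ?
               rq_id - num_channels : rq_id;
}

Seen this way, it is also clear why the channel count must stay fixed while zero-copy sockets are active: the upper half of the namespace is anchored to the current number of channels.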
|
|
|
|
|
2019-02-19 20:40:34 -07:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
|
|
|
static int mlx5e_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
|
|
|
|
struct net_device *dev, u32 filter_mask,
|
|
|
|
int nlflags)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
u8 mode, setting;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mlx5_eswitch_get_vepa(mdev->priv.eswitch, &setting);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
mode = setting ? BRIDGE_MODE_VEPA : BRIDGE_MODE_VEB;
|
|
|
|
return ndo_dflt_bridge_getlink(skb, pid, seq, dev,
|
|
|
|
mode,
|
|
|
|
0, 0, nlflags, filter_mask, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_bridge_setlink(struct net_device *dev, struct nlmsghdr *nlh,
|
|
|
|
u16 flags, struct netlink_ext_ack *extack)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(dev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
struct nlattr *attr, *br_spec;
|
|
|
|
u16 mode = BRIDGE_MODE_UNDEF;
|
|
|
|
u8 setting;
|
|
|
|
int rem;
|
|
|
|
|
|
|
|
br_spec = nlmsg_find_attr(nlh, sizeof(struct ifinfomsg), IFLA_AF_SPEC);
|
|
|
|
if (!br_spec)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
nla_for_each_nested(attr, br_spec, rem) {
|
|
|
|
if (nla_type(attr) != IFLA_BRIDGE_MODE)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (nla_len(attr) < sizeof(mode))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
mode = nla_get_u16(attr);
|
|
|
|
if (mode > BRIDGE_MODE_VEPA)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mode == BRIDGE_MODE_UNDEF)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
setting = (mode == BRIDGE_MODE_VEPA) ? 1 : 0;
|
|
|
|
return mlx5_eswitch_set_vepa(mdev->priv.eswitch, setting);
|
|
|
|
}
|
|
|
|
#endif
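In practice these two NDOs are usually exercised through iproute2: something like `bridge link set dev <uplink> hwmode vepa` (or `veb`) ends up in mlx5e_bridge_setlink() as an IFLA_BRIDGE_MODE attribute nested under IFLA_AF_SPEC, and `bridge link show` reads the mode back through mlx5e_bridge_getlink(). The exact command syntax depends on the iproute2 version and is given here only as an illustration.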
|
|
|
|
|
2018-09-05 02:43:23 -06:00
|
|
|
const struct net_device_ops mlx5e_netdev_ops = {
|
2015-05-28 13:28:48 -06:00
|
|
|
.ndo_open = mlx5e_open,
|
|
|
|
.ndo_stop = mlx5e_close,
|
|
|
|
.ndo_start_xmit = mlx5e_xmit,
|
2017-08-07 02:15:22 -06:00
|
|
|
.ndo_setup_tc = mlx5e_setup_tc,
|
2016-02-22 09:17:26 -07:00
|
|
|
.ndo_select_queue = mlx5e_select_queue,
|
2015-05-28 13:28:48 -06:00
|
|
|
.ndo_get_stats64 = mlx5e_get_stats,
|
|
|
|
.ndo_set_rx_mode = mlx5e_set_rx_mode,
|
|
|
|
.ndo_set_mac_address = mlx5e_set_mac,
|
2016-02-09 05:57:44 -07:00
|
|
|
.ndo_vlan_rx_add_vid = mlx5e_vlan_rx_add_vid,
|
|
|
|
.ndo_vlan_rx_kill_vid = mlx5e_vlan_rx_kill_vid,
|
2015-05-28 13:28:48 -06:00
|
|
|
.ndo_set_features = mlx5e_set_features,
|
2017-09-10 01:36:43 -06:00
|
|
|
.ndo_fix_features = mlx5e_fix_features,
|
2018-04-01 07:54:27 -06:00
|
|
|
.ndo_change_mtu = mlx5e_change_nic_mtu,
|
2016-02-09 05:57:44 -07:00
|
|
|
.ndo_do_ioctl = mlx5e_ioctl,
|
2016-06-23 08:02:38 -06:00
|
|
|
.ndo_set_tx_maxrate = mlx5e_set_tx_maxrate,
|
2017-06-06 08:46:49 -06:00
|
|
|
.ndo_udp_tunnel_add = mlx5e_add_vxlan_port,
|
|
|
|
.ndo_udp_tunnel_del = mlx5e_del_vxlan_port,
|
|
|
|
.ndo_features_check = mlx5e_features_check,
|
2016-06-30 08:34:45 -06:00
|
|
|
.ndo_tx_timeout = mlx5e_tx_timeout,
|
2017-11-03 14:56:16 -06:00
|
|
|
.ndo_bpf = mlx5e_xdp,
|
2018-05-22 07:48:48 -06:00
|
|
|
.ndo_xdp_xmit = mlx5e_xdp_xmit,
|
2019-08-14 01:27:16 -06:00
|
|
|
.ndo_xsk_wakeup = mlx5e_xsk_wakeup,
|
2018-07-12 04:01:26 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_ARFS
|
|
|
|
.ndo_rx_flow_steer = mlx5e_rx_flow_steer,
|
|
|
|
#endif
|
2017-06-05 06:17:12 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
2019-02-19 20:40:34 -07:00
|
|
|
.ndo_bridge_setlink = mlx5e_bridge_setlink,
|
|
|
|
.ndo_bridge_getlink = mlx5e_bridge_getlink,
|
|
|
|
|
2017-06-06 08:46:49 -06:00
|
|
|
/* SRIOV E-Switch NDOs */
|
2016-02-09 05:57:44 -07:00
|
|
|
.ndo_set_vf_mac = mlx5e_set_vf_mac,
|
|
|
|
.ndo_set_vf_vlan = mlx5e_set_vf_vlan,
|
2016-05-03 08:13:59 -06:00
|
|
|
.ndo_set_vf_spoofchk = mlx5e_set_vf_spoofchk,
|
2016-05-03 08:14:04 -06:00
|
|
|
.ndo_set_vf_trust = mlx5e_set_vf_trust,
|
2016-08-11 02:28:21 -06:00
|
|
|
.ndo_set_vf_rate = mlx5e_set_vf_rate,
|
2016-02-09 05:57:44 -07:00
|
|
|
.ndo_get_vf_config = mlx5e_get_vf_config,
|
|
|
|
.ndo_set_vf_link_state = mlx5e_set_vf_link_state,
|
|
|
|
.ndo_get_vf_stats = mlx5e_get_vf_stats,
|
2017-06-05 06:17:12 -06:00
|
|
|
#endif
|
2015-05-28 13:28:48 -06:00
|
|
|
};
|
|
|
|
|
|
|
|
static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
|
|
|
|
{
|
|
|
|
if (MLX5_CAP_GEN(mdev, port_type) != MLX5_CAP_PORT_TYPE_ETH)
|
2017-01-11 10:35:41 -07:00
|
|
|
return -EOPNOTSUPP;
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!MLX5_CAP_GEN(mdev, eth_net_offloads) ||
|
|
|
|
!MLX5_CAP_GEN(mdev, nic_flow_table) ||
|
|
|
|
!MLX5_CAP_ETH(mdev, csum_cap) ||
|
|
|
|
!MLX5_CAP_ETH(mdev, max_lso_cap) ||
|
|
|
|
!MLX5_CAP_ETH(mdev, vlan_cap) ||
|
2015-06-11 05:47:30 -06:00
|
|
|
!MLX5_CAP_ETH(mdev, rss_ind_tbl_cap) ||
|
|
|
|
MLX5_CAP_FLOWTABLE(mdev,
|
|
|
|
flow_table_properties_nic_receive.max_ft_level)
|
|
|
|
< 3) {
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5_core_warn(mdev,
|
|
|
|
"Not creating net device, some required device capabilities are missing\n");
|
2017-01-11 10:35:41 -07:00
|
|
|
return -EOPNOTSUPP;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
2015-11-12 10:35:26 -07:00
|
|
|
if (!MLX5_CAP_ETH(mdev, self_lb_en_modifiable))
|
|
|
|
mlx5_core_warn(mdev, "Self loop back prevention is not supported\n");
|
2016-03-01 15:13:37 -07:00
|
|
|
if (!MLX5_CAP_GEN(mdev, cq_moderation))
|
2017-06-07 08:01:51 -06:00
|
|
|
mlx5_core_warn(mdev, "CQ moderation is not supported\n");
|
2015-11-12 10:35:26 -07:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-06-07 04:55:34 -06:00
|
|
|
void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
|
2016-02-29 12:17:13 -07:00
|
|
|
int num_channels)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < len; i++)
|
|
|
|
indirection_rqt[i] = i % num_channels;
|
|
|
|
}
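For example, with len = 8 and num_channels = 3 the loop above fills the table with {0, 1, 2, 0, 1, 2, 0, 1}: RSS buckets are spread round-robin over the channels, and when len is not a multiple of num_channels the first (len % num_channels) channels each receive one extra bucket.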
|
|
|
|
|
2018-01-17 08:39:07 -07:00
|
|
|
static bool slow_pci_heuristic(struct mlx5_core_dev *mdev)
|
2016-05-10 15:29:16 -06:00
|
|
|
{
|
2018-01-17 08:39:07 -07:00
|
|
|
u32 link_speed = 0;
|
|
|
|
u32 pci_bw = 0;
|
2016-05-10 15:29:16 -06:00
|
|
|
|
2018-02-22 12:22:56 -07:00
|
|
|
mlx5e_port_max_linkspeed(mdev, &link_speed);
|
2018-04-06 19:31:06 -06:00
|
|
|
pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL);
|
2018-01-17 08:39:07 -07:00
|
|
|
mlx5_core_dbg_once(mdev, "Max link speed = %d, PCI BW = %d\n",
|
|
|
|
link_speed, pci_bw);
|
|
|
|
|
|
|
|
#define MLX5E_SLOW_PCI_RATIO (2)
|
|
|
|
|
|
|
|
return link_speed && pci_bw &&
|
|
|
|
link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw;
|
2017-04-26 04:42:04 -06:00
|
|
|
}
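To make the heuristic concrete, here is a standalone sketch with illustrative numbers; both arguments are in Mb/s, as reported by the two query helpers above, and the function name is made up for the example:

#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_SLOW_PCI_RATIO 2        /* mirrors MLX5E_SLOW_PCI_RATIO above */

static bool example_slow_pci(uint32_t link_speed, uint32_t pci_bw)
{
        /* An unknown (zero) speed or bandwidth never triggers the heuristic. */
        return link_speed && pci_bw &&
               link_speed > EXAMPLE_SLOW_PCI_RATIO * pci_bw;
}

/*
 * example_slow_pci(100000, 31500) -> true:  a 100G port behind roughly a
 *                                           PCIe Gen3 x4 link is PCI-bound.
 * example_slow_pci(100000, 63000) -> false: roughly a Gen3 x8 link is fast
 *                                           enough to keep up.
 */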
|
|
|
|
|
2019-01-31 07:44:48 -07:00
|
|
|
static struct dim_cq_moder mlx5e_get_def_tx_moderation(u8 cq_period_mode)
|
2017-09-26 07:20:43 -06:00
|
|
|
{
|
2019-01-31 07:44:48 -07:00
|
|
|
struct dim_cq_moder moder;
|
2018-04-24 04:36:03 -06:00
|
|
|
|
|
|
|
moder.cq_period_mode = cq_period_mode;
|
|
|
|
moder.pkts = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS;
|
|
|
|
moder.usec = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC;
|
|
|
|
if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
|
|
|
|
moder.usec = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE;
|
|
|
|
|
|
|
|
return moder;
|
|
|
|
}
|
2017-09-26 07:20:43 -06:00
|
|
|
|
2019-01-31 07:44:48 -07:00
|
|
|
static struct dim_cq_moder mlx5e_get_def_rx_moderation(u8 cq_period_mode)
|
2018-04-24 04:36:03 -06:00
|
|
|
{
|
2019-01-31 07:44:48 -07:00
|
|
|
struct dim_cq_moder moder;
|
2017-09-26 07:20:43 -06:00
|
|
|
|
2018-04-24 04:36:03 -06:00
|
|
|
moder.cq_period_mode = cq_period_mode;
|
|
|
|
moder.pkts = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS;
|
|
|
|
moder.usec = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
|
2017-09-26 07:20:43 -06:00
|
|
|
if (cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE)
|
2018-04-24 04:36:03 -06:00
|
|
|
moder.usec = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE;
|
|
|
|
|
|
|
|
return moder;
|
|
|
|
}
|
|
|
|
|
|
|
|
static u8 mlx5_to_net_dim_cq_period_mode(u8 cq_period_mode)
|
|
|
|
{
|
|
|
|
return cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE ?
|
2018-11-05 03:07:52 -07:00
|
|
|
DIM_CQ_PERIOD_MODE_START_FROM_CQE :
|
|
|
|
DIM_CQ_PERIOD_MODE_START_FROM_EQE;
|
2018-04-24 04:36:03 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
|
|
|
|
{
|
|
|
|
if (params->tx_dim_enabled) {
|
|
|
|
u8 dim_period_mode = mlx5_to_net_dim_cq_period_mode(cq_period_mode);
|
|
|
|
|
|
|
|
params->tx_cq_moderation = net_dim_get_def_tx_moderation(dim_period_mode);
|
|
|
|
} else {
|
|
|
|
params->tx_cq_moderation = mlx5e_get_def_tx_moderation(cq_period_mode);
|
|
|
|
}
|
2017-09-26 07:20:43 -06:00
|
|
|
|
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_TX_CQE_BASED_MODER,
|
|
|
|
params->tx_cq_moderation.cq_period_mode ==
|
|
|
|
MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
|
|
|
|
}
|
|
|
|
|
2016-06-23 08:02:40 -06:00
|
|
|
void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode)
|
|
|
|
{
|
2018-01-09 14:06:17 -07:00
|
|
|
if (params->rx_dim_enabled) {
|
2018-04-24 04:36:03 -06:00
|
|
|
u8 dim_period_mode = mlx5_to_net_dim_cq_period_mode(cq_period_mode);
|
|
|
|
|
|
|
|
params->rx_cq_moderation = net_dim_get_def_rx_moderation(dim_period_mode);
|
|
|
|
} else {
|
|
|
|
params->rx_cq_moderation = mlx5e_get_def_rx_moderation(cq_period_mode);
|
2018-01-09 14:06:17 -07:00
|
|
|
}
|
2017-03-30 10:23:41 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_BASED_MODER,
|
2017-09-26 07:20:43 -06:00
|
|
|
params->rx_cq_moderation.cq_period_mode ==
|
|
|
|
MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
|
2016-06-23 08:02:40 -06:00
|
|
|
}
|
|
|
|
|
2018-01-31 05:45:40 -07:00
|
|
|
static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
|
2016-10-25 09:36:29 -06:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* The supported periods are organized in ascending order */
|
|
|
|
for (i = 0; i < MLX5E_LRO_TIMEOUT_ARR_SIZE - 1; i++)
|
|
|
|
if (MLX5_CAP_ETH(mdev, lro_timer_supported_periods[i]) >= wanted_timeout)
|
|
|
|
break;
|
|
|
|
|
|
|
|
return MLX5_CAP_ETH(mdev, lro_timer_supported_periods[i]);
|
|
|
|
}
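The loop above simply returns the first supported period that is at least as large as the requested timeout, falling back to the largest one. A standalone sketch of the same selection (the array contents are illustrative; the real values come from the device's lro_timer_supported_periods capability):

#include <stdint.h>

/* supported[] is assumed to be sorted in ascending order and n >= 1. */
static uint32_t example_choose_lro_timeout(const uint32_t *supported, int n,
                                           uint32_t wanted)
{
        int i;

        for (i = 0; i < n - 1; i++)
                if (supported[i] >= wanted)
                        break;
        return supported[i];
}

/* e.g. with supported = {8, 16, 32, 1024}: wanted 30 -> 32, wanted 5000 -> 1024 */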
|
|
|
|
|
2018-08-16 05:25:24 -06:00
|
|
|
void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params)
|
|
|
|
{
|
|
|
|
/* Prefer Striding RQ, unless any of the following holds:
|
|
|
|
* - Striding RQ configuration is not possible/supported.
|
|
|
|
* - Slow PCI heuristic.
|
|
|
|
* - Legacy RQ would use linear SKB while Striding RQ would use non-linear.
|
2019-06-26 08:35:38 -06:00
|
|
|
*
|
|
|
|
* No XSK params: checking the availability of striding RQ in general.
|
2018-08-16 05:25:24 -06:00
|
|
|
*/
|
|
|
|
if (!slow_pci_heuristic(mdev) &&
|
|
|
|
mlx5e_striding_rq_possible(mdev, params) &&
|
2019-06-26 08:35:38 -06:00
|
|
|
(mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL) ||
|
|
|
|
!mlx5e_rx_is_linear_skb(params, NULL)))
|
2018-08-16 05:25:24 -06:00
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ, true);
|
|
|
|
mlx5e_set_rq_type(mdev, params);
|
|
|
|
mlx5e_init_rq_type_params(mdev, params);
|
|
|
|
}
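Put differently, the Striding RQ pflag stays off whenever the PCI link is the bottleneck, whenever the device cannot support the configuration, or whenever the legacy RQ could already use a linear SKB while Striding RQ could not; mlx5e_set_rq_type() and mlx5e_init_rq_type_params() then turn the resulting pflag into the WQ type and sizes used when the channels are opened.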
|
|
|
|
|
2018-11-06 12:05:29 -07:00
|
|
|
void mlx5e_build_rss_params(struct mlx5e_rss_params *rss_params,
|
|
|
|
u16 num_channels)
|
2018-08-19 06:01:13 -06:00
|
|
|
{
|
2018-10-23 07:03:33 -06:00
|
|
|
enum mlx5e_traffic_types tt;
|
|
|
|
|
2018-08-31 05:29:16 -06:00
|
|
|
rss_params->hfunc = ETH_RSS_HASH_TOP;
|
2018-11-06 12:05:29 -07:00
|
|
|
netdev_rss_key_fill(rss_params->toeplitz_hash_key,
|
|
|
|
sizeof(rss_params->toeplitz_hash_key));
|
|
|
|
mlx5e_build_default_indir_rqt(rss_params->indirection_rqt,
|
|
|
|
MLX5E_INDIR_RQT_SIZE, num_channels);
|
2018-10-23 07:03:33 -06:00
|
|
|
for (tt = 0; tt < MLX5E_NUM_INDIR_TIRS; tt++)
|
|
|
|
rss_params->rx_hash_fields[tt] =
|
|
|
|
tirc_default_config[tt].rx_hash_fields;
|
2018-08-19 06:01:13 -06:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:56 -06:00
|
|
|
void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk *xsk,
|
2018-11-06 12:05:29 -07:00
|
|
|
struct mlx5e_rss_params *rss_params,
|
2017-04-12 21:36:56 -06:00
|
|
|
struct mlx5e_params *params,
|
2018-03-12 06:24:41 -06:00
|
|
|
u16 max_channels, u16 mtu)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
2018-03-30 16:50:08 -06:00
|
|
|
u8 rx_cq_period_mode;
|
2016-09-21 03:19:45 -06:00
|
|
|
|
2018-03-12 06:24:41 -06:00
|
|
|
params->sw_mtu = mtu;
|
|
|
|
params->hard_mtu = MLX5E_ETH_HARD_MTU;
|
2016-12-21 08:24:35 -07:00
|
|
|
params->num_channels = max_channels;
|
|
|
|
params->num_tc = 1;
|
2016-10-25 09:36:29 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
/* SQ */
|
|
|
|
params->log_sq_size = is_kdump_kernel() ?
|
2016-11-22 02:03:32 -07:00
|
|
|
MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE :
|
|
|
|
MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve one received
packet of any size up to MTU, or up to 64K when device LRO is enabled,
which is very wasteful when dealing with small packets or when device
LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packet loss before | packet loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
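Spelled out, the footprint figures quoted above are: before, 1024 WQEs of one MTU-sized buffer each, roughly 1024 x 1500 B ≈ 1.5 MB per ring; after, 16 WQEs x 2048 strides x 64 B = 2,097,152 B = 2 MB per ring, with small packets consuming only as many 64-byte strides as they need.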
|
|
|
|
2018-11-20 02:50:30 -07:00
|
|
|
/* XDP SQ */
|
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_XDP_TX_MPWQE,
|
|
|
|
MLX5_CAP_ETH(mdev, enhanced_multi_pkt_send_wqe));
|
|
|
|
|
2016-05-10 15:29:16 -06:00
|
|
|
/* set CQE compression */
|
2016-12-21 08:24:35 -07:00
|
|
|
params->rx_cqe_compress_def = false;
|
2016-05-10 15:29:16 -06:00
|
|
|
if (MLX5_CAP_GEN(mdev, cqe_compression) &&
|
2017-05-28 06:40:43 -06:00
|
|
|
MLX5_CAP_GEN(mdev, vport_group_manager))
|
2018-01-17 08:39:07 -07:00
|
|
|
params->rx_cqe_compress_def = slow_pci_heuristic(mdev);
|
2017-04-26 04:42:04 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, params->rx_cqe_compress_def);
|
2018-07-01 02:58:38 -06:00
|
|
|
MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_NO_CSUM_COMPLETE, false);
|
2016-12-21 08:24:35 -07:00
|
|
|
|
|
|
|
/* RQ */
|
2018-08-16 05:25:24 -06:00
|
|
|
mlx5e_build_rq_params(mdev, params);
|
2016-05-10 15:29:16 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
/* HW LRO */
|
2017-05-18 08:03:21 -06:00
|
|
|
|
2017-04-12 21:36:58 -06:00
|
|
|
/* TODO: && MLX5_CAP_ETH(mdev, lro_cap) */
|
2019-06-26 08:35:38 -06:00
|
|
|
if (params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
|
|
|
|
/* No XSK params: checking the availability of striding RQ in general. */
|
|
|
|
if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL))
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands a memcpy of the packet headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailroom.
This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
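Restating the two conditions above as a single check, here is a sketch only (kernel context assumed for u32 and PAGE_SIZE); the driver's real test is mlx5e_rx_mpwqe_is_linear_skb(), which also accounts for details such as XDP headroom that are omitted here:

/* True when one MTU-sized frame plus SKB headroom and tailroom fits into a
 * single page-sized stride, so build_skb() can be used on Striding RQ. */
static inline bool example_mpwqe_linear_fits(u32 mtu, u32 headroom,
                                             u32 tailroom)
{
        return mtu + headroom + tailroom <= PAGE_SIZE;
}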
|
|
|
params->lro_en = !slow_pci_heuristic(mdev);
|
2019-06-26 08:35:38 -06:00
|
|
|
}
|
2016-12-21 08:24:35 -07:00
|
|
|
params->lro_timeout = mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_LRO_TIMEOUT);
|
2017-02-22 08:20:14 -07:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
/* CQ moderation params */
|
2018-03-30 16:50:08 -06:00
|
|
|
rx_cq_period_mode = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
|
2016-12-21 08:24:35 -07:00
|
|
|
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
|
|
|
|
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
|
2018-01-09 14:06:17 -07:00
|
|
|
params->rx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
|
2018-04-24 04:36:03 -06:00
|
|
|
params->tx_dim_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
|
2018-03-30 16:50:08 -06:00
|
|
|
mlx5e_set_rx_cq_mode_params(params, rx_cq_period_mode);
|
|
|
|
mlx5e_set_tx_cq_mode_params(params, MLX5_CQ_PERIOD_MODE_START_FROM_EQE);
|
2016-06-23 08:02:40 -06:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
/* TX inline */
|
2019-07-01 03:08:08 -06:00
|
|
|
mlx5_query_min_inline(mdev, &params->tx_min_inline_mode);
|
2016-12-06 04:53:49 -07:00
|
|
|
|
2016-12-21 08:24:35 -07:00
|
|
|
/* RSS */
|
2018-11-06 12:05:29 -07:00
|
|
|
mlx5e_build_rss_params(rss_params, params->num_channels);
|
2019-01-20 02:04:34 -07:00
|
|
|
params->tunneled_offload_en =
|
|
|
|
mlx5e_tunnel_inner_ft_supported(mdev);
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when the XSK queues
are opened. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to open
new ones, but the XSK queues will be closed. To cover these cases, the
validation is also performed in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to a missing XDP program or the interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it already got a successful
return value for the MTU change or XSK open operation.
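(A hedged outline of the early check: validate both when a UMEM is
registered and when the MTU changes, so a later channel open cannot fail
after the caller already saw success. The constraint and names are
illustrative only.)

#include <stddef.h>
#include <stdbool.h>
#include <errno.h>

struct xsk_cfg { unsigned int chunk_size; };

static bool xsk_params_ok(unsigned int mtu, const struct xsk_cfg *xsk)
{
	/* Example constraint only: a whole packet must fit into one UMEM chunk. */
	return xsk == NULL || mtu <= xsk->chunk_size;
}

/* Called from the MTU-change path while UMEMs are registered, and from
 * UMEM registration when the queues are not (yet) being created. */
static int validate_before_commit(unsigned int new_mtu, const struct xsk_cfg *xsk)
{
	return xsk_params_ok(new_mtu, xsk) ? 0 : -EINVAL;
}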
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
|
|
|
|
/* AF_XDP */
|
|
|
|
params->xsk = xsk;
|
2016-12-21 08:24:35 -07:00
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
static void mlx5e_set_netdev_dev_addr(struct net_device *netdev)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
|
2019-06-28 16:36:13 -06:00
|
|
|
mlx5_query_mac_address(priv->mdev, netdev->dev_addr);
|
2015-12-10 08:12:38 -07:00
|
|
|
if (is_zero_ether_addr(netdev->dev_addr) &&
|
|
|
|
!MLX5_CAP_GEN(priv->mdev, vport_group_manager)) {
|
|
|
|
eth_hw_addr_random(netdev);
|
|
|
|
mlx5_core_info(priv->mdev, "Assigned random MAC address %pM\n", netdev->dev_addr);
|
|
|
|
}
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
static void mlx5e_build_nic_netdev(struct net_device *netdev)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2016-04-24 13:51:52 -06:00
|
|
|
bool fcs_supported;
|
|
|
|
bool fcs_enabled;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-04-29 12:14:05 -06:00
|
|
|
SET_NETDEV_DEV(netdev, mdev->device);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-06-05 06:17:12 -06:00
|
|
|
netdev->netdev_ops = &mlx5e_netdev_ops;
|
|
|
|
|
2016-02-22 09:17:26 -07:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
2017-06-05 06:17:12 -06:00
|
|
|
if (MLX5_CAP_GEN(mdev, vport_group_manager) && MLX5_CAP_GEN(mdev, qos))
|
|
|
|
netdev->dcbnl_ops = &mlx5e_dcbnl_ops;
|
2016-02-22 09:17:26 -07:00
|
|
|
#endif
|
2015-12-01 09:03:25 -07:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->watchdog_timeo = 15 * HZ;
|
|
|
|
|
|
|
|
netdev->ethtool_ops = &mlx5e_ethtool_ops;
|
|
|
|
|
2015-06-11 05:47:31 -06:00
|
|
|
netdev->vlan_features |= NETIF_F_SG;
|
2019-06-05 10:40:09 -06:00
|
|
|
netdev->vlan_features |= NETIF_F_HW_CSUM;
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->vlan_features |= NETIF_F_GRO;
|
|
|
|
netdev->vlan_features |= NETIF_F_TSO;
|
|
|
|
netdev->vlan_features |= NETIF_F_TSO6;
|
|
|
|
netdev->vlan_features |= NETIF_F_RXCSUM;
|
|
|
|
netdev->vlan_features |= NETIF_F_RXHASH;
|
|
|
|
|
2019-06-05 11:01:08 -06:00
|
|
|
netdev->mpls_features |= NETIF_F_SG;
|
|
|
|
netdev->mpls_features |= NETIF_F_HW_CSUM;
|
|
|
|
netdev->mpls_features |= NETIF_F_TSO;
|
|
|
|
netdev->mpls_features |= NETIF_F_TSO6;
|
|
|
|
|
2017-08-17 07:44:16 -06:00
|
|
|
netdev->hw_enc_features |= NETIF_F_HW_VLAN_CTAG_TX;
|
|
|
|
netdev->hw_enc_features |= NETIF_F_HW_VLAN_CTAG_RX;
|
|
|
|
|
2018-04-02 07:28:10 -06:00
|
|
|
if (!!MLX5_CAP_ETH(mdev, lro_cap) &&
|
|
|
|
mlx5e_check_fragmented_striding_rq_cap(mdev))
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->vlan_features |= NETIF_F_LRO;
|
|
|
|
|
|
|
|
netdev->hw_features = netdev->vlan_features;
|
2015-11-02 23:07:23 -07:00
|
|
|
netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
|
|
|
|
netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
|
2017-09-10 04:22:51 -06:00
|
|
|
netdev->hw_features |= NETIF_F_HW_VLAN_STAG_TX;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-03-21 16:51:38 -06:00
|
|
|
if (mlx5_vxlan_allowed(mdev->vxlan) || mlx5_geneve_tx_allowed(mdev) ||
|
2019-08-19 18:36:29 -06:00
|
|
|
mlx5e_any_tunnel_proto_supported(mdev)) {
|
2019-06-05 10:40:09 -06:00
|
|
|
netdev->hw_enc_features |= NETIF_F_HW_CSUM;
|
2016-02-22 09:17:32 -07:00
|
|
|
netdev->hw_enc_features |= NETIF_F_TSO;
|
|
|
|
netdev->hw_enc_features |= NETIF_F_TSO6;
|
2017-08-13 04:34:42 -06:00
|
|
|
netdev->hw_enc_features |= NETIF_F_GSO_PARTIAL;
|
|
|
|
}
|
|
|
|
|
2019-03-21 16:51:38 -06:00
|
|
|
if (mlx5_vxlan_allowed(mdev->vxlan) || mlx5_geneve_tx_allowed(mdev)) {
|
2017-08-13 04:34:42 -06:00
|
|
|
netdev->hw_features |= NETIF_F_GSO_UDP_TUNNEL |
|
|
|
|
NETIF_F_GSO_UDP_TUNNEL_CSUM;
|
|
|
|
netdev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL |
|
|
|
|
NETIF_F_GSO_UDP_TUNNEL_CSUM;
|
2016-05-02 10:38:43 -06:00
|
|
|
netdev->gso_partial_features = NETIF_F_GSO_UDP_TUNNEL_CSUM;
|
2016-02-22 09:17:32 -07:00
|
|
|
}
|
|
|
|
|
2019-08-19 18:36:29 -06:00
|
|
|
if (mlx5e_tunnel_proto_supported(mdev, IPPROTO_GRE)) {
|
2017-08-13 04:34:42 -06:00
|
|
|
netdev->hw_features |= NETIF_F_GSO_GRE |
|
|
|
|
NETIF_F_GSO_GRE_CSUM;
|
|
|
|
netdev->hw_enc_features |= NETIF_F_GSO_GRE |
|
|
|
|
NETIF_F_GSO_GRE_CSUM;
|
|
|
|
netdev->gso_partial_features |= NETIF_F_GSO_GRE |
|
|
|
|
NETIF_F_GSO_GRE_CSUM;
|
|
|
|
}
|
|
|
|
|
2019-08-19 19:59:11 -06:00
|
|
|
if (mlx5e_tunnel_proto_supported(mdev, IPPROTO_IPIP)) {
|
|
|
|
netdev->hw_features |= NETIF_F_GSO_IPXIP4 |
|
|
|
|
NETIF_F_GSO_IPXIP6;
|
|
|
|
netdev->hw_enc_features |= NETIF_F_GSO_IPXIP4 |
|
|
|
|
NETIF_F_GSO_IPXIP6;
|
|
|
|
netdev->gso_partial_features |= NETIF_F_GSO_IPXIP4 |
|
|
|
|
NETIF_F_GSO_IPXIP6;
|
|
|
|
}
|
|
|
|
|
2018-06-30 13:14:27 -06:00
|
|
|
netdev->hw_features |= NETIF_F_GSO_PARTIAL;
|
|
|
|
netdev->gso_partial_features |= NETIF_F_GSO_UDP_L4;
|
|
|
|
netdev->hw_features |= NETIF_F_GSO_UDP_L4;
|
|
|
|
netdev->features |= NETIF_F_GSO_UDP_L4;
|
|
|
|
|
2016-04-24 13:51:52 -06:00
|
|
|
mlx5_query_port_fcs(mdev, &fcs_supported, &fcs_enabled);
|
|
|
|
|
|
|
|
if (fcs_supported)
|
|
|
|
netdev->hw_features |= NETIF_F_RXALL;
|
|
|
|
|
2017-02-20 07:18:17 -07:00
|
|
|
if (MLX5_CAP_ETH(mdev, scatter_fcs))
|
|
|
|
netdev->hw_features |= NETIF_F_RXFCS;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->features = netdev->hw_features;
|
2016-12-21 08:24:35 -07:00
|
|
|
if (!priv->channels.params.lro_en)
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->features &= ~NETIF_F_LRO;
|
|
|
|
|
2016-04-24 13:51:52 -06:00
|
|
|
if (fcs_enabled)
|
|
|
|
netdev->features &= ~NETIF_F_RXALL;
|
|
|
|
|
2017-02-20 07:18:17 -07:00
|
|
|
if (!priv->channels.params.scatter_fcs_en)
|
|
|
|
netdev->features &= ~NETIF_F_RXFCS;
|
|
|
|
|
2019-05-23 13:55:10 -06:00
|
|
|
/* prefer CQE compression over rxhash */
|
|
|
|
if (MLX5E_GET_PFLAG(&priv->channels.params, MLX5E_PFLAG_RX_CQE_COMPRESS))
|
|
|
|
netdev->features &= ~NETIF_F_RXHASH;
|
|
|
|
|
2016-03-08 03:42:36 -07:00
|
|
|
#define FT_CAP(f) MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_receive.f)
|
|
|
|
if (FT_CAP(flow_modify_en) &&
|
|
|
|
FT_CAP(modify_root) &&
|
|
|
|
FT_CAP(identified_miss_table_mode) &&
|
2016-04-28 16:36:40 -06:00
|
|
|
FT_CAP(flow_table_modify)) {
|
2018-10-18 04:31:27 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
2016-04-28 16:36:40 -06:00
|
|
|
netdev->hw_features |= NETIF_F_HW_TC;
|
2018-10-18 04:31:27 -06:00
|
|
|
#endif
|
2018-07-12 04:01:26 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_ARFS
|
2016-04-28 16:36:40 -06:00
|
|
|
netdev->hw_features |= NETIF_F_NTUPLE;
|
|
|
|
#endif
|
|
|
|
}
|
2016-03-08 03:42:36 -07:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev->features |= NETIF_F_HIGHDMA;
|
2017-09-10 01:36:43 -06:00
|
|
|
netdev->features |= NETIF_F_HW_VLAN_STAG_FILTER;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
netdev->priv_flags |= IFF_UNICAST_FLT;
|
|
|
|
|
|
|
|
mlx5e_set_netdev_dev_addr(netdev);
|
2017-04-18 07:04:28 -06:00
|
|
|
mlx5e_ipsec_build_netdev(priv);
|
2018-04-30 01:16:19 -06:00
|
|
|
mlx5e_tls_build_netdev(priv);
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
void mlx5e_create_q_counters(struct mlx5e_priv *priv)
|
2016-04-20 13:02:10 -06:00
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mlx5_core_alloc_q_counter(mdev, &priv->q_counter);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(mdev, "alloc queue counter failed, %d\n", err);
|
|
|
|
priv->q_counter = 0;
|
|
|
|
}
|
2018-02-08 06:09:57 -07:00
|
|
|
|
|
|
|
err = mlx5_core_alloc_q_counter(mdev, &priv->drop_rq_q_counter);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(mdev, "alloc drop RQ counter failed, %d\n", err);
|
|
|
|
priv->drop_rq_q_counter = 0;
|
|
|
|
}
|
2016-04-20 13:02:10 -06:00
|
|
|
}
|
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
void mlx5e_destroy_q_counters(struct mlx5e_priv *priv)
|
2016-04-20 13:02:10 -06:00
|
|
|
{
|
2018-02-08 06:09:57 -07:00
|
|
|
if (priv->q_counter)
|
|
|
|
mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
|
2016-04-20 13:02:10 -06:00
|
|
|
|
2018-02-08 06:09:57 -07:00
|
|
|
if (priv->drop_rq_q_counter)
|
|
|
|
mlx5_core_dealloc_q_counter(priv->mdev, priv->drop_rq_q_counter);
|
2016-04-20 13:02:10 -06:00
|
|
|
}
|
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
static int mlx5e_nic_init(struct mlx5_core_dev *mdev,
|
|
|
|
struct net_device *netdev,
|
|
|
|
const struct mlx5e_profile *profile,
|
|
|
|
void *ppriv)
|
2016-07-01 05:51:07 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2018-11-06 12:05:29 -07:00
|
|
|
struct mlx5e_rss_params *rss = &priv->rss_params;
|
2017-04-18 07:04:28 -06:00
|
|
|
int err;
|
2016-07-01 05:51:07 -06:00
|
|
|
|
2018-09-12 16:02:05 -06:00
|
|
|
err = mlx5e_netdev_init(netdev, priv, mdev, profile, ppriv);
|
2018-10-02 00:54:59 -06:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_build_nic_params(mdev, &priv->xsk, rss, &priv->channels.params,
|
2019-07-14 02:43:43 -06:00
|
|
|
priv->max_nch, netdev->mtu);
|
2018-09-12 16:02:05 -06:00
|
|
|
|
|
|
|
mlx5e_timestamp_init(priv);
|
|
|
|
|
2017-04-18 07:04:28 -06:00
|
|
|
err = mlx5e_ipsec_init(priv);
|
|
|
|
if (err)
|
|
|
|
mlx5_core_err(mdev, "IPSec initialization failed, %d\n", err);
|
2018-04-30 01:16:21 -06:00
|
|
|
err = mlx5e_tls_init(priv);
|
|
|
|
if (err)
|
|
|
|
mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
|
2016-07-01 05:51:07 -06:00
|
|
|
mlx5e_build_nic_netdev(netdev);
|
2018-05-29 01:54:47 -06:00
|
|
|
mlx5e_build_tc2txq_maps(priv);
|
2019-07-11 08:17:36 -06:00
|
|
|
mlx5e_health_create_reporters(priv);
|
2018-10-02 00:54:59 -06:00
|
|
|
|
|
|
|
return 0;
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
|
|
|
|
{
|
2019-07-11 08:17:36 -06:00
|
|
|
mlx5e_health_destroy_reporters(priv);
|
2018-04-30 01:16:21 -06:00
|
|
|
mlx5e_tls_cleanup(priv);
|
2017-04-18 07:04:28 -06:00
|
|
|
mlx5e_ipsec_cleanup(priv);
|
2018-10-02 00:54:59 -06:00
|
|
|
mlx5e_netdev_cleanup(priv->netdev, priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
int err;
|
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
mlx5e_create_q_counters(priv);
|
|
|
|
|
|
|
|
err = mlx5e_open_drop_rq(priv, &priv->drop_rq);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_err(mdev, "open drop rq failed, %d\n", err);
|
|
|
|
goto err_destroy_q_counters;
|
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:56 -06:00
|
|
|
err = mlx5e_create_indirect_rqt(priv);
|
|
|
|
if (err)
|
2018-08-04 21:58:05 -06:00
|
|
|
goto err_close_drop_rq;
|
2016-07-01 05:51:07 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
|
2017-04-12 21:36:56 -06:00
|
|
|
if (err)
|
2016-07-01 05:51:07 -06:00
|
|
|
goto err_destroy_indirect_rqts;
|
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
err = mlx5e_create_indirect_tirs(priv, true);
|
2017-04-12 21:36:56 -06:00
|
|
|
if (err)
|
2016-07-01 05:51:07 -06:00
|
|
|
goto err_destroy_direct_rqts;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
|
2017-04-12 21:36:56 -06:00
|
|
|
if (err)
|
2016-07-01 05:51:07 -06:00
|
|
|
goto err_destroy_indirect_tirs;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_create_direct_rqts(priv, priv->xsk_tir);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto err_destroy_direct_tirs;
|
|
|
|
|
|
|
|
err = mlx5e_create_direct_tirs(priv, priv->xsk_tir);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto err_destroy_xsk_rqts;
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
err = mlx5e_create_flow_steering(priv);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(mdev, "create flow steering failed, %d\n", err);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
goto err_destroy_xsk_tirs;
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
2018-04-10 09:34:36 -06:00
|
|
|
err = mlx5e_tc_nic_init(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
if (err)
|
|
|
|
goto err_destroy_flow_steering;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_destroy_flow_steering:
|
|
|
|
mlx5e_destroy_flow_steering(priv);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
err_destroy_xsk_tirs:
|
|
|
|
mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
|
|
|
|
err_destroy_xsk_rqts:
|
|
|
|
mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
|
2016-07-01 05:51:07 -06:00
|
|
|
err_destroy_direct_tirs:
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
|
2016-07-01 05:51:07 -06:00
|
|
|
err_destroy_indirect_tirs:
|
2018-08-28 11:53:55 -06:00
|
|
|
mlx5e_destroy_indirect_tirs(priv, true);
|
2016-07-01 05:51:07 -06:00
|
|
|
err_destroy_direct_rqts:
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
|
2016-07-01 05:51:07 -06:00
|
|
|
err_destroy_indirect_rqts:
|
|
|
|
mlx5e_destroy_rqt(priv, &priv->indir_rqt);
|
2018-08-04 21:58:05 -06:00
|
|
|
err_close_drop_rq:
|
|
|
|
mlx5e_close_drop_rq(&priv->drop_rq);
|
|
|
|
err_destroy_q_counters:
|
|
|
|
mlx5e_destroy_q_counters(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_cleanup_nic_rx(struct mlx5e_priv *priv)
|
|
|
|
{
|
2018-04-10 09:34:36 -06:00
|
|
|
mlx5e_tc_nic_cleanup(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
mlx5e_destroy_flow_steering(priv);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_direct_tirs(priv, priv->xsk_tir);
|
|
|
|
mlx5e_destroy_direct_rqts(priv, priv->xsk_tir);
|
|
|
|
mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
|
2018-08-28 11:53:55 -06:00
|
|
|
mlx5e_destroy_indirect_tirs(priv, true);
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
|
2016-07-01 05:51:07 -06:00
|
|
|
mlx5e_destroy_rqt(priv, &priv->indir_rqt);
|
2018-08-04 21:58:05 -06:00
|
|
|
mlx5e_close_drop_rq(&priv->drop_rq);
|
|
|
|
mlx5e_destroy_q_counters(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static int mlx5e_init_nic_tx(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = mlx5e_create_tises(priv);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_warn(priv->mdev, "create tises failed, %d\n", err);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
2016-11-27 08:02:07 -07:00
|
|
|
mlx5e_dcbnl_initialize(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
#endif
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_nic_enable(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
2017-04-12 21:36:54 -06:00
|
|
|
|
|
|
|
mlx5e_init_l2_addr(priv);
|
|
|
|
|
2017-02-05 08:57:40 -07:00
|
|
|
/* Mark the link as currently not needed by the driver */
|
|
|
|
if (!netif_running(netdev))
|
|
|
|
mlx5_set_port_admin_status(mdev, MLX5_PORT_DOWN);
|
|
|
|
|
2019-01-22 04:42:10 -07:00
|
|
|
mlx5e_set_netdev_mtu_boundaries(priv);
|
2017-04-12 21:36:54 -06:00
|
|
|
mlx5e_set_dev_port_mtu(priv);
|
2016-07-01 05:51:07 -06:00
|
|
|
|
net/mlx5: Implement RoCE LAG feature
Available on dual-port cards only, this feature keeps
track, using netdev LAG events, of the bonding
and link status of each port's PF netdev.
When both of the card's PF netdevs, and only them, are
enslaved to the same bond/team master, the LAG state
is active.
During LAG, only one IB device is present for both ports.
In addition to the above, this commit includes FW commands
used for managing the LAG, new facilities for adding and removing
a single device by interface, and port remap functionality according to
bond events.
Please note that this feature is currently used only for mimicking
Ethernet bonding for RoCE - the netdevs' functionality is not altered,
and their bonding continues to be managed solely by the bond/team driver.
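The activation rule can be pictured roughly as below; the tracker struct
and names are illustrative only:

#include <linux/netdevice.h>

struct lag_tracker_sketch {
	struct net_device *upper;	/* common bond/team master, if any */
	bool pf_enslaved[2];		/* is each port's PF netdev enslaved to it? */
	int other_slaves;		/* slaves that are not our PF netdevs */
};

static bool lag_should_activate_sketch(const struct lag_tracker_sketch *t)
{
	/* active only when both PF netdevs, and only them, share one master */
	return t->upper && t->pf_enslaved[0] && t->pf_enslaved[1] &&
	       !t->other_slaves;
}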
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2016-04-17 07:57:32 -06:00
|
|
|
mlx5_lag_add(mdev, netdev);
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
mlx5e_enable_async_events(priv);
|
2018-10-20 07:18:00 -06:00
|
|
|
if (mlx5e_monitor_counter_supported(priv))
|
|
|
|
mlx5e_monitor_counter_init(priv);
|
2016-07-01 05:51:08 -06:00
|
|
|
|
2019-08-21 23:06:00 -06:00
|
|
|
mlx5e_hv_vhca_stats_create(priv);
|
2016-12-28 05:58:41 -07:00
|
|
|
if (netdev->reg_state != NETREG_REGISTERED)
|
|
|
|
return;
|
2017-07-18 15:23:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
mlx5e_dcbnl_init_app(priv);
|
|
|
|
#endif
|
2016-12-28 05:58:41 -07:00
|
|
|
|
|
|
|
queue_work(priv->wq, &priv->set_rx_mode_work);
|
2017-04-12 21:36:54 -06:00
|
|
|
|
|
|
|
rtnl_lock();
|
|
|
|
if (netif_running(netdev))
|
|
|
|
mlx5e_open(netdev);
|
|
|
|
netif_device_attach(netdev);
|
|
|
|
rtnl_unlock();
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_nic_disable(struct mlx5e_priv *priv)
|
|
|
|
{
|
2017-01-10 13:33:37 -07:00
|
|
|
struct mlx5_core_dev *mdev = priv->mdev;
|
|
|
|
|
2017-07-18 15:23:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
if (priv->netdev->reg_state == NETREG_REGISTERED)
|
|
|
|
mlx5e_dcbnl_delete_app(priv);
|
|
|
|
#endif
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
rtnl_lock();
|
|
|
|
if (netif_running(priv->netdev))
|
|
|
|
mlx5e_close(priv->netdev);
|
|
|
|
netif_device_detach(priv->netdev);
|
|
|
|
rtnl_unlock();
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
queue_work(priv->wq, &priv->set_rx_mode_work);
|
2017-04-24 03:36:42 -06:00
|
|
|
|
2019-08-21 23:06:00 -06:00
|
|
|
mlx5e_hv_vhca_stats_destroy(priv);
|
2018-10-20 07:18:00 -06:00
|
|
|
if (mlx5e_monitor_counter_supported(priv))
|
|
|
|
mlx5e_monitor_counter_cleanup(priv);
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
mlx5e_disable_async_events(priv);
|
2017-01-10 13:33:37 -07:00
|
|
|
mlx5_lag_remove(mdev);
|
2016-07-01 05:51:07 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Don't refresh TIRs when updating representor SQs
Refreshing TIRs is done in order to update the TIRs with the current
state of SQs in the transport domain, so that the TIRs can filter out
undesired self-loopback packets based on the source SQ of the packet.
Representor TIRs will only receive packets that originate from their
associated vport, due to dedicated steering, and therefore will never
receive self-loopback packets, whose source vport is the vport of the
E-Switch manager rather than the vport associated with the representor.
As such, it is not necessary to refresh the representors'
TIRs, since self-loopback packets can't reach them.
Since representors only exist in switchdev mode, and there is no
scenario in which a representor will exist in the transport domain
alongside a non-representor, it is not necessary to refresh the
transport domain's TIRs upon changing the state of a representor's
queues. Therefore, do not refresh TIRs upon such a change. Achieve
this by adding an update_rx callback to the mlx5e_profile, which
refreshes TIRs for non-representors and does nothing for representors,
and replace instances of mlx5e_refresh_tirs() upon changing the state
of the queues with update_rx().
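For contrast with mlx5e_update_nic_rx() below, the representor-side
callback amounts to a no-op; this is a hedged sketch, not the actual
representor code:

static int update_rep_rx_sketch(struct mlx5e_priv *priv)
{
	return 0;	/* representors never see self-loopback traffic, nothing to refresh */
}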
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-23 00:58:56 -06:00
|
|
|
int mlx5e_update_nic_rx(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
return mlx5e_refresh_tirs(priv, false);
|
|
|
|
}
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
static const struct mlx5e_profile mlx5e_nic_profile = {
|
|
|
|
.init = mlx5e_nic_init,
|
|
|
|
.cleanup = mlx5e_nic_cleanup,
|
|
|
|
.init_rx = mlx5e_init_nic_rx,
|
|
|
|
.cleanup_rx = mlx5e_cleanup_nic_rx,
|
|
|
|
.init_tx = mlx5e_init_nic_tx,
|
|
|
|
.cleanup_tx = mlx5e_cleanup_nic_tx,
|
|
|
|
.enable = mlx5e_nic_enable,
|
|
|
|
.disable = mlx5e_nic_disable,
|
2019-05-23 00:58:56 -06:00
|
|
|
.update_rx = mlx5e_update_nic_rx,
|
2017-05-10 06:10:33 -06:00
|
|
|
.update_stats = mlx5e_update_ndo_stats,
|
2017-05-18 05:32:11 -06:00
|
|
|
.update_carrier = mlx5e_update_carrier,
|
2017-04-12 21:37:03 -06:00
|
|
|
.rx_handlers.handle_rx_cqe = mlx5e_handle_rx_cqe,
|
|
|
|
.rx_handlers.handle_rx_cqe_mpwqe = mlx5e_handle_rx_cqe_mpwrq,
|
2016-07-01 05:51:07 -06:00
|
|
|
.max_tc = MLX5E_MAX_NUM_TC,
|
2019-07-14 02:43:43 -06:00
|
|
|
.rq_groups = MLX5E_NUM_RQ_GROUPS(XSK),
|
2016-07-01 05:51:07 -06:00
|
|
|
};
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
/* mlx5e generic netdev management API (move to en_common.c) */
|
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
/* mlx5e_netdev_init/cleanup must be called from profile->init/cleanup callbacks */
|
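A hedged usage sketch of the helpers defined below, with an illustrative
profile callback name (only the call into mlx5e_netdev_init() mirrors
the real contract):

static int example_profile_init_sketch(struct mlx5_core_dev *mdev,
				       struct net_device *netdev,
				       const struct mlx5e_profile *profile,
				       void *ppriv)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);

	/* runs from profile->init; the matching cleanup callback ends with
	 * mlx5e_netdev_cleanup(netdev, priv)
	 */
	return mlx5e_netdev_init(netdev, priv, mdev, profile, ppriv);
}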
2018-09-12 16:02:05 -06:00
|
|
|
int mlx5e_netdev_init(struct net_device *netdev,
|
|
|
|
struct mlx5e_priv *priv,
|
|
|
|
struct mlx5_core_dev *mdev,
|
|
|
|
const struct mlx5e_profile *profile,
|
|
|
|
void *ppriv)
|
2018-10-02 00:54:59 -06:00
|
|
|
{
|
2018-09-12 16:02:05 -06:00
|
|
|
/* priv init */
|
|
|
|
priv->mdev = mdev;
|
|
|
|
priv->netdev = netdev;
|
|
|
|
priv->profile = profile;
|
|
|
|
priv->ppriv = ppriv;
|
|
|
|
priv->msglevel = MLX5E_MSG_LEVEL;
|
2019-07-14 02:43:43 -06:00
|
|
|
priv->max_nch = netdev->num_rx_queues / max_t(u8, profile->rq_groups, 1);
|
2018-09-12 16:02:05 -06:00
|
|
|
priv->max_opened_tc = 1;
|
2018-10-02 00:54:59 -06:00
|
|
|
|
2018-09-12 16:02:05 -06:00
|
|
|
mutex_init(&priv->state_lock);
|
|
|
|
INIT_WORK(&priv->update_carrier_work, mlx5e_update_carrier_work);
|
|
|
|
INIT_WORK(&priv->set_rx_mode_work, mlx5e_set_rx_mode_work);
|
|
|
|
INIT_WORK(&priv->tx_timeout_work, mlx5e_tx_timeout_work);
|
2018-09-12 00:45:33 -06:00
|
|
|
INIT_WORK(&priv->update_stats_work, mlx5e_update_stats_work);
|
2018-10-09 04:06:02 -06:00
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
priv->wq = create_singlethread_workqueue("mlx5e");
|
|
|
|
if (!priv->wq)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2018-09-12 16:02:05 -06:00
|
|
|
/* netdev init */
|
|
|
|
netif_carrier_off(netdev);
|
|
|
|
|
|
|
|
#ifdef CONFIG_MLX5_EN_ARFS
|
2018-11-19 11:52:38 -07:00
|
|
|
netdev->rx_cpu_rmap = mlx5_eq_table_get_rmap(mdev);
|
2018-09-12 16:02:05 -06:00
|
|
|
#endif
|
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void mlx5e_netdev_cleanup(struct net_device *netdev, struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
destroy_workqueue(priv->wq);
|
|
|
|
}
|
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
struct net_device *mlx5e_create_netdev(struct mlx5_core_dev *mdev,
|
|
|
|
const struct mlx5e_profile *profile,
|
2018-09-06 05:56:56 -06:00
|
|
|
int nch,
|
2016-09-09 08:35:25 -06:00
|
|
|
void *ppriv)
|
2015-05-28 13:28:48 -06:00
|
|
|
{
|
|
|
|
struct net_device *netdev;
|
2018-10-02 00:54:59 -06:00
|
|
|
int err;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-02-22 09:17:26 -07:00
|
|
|
netdev = alloc_etherdev_mqs(sizeof(struct mlx5e_priv),
|
2016-07-01 05:51:07 -06:00
|
|
|
nch * profile->max_tc,
|
2019-07-14 02:43:43 -06:00
|
|
|
nch * profile->rq_groups);
|
2015-05-28 13:28:48 -06:00
|
|
|
if (!netdev) {
|
|
|
|
mlx5_core_err(mdev, "alloc_etherdev_mqs() failed\n");
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
err = profile->init(mdev, netdev, profile, ppriv);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_err(mdev, "failed to init mlx5e profile %d\n", err);
|
|
|
|
goto err_free_netdev;
|
|
|
|
}
|
2016-09-09 08:35:25 -06:00
|
|
|
|
|
|
|
return netdev;
|
|
|
|
|
2018-10-02 00:54:59 -06:00
|
|
|
err_free_netdev:
|
2016-09-09 08:35:25 -06:00
|
|
|
free_netdev(netdev);
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
int mlx5e_attach_netdev(struct mlx5e_priv *priv)
|
2016-09-09 08:35:25 -06:00
|
|
|
{
|
|
|
|
const struct mlx5e_profile *profile;
|
2018-10-16 14:20:20 -06:00
|
|
|
int max_nch;
|
2016-09-09 08:35:25 -06:00
|
|
|
int err;
|
|
|
|
|
|
|
|
profile = priv->profile;
|
|
|
|
clear_bit(MLX5E_STATE_DESTROYING, &priv->state);
|
2016-05-01 13:59:56 -06:00
|
|
|
|
2018-10-16 14:20:20 -06:00
|
|
|
/* max number of channels may have changed */
|
|
|
|
max_nch = mlx5e_get_max_num_channels(priv->mdev);
|
|
|
|
if (priv->channels.params.num_channels > max_nch) {
|
|
|
|
mlx5_core_warn(priv->mdev, "MLX5E: Reducing number of channels to %d\n", max_nch);
|
|
|
|
priv->channels.params.num_channels = max_nch;
|
2018-11-06 12:05:29 -07:00
|
|
|
mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
|
2018-10-16 14:20:20 -06:00
|
|
|
MLX5E_INDIR_RQT_SIZE, max_nch);
|
|
|
|
}
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
err = profile->init_tx(priv);
|
|
|
|
if (err)
|
2016-11-30 08:59:39 -07:00
|
|
|
goto out;
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
err = profile->init_rx(priv);
|
|
|
|
if (err)
|
2018-08-04 21:58:05 -06:00
|
|
|
goto err_cleanup_tx;
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
if (profile->enable)
|
|
|
|
profile->enable(priv);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
return 0;
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2018-08-04 21:58:05 -06:00
|
|
|
err_cleanup_tx:
|
2016-07-01 05:51:07 -06:00
|
|
|
profile->cleanup_tx(priv);
|
2015-08-04 05:05:43 -06:00
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
out:
|
|
|
|
return err;
|
2015-05-28 13:28:48 -06:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
void mlx5e_detach_netdev(struct mlx5e_priv *priv)
|
2016-09-09 08:35:25 -06:00
|
|
|
{
|
|
|
|
const struct mlx5e_profile *profile = priv->profile;
|
|
|
|
|
|
|
|
set_bit(MLX5E_STATE_DESTROYING, &priv->state);
|
|
|
|
|
2016-12-28 05:58:42 -07:00
|
|
|
if (profile->disable)
|
|
|
|
profile->disable(priv);
|
|
|
|
flush_workqueue(priv->wq);
|
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
profile->cleanup_rx(priv);
|
|
|
|
profile->cleanup_tx(priv);
|
2018-09-12 00:45:33 -06:00
|
|
|
cancel_work_sync(&priv->update_stats_work);
|
2016-09-09 08:35:25 -06:00
|
|
|
}
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
void mlx5e_destroy_netdev(struct mlx5e_priv *priv)
|
|
|
|
{
|
|
|
|
const struct mlx5e_profile *profile = priv->profile;
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
|
|
|
|
if (profile->cleanup)
|
|
|
|
profile->cleanup(priv);
|
|
|
|
free_netdev(netdev);
|
|
|
|
}
|
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
/* mlx5e_attach and mlx5e_detach scope should be limited to creating/destroying
|
|
|
|
 * hardware contexts and connecting them to the current netdev.
|
|
|
|
*/
|
|
|
|
static int mlx5e_attach(struct mlx5_core_dev *mdev, void *vpriv)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = vpriv;
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
if (netif_device_present(netdev))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
err = mlx5e_create_mdev_resources(mdev);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
err = mlx5e_attach_netdev(priv);
|
2016-09-09 08:35:25 -06:00
|
|
|
if (err) {
|
|
|
|
mlx5e_destroy_mdev_resources(mdev);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_detach(struct mlx5_core_dev *mdev, void *vpriv)
|
|
|
|
{
|
|
|
|
struct mlx5e_priv *priv = vpriv;
|
|
|
|
struct net_device *netdev = priv->netdev;
|
|
|
|
|
2019-05-26 02:56:27 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
|
|
|
if (MLX5_ESWITCH_MANAGER(mdev) && vpriv == mdev)
|
|
|
|
return;
|
|
|
|
#endif
|
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
if (!netif_device_present(netdev))
|
|
|
|
return;
|
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
mlx5e_detach_netdev(priv);
|
2016-09-09 08:35:25 -06:00
|
|
|
mlx5e_destroy_mdev_resources(mdev);
|
|
|
|
}
|
|
|
|
|
2016-07-01 05:51:04 -06:00
|
|
|
static void *mlx5e_add(struct mlx5_core_dev *mdev)
|
|
|
|
{
|
2017-06-06 00:12:04 -06:00
|
|
|
struct net_device *netdev;
|
2016-09-09 08:35:25 -06:00
|
|
|
void *priv;
|
|
|
|
int err;
|
2018-09-06 05:56:56 -06:00
|
|
|
int nch;
|
2016-07-01 05:51:04 -06:00
|
|
|
|
2016-09-09 08:35:25 -06:00
|
|
|
err = mlx5e_check_required_hca_cap(mdev);
|
|
|
|
if (err)
|
2016-07-01 05:51:04 -06:00
|
|
|
return NULL;
|
|
|
|
|
2018-11-07 07:34:52 -07:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
|
|
|
if (MLX5_ESWITCH_MANAGER(mdev) &&
|
2019-06-28 16:36:15 -06:00
|
|
|
mlx5_eswitch_mode(mdev->priv.eswitch) == MLX5_ESWITCH_OFFLOADS) {
|
2018-11-07 07:34:52 -07:00
|
|
|
mlx5e_rep_register_vport_reps(mdev);
|
|
|
|
return mdev;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-09-06 05:56:56 -06:00
|
|
|
nch = mlx5e_get_max_num_channels(mdev);
|
2018-02-13 09:14:42 -07:00
|
|
|
netdev = mlx5e_create_netdev(mdev, &mlx5e_nic_profile, nch, NULL);
|
2016-09-09 08:35:25 -06:00
|
|
|
if (!netdev) {
|
|
|
|
mlx5_core_err(mdev, "mlx5e_create_netdev failed\n");
|
2018-02-13 09:14:42 -07:00
|
|
|
return NULL;
|
2016-09-09 08:35:25 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
priv = netdev_priv(netdev);
|
|
|
|
|
|
|
|
err = mlx5e_attach(mdev, priv);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_err(mdev, "mlx5e_attach failed, %d\n", err);
|
|
|
|
goto err_destroy_netdev;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = register_netdev(netdev);
|
|
|
|
if (err) {
|
|
|
|
mlx5_core_err(mdev, "register_netdev failed, %d\n", err);
|
|
|
|
goto err_detach;
|
2016-07-01 05:51:04 -06:00
|
|
|
}
|
2016-09-09 08:35:25 -06:00
|
|
|
|
2017-07-18 15:23:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
mlx5e_dcbnl_init_app(priv);
|
|
|
|
#endif
|
2016-09-09 08:35:25 -06:00
|
|
|
return priv;
|
|
|
|
|
|
|
|
err_detach:
|
|
|
|
mlx5e_detach(mdev, priv);
|
|
|
|
err_destroy_netdev:
|
2017-04-12 21:36:54 -06:00
|
|
|
mlx5e_destroy_netdev(priv);
|
2016-09-09 08:35:25 -06:00
|
|
|
return NULL;
|
2016-07-01 05:51:04 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static void mlx5e_remove(struct mlx5_core_dev *mdev, void *vpriv)
|
|
|
|
{
|
2018-11-07 07:34:52 -07:00
|
|
|
struct mlx5e_priv *priv;
|
2016-07-01 05:51:08 -06:00
|
|
|
|
2018-11-07 07:34:52 -07:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
|
|
|
if (MLX5_ESWITCH_MANAGER(mdev) && vpriv == mdev) {
|
|
|
|
mlx5e_rep_unregister_vport_reps(mdev);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
priv = vpriv;
|
2017-07-18 15:23:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
mlx5e_dcbnl_delete_app(priv);
|
|
|
|
#endif
|
2016-10-25 09:36:30 -06:00
|
|
|
unregister_netdev(priv->netdev);
|
2016-09-09 08:35:25 -06:00
|
|
|
mlx5e_detach(mdev, vpriv);
|
2017-04-12 21:36:54 -06:00
|
|
|
mlx5e_destroy_netdev(priv);
|
2016-07-01 05:51:04 -06:00
|
|
|
}
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
static struct mlx5_interface mlx5e_interface = {
|
2016-07-01 05:51:04 -06:00
|
|
|
.add = mlx5e_add,
|
|
|
|
.remove = mlx5e_remove,
|
2016-09-09 08:35:25 -06:00
|
|
|
.attach = mlx5e_attach,
|
|
|
|
.detach = mlx5e_detach,
|
2015-05-28 13:28:48 -06:00
|
|
|
.protocol = MLX5_INTERFACE_PROTOCOL_ETH,
|
|
|
|
};
|
|
|
|
|
|
|
|
void mlx5e_init(void)
|
|
|
|
{
|
2017-04-18 07:08:23 -06:00
|
|
|
mlx5e_ipsec_build_inverse_table();
|
2016-06-23 08:02:45 -06:00
|
|
|
mlx5e_build_ptys2ethtool_map();
|
2015-05-28 13:28:48 -06:00
|
|
|
mlx5_register_interface(&mlx5e_interface);
|
|
|
|
}
|
|
|
|
|
|
|
|
void mlx5e_cleanup(void)
|
|
|
|
{
|
|
|
|
mlx5_unregister_interface(&mlx5e_interface);
|
|
|
|
}
|