/*
 * Copyright (c) 2015-2016, Mellanox Technologies. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 *     Redistribution and use in source and binary forms, with or
 *     without modification, are permitted provided that the following
 *     conditions are met:
 *
 *      - Redistributions of source code must retain the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer.
 *
 *      - Redistributions in binary form must reproduce the above
 *        copyright notice, this list of conditions and the following
 *        disclaimer in the documentation and/or other materials
 *        provided with the distribution.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
 * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
 * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */
#ifndef __MLX5_EN_H__
#define __MLX5_EN_H__

#include <linux/if_vlan.h>
#include <linux/etherdevice.h>
#include <linux/timecounter.h>
#include <linux/net_tstamp.h>
#include <linux/ptp_clock_kernel.h>
#include <linux/crash_dump.h>
#include <linux/mlx5/driver.h>
#include <linux/mlx5/qp.h>
#include <linux/mlx5/cq.h>
#include <linux/mlx5/port.h>
#include <linux/mlx5/vport.h>
#include <linux/mlx5/transobj.h>
#include <linux/mlx5/fs.h>
#include <linux/rhashtable.h>
#include <net/switchdev.h>
#include <net/xdp.h>
#include <linux/dim.h>
#include <linux/bits.h>
#include "wq.h"
#include "mlx5_core.h"
#include "en_stats.h"
#include "en/fs.h"
#include "lib/hv_vhca.h"

extern const struct net_device_ops mlx5e_netdev_ops;
struct page_pool;

#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
#define MLX5E_METADATA_ETHER_LEN 8

#define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)

#define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)

#define MLX5E_HW2SW_MTU(params, hwmtu) ((hwmtu) - ((params)->hard_mtu))
#define MLX5E_SW2HW_MTU(params, swmtu) ((swmtu) + ((params)->hard_mtu))
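
/* Worked example (illustrative, not part of the driver API): for a plain
 * Ethernet port, params->hard_mtu is MLX5E_ETH_HARD_MTU = ETH_HLEN (14) +
 * VLAN_HLEN (4) + ETH_FCS_LEN (4) = 22 bytes of overhead, so
 * MLX5E_SW2HW_MTU(params, 1500) = 1522 and MLX5E_HW2SW_MTU() inverts it.
 */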

#define MLX5E_MAX_PRIORITY 8
#define MLX5E_MAX_DSCP 64
#define MLX5E_MAX_NUM_TC 8

#define MLX5_RX_HEADROOM NET_SKB_PAD
#define MLX5_SKB_FRAG_SZ(len)	(SKB_DATA_ALIGN(len) + \
				 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
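
/* Rough sizing sketch (assumed x86_64 config, 64-byte cache lines):
 * SKB_DATA_ALIGN(MLX5_RX_HEADROOM) = 64, and the cache-aligned
 * struct skb_shared_info adds roughly 320 bytes on 64-bit builds, so
 * MLX5_SKB_FRAG_SZ(MLX5_RX_HEADROOM) comes to about 384 bytes.
 */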

#define MLX5E_RX_MAX_HEAD (256)

#define MLX5_MPWRQ_MIN_LOG_STRIDE_SZ(mdev) \
	(6 + MLX5_CAP_GEN(mdev, cache_line_128byte)) /* HW restriction */
#define MLX5_MPWRQ_LOG_STRIDE_SZ(mdev, req) \
	max_t(u32, MLX5_MPWRQ_MIN_LOG_STRIDE_SZ(mdev), req)
#define MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev) \
	MLX5_MPWRQ_LOG_STRIDE_SZ(mdev, order_base_2(MLX5E_RX_MAX_HEAD))
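
/* Worked example (illustrative): on a device without 128-byte cache lines,
 * cache_line_128byte = 0 and the minimum log stride size is 6 (64B).
 * The default becomes max(6, order_base_2(256)) = 8, i.e. 256B strides,
 * so a maximal MLX5E_RX_MAX_HEAD packet header never crosses a stride
 * (and hence a page) boundary.
 */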

#define MLX5_MPWRQ_LOG_WQE_SZ 18
#define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
				   MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
#define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
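
/* Worked example (assuming 4KB pages, PAGE_SHIFT = 12): a multi-packet
 * WQE spans 2^18 = 256KB, so MLX5_MPWRQ_WQE_PAGE_ORDER = 18 - 12 = 6 and
 * MLX5_MPWRQ_PAGES_PER_WQE = BIT(6) = 64 order-0 pages per WQE.
 */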

#define MLX5_MTT_OCTW(npages) (ALIGN(npages, 8) / 2)
#define MLX5E_REQUIRED_WQE_MTTS (ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8))
#define MLX5E_LOG_ALIGNED_MPWQE_PPW (ilog2(MLX5E_REQUIRED_WQE_MTTS))
#define MLX5E_REQUIRED_MTTS(wqes) (wqes * MLX5E_REQUIRED_WQE_MTTS)
#define MLX5E_MAX_RQ_NUM_MTTS \
	((1 << 16) * 2) /* So that MLX5_MTT_OCTW(num_mtts) fits into u16 */
#define MLX5E_ORDER2_MAX_PACKET_MTU (order_base_2(10 * 1024))
#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW \
	(ilog2(MLX5E_MAX_RQ_NUM_MTTS / MLX5E_REQUIRED_WQE_MTTS))
#define MLX5E_LOG_MAX_RQ_NUM_PACKETS_MPW \
	(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW + \
	 (MLX5_MPWRQ_LOG_WQE_SZ - MLX5E_ORDER2_MAX_PACKET_MTU))
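
/* Worked example (assuming 4KB pages): MLX5E_REQUIRED_WQE_MTTS =
 * ALIGN(64, 8) = 64 MTTs per WQE, so the striding RQ is capped at
 * ilog2(131072 / 64) = 11, i.e. at most 2048 multi-packet WQEs, keeping
 * MLX5_MTT_OCTW(num_mtts) representable in a u16.
 */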

#define MLX5E_MIN_SKB_FRAG_SZ (MLX5_SKB_FRAG_SZ(MLX5_RX_HEADROOM))
#define MLX5E_LOG_MAX_RX_WQE_BULK \
	(ilog2(PAGE_SIZE / roundup_pow_of_two(MLX5E_MIN_SKB_FRAG_SZ)))
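
/* Worked example (assumed config: 4KB pages, MLX5E_MIN_SKB_FRAG_SZ
 * rounding up to 512 bytes): MLX5E_LOG_MAX_RX_WQE_BULK =
 * ilog2(4096 / 512) = 3, so legacy-RQ WQEs are bulked in groups of up
 * to 8 and the minimum log RQ size below becomes 1 + 3 = 4 (16 WQEs).
 */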

#define MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE 0x6
#define MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE 0xa
#define MLX5E_PARAMS_MAXIMUM_LOG_SQ_SIZE 0xd

#define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE (1 + MLX5E_LOG_MAX_RX_WQE_BULK)
#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE 0xa
#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE min_t(u8, 0xd, \
					       MLX5E_LOG_MAX_RQ_NUM_PACKETS_MPW)

#define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x2

#define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ (64 * 1024)
#define MLX5E_DEFAULT_LRO_TIMEOUT 32
#define MLX5E_LRO_TIMEOUT_ARR_SIZE 4

#define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC 0x10
#define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE 0x3
#define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS 0x20
#define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC 0x10
#define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC_FROM_CQE 0x10
#define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS 0x20
#define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES 0x80
#define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW 0x2

#define MLX5E_LOG_INDIR_RQT_SIZE 0x7
#define MLX5E_INDIR_RQT_SIZE BIT(MLX5E_LOG_INDIR_RQT_SIZE)
#define MLX5E_MIN_NUM_CHANNELS 0x1
#define MLX5E_MAX_NUM_CHANNELS (MLX5E_INDIR_RQT_SIZE >> 1)
#define MLX5E_MAX_NUM_SQS (MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC)
#define MLX5E_TX_CQ_POLL_BUDGET 128
#define MLX5E_TX_XSK_POLL_BUDGET 64
#define MLX5E_SQ_RECOVER_MIN_INTERVAL 500 /* msecs */

#define MLX5E_UMR_WQE_INLINE_SZ \
	(sizeof(struct mlx5e_umr_wqe) + \
	 ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(struct mlx5_mtt), \
	       MLX5_UMR_MTT_ALIGNMENT))
#define MLX5E_UMR_WQEBBS \
	(DIV_ROUND_UP(MLX5E_UMR_WQE_INLINE_SZ, MLX5_SEND_WQE_BB))
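
/* Rough sizing sketch (assumed: 4KB pages, 8-byte MTT entries, 64-byte
 * WQE basic blocks): the fixed UMR WQE part is 128 bytes and the inline
 * MTT array is ALIGN(64 * 8, MLX5_UMR_MTT_ALIGNMENT) = 512 bytes, giving
 * MLX5E_UMR_WQE_INLINE_SZ = 640 and MLX5E_UMR_WQEBBS = 10.
 */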

#define MLX5E_MSG_LEVEL NETIF_MSG_LINK

#define mlx5e_dbg(mlevel, priv, format, ...) \
do { \
	if (NETIF_MSG_##mlevel & (priv)->msglevel) \
		netdev_warn(priv->netdev, format, \
			    ##__VA_ARGS__); \
} while (0)
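
/* Usage sketch (hypothetical call site): with NETIF_MSG_HW set in
 * priv->msglevel, the following line is logged via netdev_warn():
 *
 *	mlx5e_dbg(HW, priv, "configured %d channels\n", num_ch);
 */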

enum mlx5e_rq_group {
	MLX5E_RQ_GROUP_REGULAR,
	MLX5E_RQ_GROUP_XSK,
#define MLX5E_NUM_RQ_GROUPS(g) (1 + MLX5E_RQ_GROUP_##g)
};
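
/* Example: MLX5E_NUM_RQ_GROUPS(XSK) evaluates to 2, sizing per-channel RQ
 * bookkeeping so the regular RQ and the XSK RQ each get a slot; the RQ ID
 * namespace is split into a regular half and an XSK half accordingly.
 */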

static inline u8 mlx5e_get_num_lag_ports(struct mlx5_core_dev *mdev)
{
	if (mlx5_lag_is_lacp_owner(mdev))
		return 1;

	return clamp_t(u8, MLX5_CAP_GEN(mdev, num_lag_ports), 1, MLX5_MAX_PORTS);
}

static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
{
	switch (wq_type) {
	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
		return min_t(u16, MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW,
			     wq_size / 2);
	default:
		return min_t(u16, MLX5E_PARAMS_DEFAULT_MIN_RX_WQES,
			     wq_size / 2);
	}
}
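
/* Example: a legacy (cyclic) RQ of 1024 WQEs yields min(0x80, 512) = 128
 * WQEs to post before the RQ is considered ready, while a striding RQ
 * needs only min(0x2, wq_size / 2).
 */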

/* Use this function to get max num channels (rxqs/txqs) only to create netdev */
static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
{
	return is_kdump_kernel() ?
		MLX5E_MIN_NUM_CHANNELS :
		min_t(int, mlx5_comp_vectors_count(mdev), MLX5E_MAX_NUM_CHANNELS);
}
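
/* Example: with MLX5E_LOG_INDIR_RQT_SIZE = 0x7, MLX5E_INDIR_RQT_SIZE is
 * 128 and MLX5E_MAX_NUM_CHANNELS is 64, so even a device exposing more
 * completion vectors is capped at 64 channels (and a kdump kernel gets
 * MLX5E_MIN_NUM_CHANNELS = 1).
 */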

struct mlx5e_tx_wqe {
	struct mlx5_wqe_ctrl_seg ctrl;
	union {
		struct {
			struct mlx5_wqe_eth_seg  eth;
			struct mlx5_wqe_data_seg data[0];
		};
		u8 tls_progress_params_ctx[0];
	};
};

struct mlx5e_rx_wqe_ll {
	struct mlx5_wqe_srq_next_seg next;
	struct mlx5_wqe_data_seg     data[0];
};

struct mlx5e_rx_wqe_cyc {
	struct mlx5_wqe_data_seg     data[0];
};

struct mlx5e_umr_wqe {
	struct mlx5_wqe_ctrl_seg     ctrl;
	struct mlx5_wqe_umr_ctrl_seg uctrl;
	struct mlx5_mkey_seg         mkc;
	union {
		struct mlx5_mtt inline_mtts[0];
		u8 tls_static_params_ctx[0];
	};
};

extern const char mlx5e_self_tests[][ETH_GSTRING_LEN];

enum mlx5e_priv_flag {
	MLX5E_PFLAG_RX_CQE_BASED_MODER,
	MLX5E_PFLAG_TX_CQE_BASED_MODER,
	MLX5E_PFLAG_RX_CQE_COMPRESS,
	MLX5E_PFLAG_RX_STRIDING_RQ,
	MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
	MLX5E_PFLAG_XDP_TX_MPWQE,
	MLX5E_NUM_PFLAGS, /* Keep last */
};

#define MLX5E_SET_PFLAG(params, pflag, enable) \
	do { \
		if (enable) \
			(params)->pflags |= BIT(pflag); \
		else \
			(params)->pflags &= ~(BIT(pflag)); \
	} while (0)

#define MLX5E_GET_PFLAG(params, pflag) (!!((params)->pflags & (BIT(pflag))))
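
/* Usage sketch (hypothetical): given struct mlx5e_params *params,
 *
 *	MLX5E_SET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS, true);
 *	if (MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_CQE_COMPRESS))
 *		enable_cqe_compression();
 *
 * where enable_cqe_compression() is a made-up placeholder.
 */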

#ifdef CONFIG_MLX5_CORE_EN_DCB
#define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
#endif

struct mlx5e_params {
	u8 log_sq_size;
	u8 rq_wq_type;
	u8 log_rq_mtu_frames;
	u16 num_channels;
	u8 num_tc;
	bool rx_cqe_compress_def;
	bool tunneled_offload_en;
	struct dim_cq_moder rx_cq_moderation;
	struct dim_cq_moder tx_cq_moderation;
	bool lro_en;
	u8 tx_min_inline_mode;
	bool vlan_strip_disable;
	bool scatter_fcs_en;
	bool rx_dim_enabled;
	bool tx_dim_enabled;
	u32 lro_timeout;
	u32 pflags;
	struct bpf_prog *xdp_prog;
	struct mlx5e_xsk *xsk;
	unsigned int sw_mtu;
	int hard_mtu;
};

#ifdef CONFIG_MLX5_CORE_EN_DCB
struct mlx5e_cee_config {
	/* bw pct for priority group */
	u8   pg_bw_pct[CEE_DCBX_MAX_PGS];
	u8   prio_to_pg_map[CEE_DCBX_MAX_PRIO];
	bool pfc_setting[CEE_DCBX_MAX_PRIO];
	bool pfc_enable;
};

enum {
	MLX5_DCB_CHG_RESET,
	MLX5_DCB_NO_CHG,
	MLX5_DCB_CHG_NO_RESET,
};

struct mlx5e_dcbx {
	enum mlx5_dcbx_oper_mode  mode;
	struct mlx5e_cee_config   cee_cfg; /* pending configuration */
	u8                        dscp_app_cnt;

	/* The only setting that cannot be read from FW */
	u8                        tc_tsa[IEEE_8021QAZ_MAX_TCS];
	u8                        cap;

	/* Buffer configuration */
	bool                      manual_buffer;
	u32                       cable_len;
	u32                       xoff;
};

struct mlx5e_dcbx_dp {
	u8                        dscp2prio[MLX5E_MAX_DSCP];
	u8                        trust_state;
};
#endif

enum {
	MLX5E_RQ_STATE_ENABLED,
	MLX5E_RQ_STATE_RECOVERING,
	MLX5E_RQ_STATE_AM,
	MLX5E_RQ_STATE_NO_CSUM_COMPLETE,
	MLX5E_RQ_STATE_CSUM_FULL, /* cqe_csum_full hw bit is set */
};

struct mlx5e_cq {
	/* data path - accessed per cqe */
	struct mlx5_cqwq           wq;

	/* data path - accessed per napi poll */
	u16                        event_ctr;
	struct napi_struct        *napi;
	struct mlx5_core_cq        mcq;
	struct mlx5e_channel      *channel;

	/* control */
	struct mlx5_core_dev      *mdev;
	struct mlx5_wq_ctrl        wq_ctrl;
} ____cacheline_aligned_in_smp;

struct mlx5e_cq_decomp {
	/* cqe decompression */
	struct mlx5_cqe64          title;
	struct mlx5_mini_cqe8      mini_arr[MLX5_MINI_CQE_ARRAY_SIZE];
	u8                         mini_arr_idx;
	u16                        left;
	u16                        wqe_counter;
} ____cacheline_aligned_in_smp;

struct mlx5e_tx_wqe_info {
	struct sk_buff *skb;
	u32 num_bytes;
	u8  num_wqebbs;
	u8  num_dma;
#ifdef CONFIG_MLX5_EN_TLS
	struct page *resync_dump_frag_page;
#endif
};

enum mlx5e_dma_map_type {
	MLX5E_DMA_MAP_SINGLE,
	MLX5E_DMA_MAP_PAGE
};

struct mlx5e_sq_dma {
	dma_addr_t              addr;
	u32                     size;
	enum mlx5e_dma_map_type type;
};

enum {
	MLX5E_SQ_STATE_ENABLED,
	MLX5E_SQ_STATE_RECOVERING,
	MLX5E_SQ_STATE_IPSEC,
	MLX5E_SQ_STATE_AM,
	MLX5E_SQ_STATE_TLS,
	MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE,
};

struct mlx5e_sq_wqe_info {
	u8 opcode;

	/* Auxiliary data for different opcodes. */
	union {
		struct {
			struct mlx5e_rq *rq;
		} umr;
	};
};
|
2016-04-20 13:02:12 -06:00
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_txqsq {
|
2017-03-24 15:52:07 -06:00
|
|
|
/* data path */
|
|
|
|
|
|
|
|
/* dirtied @completion */
|
|
|
|
u16 cc;
|
|
|
|
u32 dma_fifo_cc;
|
2019-01-31 07:44:48 -07:00
|
|
|
struct dim dim; /* Adaptive Moderation */
|
2017-03-24 15:52:07 -06:00
|
|
|
|
|
|
|
/* dirtied @xmit */
|
|
|
|
u16 pc ____cacheline_aligned_in_smp;
|
|
|
|
u32 dma_fifo_pc;
|
|
|
|
|
|
|
|
struct mlx5e_cq cq;
|
|
|
|
|
|
|
|
/* read only */
|
|
|
|
struct mlx5_wq_cyc wq;
|
|
|
|
u32 dma_fifo_mask;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_sq_stats *stats;
|
2018-05-22 08:06:38 -06:00
|
|
|
struct {
|
|
|
|
struct mlx5e_sq_dma *dma_fifo;
|
|
|
|
struct mlx5e_tx_wqe_info *wqe_info;
|
|
|
|
} db;
|
2017-03-24 15:52:07 -06:00
|
|
|
void __iomem *uar_map;
|
|
|
|
struct netdev_queue *txq;
|
|
|
|
u32 sqn;
|
2019-07-05 09:30:19 -06:00
|
|
|
u16 stop_room;
|
2017-03-24 15:52:07 -06:00
|
|
|
u8 min_inline_mode;
|
|
|
|
struct device *pdev;
|
|
|
|
__be32 mkey_be;
|
|
|
|
unsigned long state;
|
2019-10-07 05:01:29 -06:00
|
|
|
unsigned int hw_mtu;
|
2017-08-15 04:46:04 -06:00
|
|
|
struct hwtstamp_config *tstamp;
|
|
|
|
struct mlx5_clock *clock;
|
2017-03-24 15:52:07 -06:00
|
|
|
|
|
|
|
/* control path */
|
|
|
|
struct mlx5_wq_ctrl wq_ctrl;
|
|
|
|
struct mlx5e_channel *channel;
|
2019-04-28 01:14:23 -06:00
|
|
|
int ch_ix;
|
2016-12-20 13:48:19 -07:00
|
|
|
int txq_ix;
|
2017-03-24 15:52:07 -06:00
|
|
|
u32 rate_limit;
|
net/mlx5e: Add tx reporter support
Add an mlx5e tx reporter to the devlink health reporters. This reporter is
responsible for diagnosing, reporting and recovering from tx errors.
This patch declares the TX reporter operations and creates it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQEs only. In addition, it adds diagnose
information for the open SQs.
For a local SQ recover (triggered by a driver error report), a failure of
the SQ recover marks the whole recover operation as failed.
For a full tx recover, an attempt is made to close and reopen the
channels; if that succeeds, the recover is considered successful.
The SQ recover from error CQE flow is not a new feature in the driver;
this patch re-organizes the functions and adapts them to the devlink
health API. For this purpose, code is moved from en_main.c to a new file
named reporter_tx.c.
Diagnose output:
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": false
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: false
sqn: 142 HW state: 1 stopped: false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 02:36:40 -07:00
|
|
|
struct work_struct recover_work;
|
2017-03-24 15:52:14 -06:00
|
|
|
} ____cacheline_aligned_in_smp;
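stop_room reserves enough WQEBBs for the largest WQE the SQ may post, so the queue is stopped while a worst-case descriptor still fits. A minimal sketch of the underlying room check on the cyclic WQ, assuming free-running u16 counters (illustrative, not the driver's helper):
/* Sketch only: does the cyclic SQ have room for 'needed' more WQEBBs?
 * pc/cc are free-running; their u16 difference is the in-flight count.
 */
static inline bool mlx5e_sq_has_room_sketch(u16 cc, u16 pc, u16 wq_sz,
					    u16 needed)
{
	u16 in_flight = pc - cc; /* wraps correctly in u16 arithmetic */

	return (u16)(wq_sz - in_flight) >= needed;
}
A producer would stop the txq once the available room drops below stop_room, guaranteeing the next worst-case WQE still fits.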
|
|
|
|
|
2018-07-15 01:34:39 -06:00
|
|
|
struct mlx5e_dma_info {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, which means that two
RQs run simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be re-enabled at any time.
The validation of XSK parameters typically happens when the XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
dma_addr_t addr;
|
|
|
|
union {
|
|
|
|
struct page *page;
|
|
|
|
struct {
|
|
|
|
u64 handle;
|
|
|
|
void *data;
|
|
|
|
} xsk;
|
|
|
|
};
|
2018-07-15 01:34:39 -06:00
|
|
|
};
|
|
|
|
|
net/mlx5e: Refactor struct mlx5e_xdp_info
Currently, struct mlx5e_xdp_info has some issues that have to be cleaned
up before the upcoming AF_XDP support makes things too complicated and
messy. This structure is used both when sending the packet and on
completion. Moreover, the cleanup procedure on completion depends on the
origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will
add new flows that use this structure even differently. To avoid
overcomplicating the code, this commit refactors the usage of this
structure in the following ways:
1. struct mlx5e_xdp_info is split into two different structures. One is
struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to
be stored and is only used while sending the packet. The other is still
struct mlx5e_xdp_info that is stored in a FIFO and contains the fields
needed on completion.
2. The fields of struct mlx5e_xdp_info that are used in different flows
are put into a union. A special enum indicates the cleanup mode and
helps choose the right union member. This approach is clear and
explicit. Although it would be possible to "guess" the mode by looking
at the values of the fields and at the XDP SQ type, that wouldn't be as
clear or extensible and would require looking through the whole chain
to understand what's going on.
For reference, these are the fields of struct mlx5e_xdp_info that
are used in the different flows (including AF_XDP ones):
Packet origin | Fields used on completion | Cleanup steps
-----------------------+---------------------------+------------------
XDP_REDIRECT, | xdpf, dma_addr | DMA unmap and
XDP_TX from XSK RQ | | xdp_return_frame.
-----------------------+---------------------------+------------------
XDP_TX from regular RQ | di | Recycle page.
-----------------------+---------------------------+------------------
AF_XDP TX | (none) | Increment the
| | producer index in
| | Completion Ring.
On send, the same set of mlx5e_xdp_xmit_data fields is used in all
flows: DMA and virtual addresses and length.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:32 -06:00
|
|
|
/* XDP packets can be transmitted in different ways. On completion, we need to
|
|
|
|
* distinguish between them to clean things up properly.
|
|
|
|
*/
|
|
|
|
enum mlx5e_xdp_xmit_mode {
|
|
|
|
/* An xdp_frame was transmitted due to either XDP_REDIRECT from another
|
|
|
|
* device or XDP_TX from an XSK RQ. The frame has to be unmapped and
|
|
|
|
* returned.
|
|
|
|
*/
|
|
|
|
MLX5E_XDP_XMIT_MODE_FRAME,
|
|
|
|
|
|
|
|
/* The xdp_frame was created in place as a result of XDP_TX from a
|
|
|
|
* regular RQ. No DMA remapping happened, and the page belongs to us.
|
|
|
|
*/
|
|
|
|
MLX5E_XDP_XMIT_MODE_PAGE,
|
|
|
|
|
|
|
|
/* No xdp_frame was created at all, the transmit happened from a UMEM
|
|
|
|
* page. The UMEM Completion Ring producer pointer has to be increased.
|
|
|
|
*/
|
|
|
|
MLX5E_XDP_XMIT_MODE_XSK,
|
2018-07-15 01:34:39 -06:00
|
|
|
};
|
|
|
|
|
|
|
|
struct mlx5e_xdp_info {
|
2019-06-26 08:35:32 -06:00
|
|
|
enum mlx5e_xdp_xmit_mode mode;
|
|
|
|
union {
|
|
|
|
struct {
|
|
|
|
struct xdp_frame *xdpf;
|
|
|
|
dma_addr_t dma_addr;
|
|
|
|
} frame;
|
|
|
|
struct {
|
2019-06-26 08:35:33 -06:00
|
|
|
struct mlx5e_rq *rq;
|
2019-06-26 08:35:32 -06:00
|
|
|
struct mlx5e_dma_info di;
|
|
|
|
} page;
|
|
|
|
};
|
|
|
|
};
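As the commit message above describes, mode selects the cleanup path on completion. A minimal sketch of that dispatch; mlx5e_recycle_page() is a hypothetical stand-in for the driver's page-recycling path, while the FRAME and XSK branches use in-kernel APIs (dma_unmap_single(), xdp_return_frame(), xsk_umem_complete_tx()):
/* Hypothetical helper assumed by the sketch below. */
struct mlx5e_rq;
void mlx5e_recycle_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *di);

/* Sketch only: per-mode cleanup on XDP TX completion. pdev and umem come
 * from the owning XDP SQ.
 */
static void mlx5e_xdpi_cleanup_sketch(struct device *pdev,
				      struct xdp_umem *umem,
				      struct mlx5e_xdp_info *xdpi)
{
	switch (xdpi->mode) {
	case MLX5E_XDP_XMIT_MODE_FRAME:
		/* XDP_REDIRECT or XDP_TX from an XSK RQ: unmap and return. */
		dma_unmap_single(pdev, xdpi->frame.dma_addr,
				 xdpi->frame.xdpf->len, DMA_TO_DEVICE);
		xdp_return_frame(xdpi->frame.xdpf);
		break;
	case MLX5E_XDP_XMIT_MODE_PAGE:
		/* XDP_TX from a regular RQ: the page is ours, recycle it. */
		mlx5e_recycle_page(xdpi->page.rq, &xdpi->page.di);
		break;
	case MLX5E_XDP_XMIT_MODE_XSK:
		/* AF_XDP TX: advance the UMEM Completion Ring producer. */
		xsk_umem_complete_tx(umem, 1);
		break;
	}
}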
|
|
|
|
|
|
|
|
struct mlx5e_xdp_xmit_data {
|
|
|
|
dma_addr_t dma_addr;
|
|
|
|
void *data;
|
|
|
|
u32 len;
|
2018-07-15 01:34:39 -06:00
|
|
|
};
|
|
|
|
|
2018-10-14 05:37:48 -06:00
|
|
|
struct mlx5e_xdp_info_fifo {
|
|
|
|
struct mlx5e_xdp_info *xi;
|
|
|
|
u32 *cc;
|
|
|
|
u32 *pc;
|
|
|
|
u32 mask;
|
|
|
|
};
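The fifo above is a power-of-two ring: pc and cc are free-running producer and consumer counters stored by pointer (they live in the owning SQ), and mask wraps them into the xi array. A minimal sketch of push/pop under that assumption (illustrative, not the driver's helpers):
/* Sketch only: push/pop on the xdpi fifo. Assumes the fifo size is a power
 * of two (mask == size - 1) and callers never overrun it; the counters are
 * free-running and only masked on access.
 */
static inline void mlx5e_xdpi_fifo_push_sketch(struct mlx5e_xdp_info_fifo *fifo,
					       struct mlx5e_xdp_info *xi)
{
	fifo->xi[(*fifo->pc)++ & fifo->mask] = *xi;
}

static inline struct mlx5e_xdp_info
mlx5e_xdpi_fifo_pop_sketch(struct mlx5e_xdp_info_fifo *fifo)
{
	return fifo->xi[(*fifo->cc)++ & fifo->mask];
}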
|
|
|
|
|
2018-10-14 05:46:57 -06:00
|
|
|
struct mlx5e_xdp_wqe_info {
|
|
|
|
u8 num_wqebbs;
|
net/mlx5e: XDP, Inline small packets into the TX MPWQE in XDP xmit flow
Upon high packet rate with multiple CPUs TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch comes to mitigate this problem by moving some workload to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).
When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) is sent inline within
its WQE TX descriptor (mem-copied) when the hardware tx queue is congested
beyond a pre-defined watermark.
This better utilizes the HW resources (one less packet-data prefetch per
packet) and allows better scalability, at the cost of CPU usage (which
now memcpy's the packet into the WQE).
To load balance between HW and CPU and get the max packet rate, we use
watermarks to detect how congested the HW is, and move the workload
back and forth between HW and CPU.
Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
* Tested with hyper-threading disabled
XDP_TX:
| | before | after | |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring | 12Mpps | 12Mpps | same |
XDP_REDIRECT:
** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.
| | before | after | |
| 32 rings | 64Mpps | 92Mpps | +43% |
| 1 ring | 6.4Mpps | 6.4Mpps | same |
As we can see, the feature significantly improves scaling, without
hurting single-ring performance.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-14 06:54:07 -06:00
|
|
|
u8 num_pkts;
|
2018-10-14 05:46:57 -06:00
|
|
|
};
|
|
|
|
|
2018-11-21 05:08:06 -07:00
|
|
|
struct mlx5e_xdp_mpwqe {
|
|
|
|
/* Current MPWQE session */
|
|
|
|
struct mlx5e_tx_wqe *wqe;
|
|
|
|
u8 ds_count;
|
2019-03-14 06:54:07 -06:00
|
|
|
u8 pkt_count;
|
|
|
|
u8 inline_on;
|
2018-11-21 05:08:06 -07:00
|
|
|
};
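inline_on implements the watermark scheme from the inline-small-packets commit above: once the SQ is congested past a high watermark, packets up to roughly 256 bytes are memcpy'd into the MPWQE instead of being fetched by the HW. A minimal sketch of the decision; all three threshold constants are assumptions for illustration, not the driver's tuned values:
/* Sketch only: watermark-driven inlining decision for XDP TX MPWQEs.
 * The constants below are illustrative assumptions.
 */
#define SKETCH_XDP_INLINE_MAX_LEN	256	/* inline packets up to this size */
#define SKETCH_XDP_INLINE_WM_HIGH	1024	/* start inlining above this */
#define SKETCH_XDP_INLINE_WM_LOW	512	/* stop inlining below this */

static inline bool mlx5e_xdp_should_inline_sketch(struct mlx5e_xdp_mpwqe *mpwqe,
						  u16 cc, u16 pc, u32 pkt_len)
{
	u16 outstanding = pc - cc;	/* in-flight work, wraps in u16 */

	if (outstanding >= SKETCH_XDP_INLINE_WM_HIGH)
		mpwqe->inline_on = 1;	/* HW congested: shift copies to CPU */
	else if (outstanding <= SKETCH_XDP_INLINE_WM_LOW)
		mpwqe->inline_on = 0;	/* HW caught up: stop memcpying */

	return mpwqe->inline_on && pkt_len <= SKETCH_XDP_INLINE_MAX_LEN;
}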
|
|
|
|
|
|
|
|
struct mlx5e_xdpsq;
|
2019-06-26 08:35:38 -06:00
|
|
|
typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
|
2019-06-26 08:35:32 -06:00
|
|
|
typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
|
|
|
|
struct mlx5e_xdp_xmit_data *,
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xdp_info *,
|
|
|
|
int);
|
2019-06-26 08:35:32 -06:00
|
|
|
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_xdpsq {
|
|
|
|
/* data path */
|
|
|
|
|
2018-05-22 07:43:54 -06:00
|
|
|
/* dirtied @completion */
|
2018-10-14 05:37:48 -06:00
|
|
|
u32 xdpi_fifo_cc;
|
2017-03-24 15:52:14 -06:00
|
|
|
u16 cc;
|
|
|
|
|
2018-05-22 07:43:54 -06:00
|
|
|
/* dirtied @xmit */
|
2018-10-14 05:37:48 -06:00
|
|
|
u32 xdpi_fifo_pc ____cacheline_aligned_in_smp;
|
|
|
|
u16 pc;
|
2018-11-21 05:06:02 -07:00
|
|
|
struct mlx5_wqe_ctrl_seg *doorbell_cseg;
|
2018-11-21 05:08:06 -07:00
|
|
|
struct mlx5e_xdp_mpwqe mpwqe;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2018-05-22 07:43:54 -06:00
|
|
|
struct mlx5e_cq cq;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
|
|
|
/* read only */
|
2019-06-26 08:35:38 -06:00
|
|
|
struct xdp_umem *umem;
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5_wq_cyc wq;
|
2018-05-22 07:29:31 -06:00
|
|
|
struct mlx5e_xdpsq_stats *stats;
|
2019-06-26 08:35:38 -06:00
|
|
|
mlx5e_fp_xmit_xdp_frame_check xmit_xdp_frame_check;
|
2018-11-21 05:08:06 -07:00
|
|
|
mlx5e_fp_xmit_xdp_frame xmit_xdp_frame;
|
2018-05-22 07:43:54 -06:00
|
|
|
struct {
|
2018-10-14 05:46:57 -06:00
|
|
|
struct mlx5e_xdp_wqe_info *wqe_info;
|
2018-10-14 05:37:48 -06:00
|
|
|
struct mlx5e_xdp_info_fifo xdpi_fifo;
|
2018-05-22 07:43:54 -06:00
|
|
|
} db;
|
2017-03-24 15:52:14 -06:00
|
|
|
void __iomem *uar_map;
|
|
|
|
u32 sqn;
|
|
|
|
struct device *pdev;
|
|
|
|
__be32 mkey_be;
|
|
|
|
u8 min_inline_mode;
|
|
|
|
unsigned long state;
|
2018-07-15 01:34:39 -06:00
|
|
|
unsigned int hw_mtu;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
|
|
|
/* control path */
|
|
|
|
struct mlx5_wq_ctrl wq_ctrl;
|
|
|
|
struct mlx5e_channel *channel;
|
|
|
|
} ____cacheline_aligned_in_smp;
|
|
|
|
|
|
|
|
struct mlx5e_icosq {
|
|
|
|
/* data path */
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffer mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving fewer
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), leaving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve the packet rate at high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
u16 cc;
|
|
|
|
u16 pc;
|
2017-03-24 15:52:14 -06:00
|
|
|
|
2019-02-27 03:06:08 -07:00
|
|
|
struct mlx5_wqe_ctrl_seg *doorbell_cseg;
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_cq cq;
|
|
|
|
|
|
|
|
/* write@xmit, read@completion */
|
|
|
|
struct {
|
|
|
|
struct mlx5e_sq_wqe_info *ico_wqe;
|
|
|
|
} db;
|
|
|
|
|
|
|
|
/* read only */
|
|
|
|
struct mlx5_wq_cyc wq;
|
|
|
|
void __iomem *uar_map;
|
|
|
|
u32 sqn;
|
|
|
|
unsigned long state;
|
|
|
|
|
|
|
|
/* control path */
|
|
|
|
struct mlx5_wq_ctrl wq_ctrl;
|
|
|
|
struct mlx5e_channel *channel;
|
2019-06-25 08:44:28 -06:00
|
|
|
|
|
|
|
struct work_struct recover_work;
|
2017-03-24 15:52:07 -06:00
|
|
|
} ____cacheline_aligned_in_smp;
|
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page that, in most cases, is largely
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it can,
until the page is fully consumed. In each round, the WQE is capable
of receiving a packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size and consumes only the
exact amount of memory needed to build a linear SKB. Then, it updates
the buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
struct mlx5e_wqe_frag_info {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use a non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing support for HW LRO in the legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info" and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it is constant
across different cycles of a WQ. This allows initializing
the mapping at the time of RQ creation, instead of handling it
in the datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good for performance reasons too,
hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_dma_info *di;
|
2017-01-29 08:42:26 -07:00
|
|
|
u32 offset;
|
2018-05-02 09:23:58 -06:00
|
|
|
bool last_in_page;
|
2017-01-29 08:42:26 -07:00
|
|
|
};
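Under the legacy-RQ memory scheme described above, several frags can share one mlx5e_dma_info, and only the frag flagged last_in_page releases the page. A minimal sketch of the release path; mlx5e_page_release() is named after the wrapper mentioned in the XSK commit text and is treated as an assumption here:
/* Hypothetical declaration assumed by the sketch below. */
struct mlx5e_rq;
void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
			bool recycle);

/* Sketch only: release one WQE frag under the shared-page scheme. Only the
 * last frag mapped into the page actually frees (or recycles) it.
 */
static inline void mlx5e_put_rx_frag_sketch(struct mlx5e_rq *rq,
					    struct mlx5e_wqe_frag_info *frag,
					    bool recycle)
{
	if (frag->last_in_page)
		mlx5e_page_release(rq, frag->di, recycle);
}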
|
|
|
|
|
2017-03-24 15:52:07 -06:00
|
|
|
struct mlx5e_umr_dma_info {
|
|
|
|
struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mlx5e_mpw_info {
|
|
|
|
struct mlx5e_umr_dma_info umr;
|
|
|
|
u16 consumed_strides;
|
2018-02-07 05:46:36 -07:00
|
|
|
DECLARE_BITMAP(xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE);
|
2017-03-24 15:52:07 -06:00
|
|
|
};
|
|
|
|
|
2018-05-02 09:23:58 -06:00
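A minimal sketch of the "first WQE allocates, last WQE frees" ownership
rule above; the field and helper names are assumptions for illustration,
with mlx5e_page_release() taken as the page release helper:

struct frag_owner_sketch {
	struct mlx5e_dma_info *di;	/* page shared by several WQE frags */
	bool last_in_page;		/* set only on the final frag in the page */
};

static void frag_put_sketch(struct mlx5e_rq *rq, struct frag_owner_sketch *f,
			    bool recycle)
{
	/* Only the designated last consumer returns the shared page */
	if (f->last_in_page)
		mlx5e_page_release(rq, f->di, recycle);
}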
|
|
|
#define MLX5E_MAX_RX_FRAGS 4
|
|
|
|
|
net/mlx5e: Implement RX mapped page cache for page recycle
Instead of reallocating and mapping pages for the RX data-path,
recycle already-used pages in a per-ring cache.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache
1% gain
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache
3.7% gain
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache
5.4% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:38 -06:00
|
|
|
/* a single cache unit can serve one napi call (for non-striding rq)
|
|
|
|
* or an MPWQE (for striding rq).
|
|
|
|
*/
|
|
|
|
#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
|
|
|
|
MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
|
2017-07-02 08:33:59 -06:00
|
|
|
#define MLX5E_CACHE_SIZE (4 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
|
net/mlx5e: Implement RX mapped page cache for page recycle
2016-09-15 07:08:38 -06:00
|
|
|
struct mlx5e_page_cache {
|
|
|
|
u32 head;
|
|
|
|
u32 tail;
|
|
|
|
struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
|
|
|
|
};
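The head/tail fields implement a ring buffer; a hedged sketch of the
"put" side, assuming the usual power-of-two ring arithmetic and with
stats updates omitted:

static bool mlx5e_page_cache_put_sketch(struct mlx5e_page_cache *cache,
					struct mlx5e_dma_info *di)
{
	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);

	if (tail_next == cache->head)
		return false;	/* cache full: caller releases the page */

	cache->page_cache[cache->tail] = *di;
	cache->tail = tail_next;
	return true;
}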
|
|
|
|
|
2017-03-24 15:52:07 -06:00
|
|
|
struct mlx5e_rq;
|
|
|
|
typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
|
net/mlx5e: Use linear SKB in Striding RQ
The current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximizes
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and requires memcpying the packet headers into
the SKB linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of a linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. The packet does not cross a page boundary.
2. There is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailroom.
This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using a linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in the datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data resides contiguously in the linear part, rather than being
split into a small linear part and a large fragment.
This saves datapath cycles in the driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. it does not keep headroom. To overcome this, we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0, as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes the default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
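Conditions (a) and (b) above reduce to a simple size check; a hedged
sketch (the helper name is illustrative; SKB_DATA_ALIGN and
skb_shared_info come from <linux/skbuff.h>):

static bool mlx5e_rx_is_linear_skb_sketch(u32 mtu, u32 headroom)
{
	/* One stride must hold headroom + packet + shared_info tailroom */
	u32 frag_sz = headroom + mtu +
		      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	return frag_sz <= PAGE_SIZE;	/* condition (a); LRO-off (b) is checked elsewhere */
}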
|
|
|
typedef struct sk_buff *
|
|
|
|
(*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
|
|
|
|
u16 cqe_bcnt, u32 head_offset, u32 page_idx);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
typedef struct sk_buff *
|
|
|
|
(*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
|
2017-07-17 03:27:26 -06:00
|
|
|
typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
|
2017-03-24 15:52:07 -06:00
|
|
|
typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
|
|
|
|
|
2017-12-12 06:46:49 -07:00
|
|
|
enum mlx5e_rq_flag {
|
2019-03-10 07:29:13 -06:00
|
|
|
MLX5E_RQ_FLAG_XDP_XMIT,
|
2019-03-10 07:35:58 -06:00
|
|
|
MLX5E_RQ_FLAG_XDP_REDIRECT,
|
2017-12-12 06:46:49 -07:00
|
|
|
};
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_rq_frag_info {
|
|
|
|
int frag_size;
|
|
|
|
int frag_stride;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mlx5e_rq_frags_info {
|
|
|
|
struct mlx5e_rq_frag_info arr[MLX5E_MAX_RX_FRAGS];
|
|
|
|
u8 num_frags;
|
|
|
|
u8 log_num_frags;
|
|
|
|
u8 wqe_bulk;
|
|
|
|
};
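For illustration, the layout encoded by mlx5e_rq_frags_info can be
consumed like this (a hypothetical helper, not driver code):

/* Total bytes one WQE spans across its frags, per the layout above */
static int mlx5e_rq_wqe_span_sketch(const struct mlx5e_rq_frags_info *info)
{
	int i, span = 0;

	for (i = 0; i < info->num_frags; i++)
		span += info->arr[i].frag_stride;

	return span;
}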
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_rq {
|
|
|
|
/* data path */
|
2016-09-21 03:19:43 -06:00
|
|
|
union {
|
net/mlx5e: Introduce RX Page-Reuse
2017-01-29 08:42:26 -07:00
|
|
|
struct {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5_wq_cyc wq;
|
|
|
|
struct mlx5e_wqe_frag_info *frags;
|
|
|
|
struct mlx5e_dma_info *di;
|
|
|
|
struct mlx5e_rq_frags_info info;
|
|
|
|
mlx5e_fp_skb_from_cqe skb_from_cqe;
|
net/mlx5e: Introduce RX Page-Reuse
2017-01-29 08:42:26 -07:00
|
|
|
} wqe;
|
2016-09-21 03:19:43 -06:00
|
|
|
struct {
|
2018-04-02 08:23:14 -06:00
|
|
|
struct mlx5_wq_ll wq;
|
2017-12-20 02:56:35 -07:00
|
|
|
struct mlx5e_umr_wqe umr_wqe;
|
2016-09-21 03:19:43 -06:00
|
|
|
struct mlx5e_mpw_info *info;
|
net/mlx5e: Use linear SKB in Striding RQ
2018-02-07 05:41:25 -07:00
|
|
|
mlx5e_fp_skb_from_cqe_mpwrq skb_from_cqe_mpwrq;
|
2017-02-13 09:41:30 -07:00
|
|
|
u16 num_strides;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffer mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually more than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving
fewer packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets, more MPWQEs (and UMR posts)
are needed (twice as many for the default MTU), leaving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate at high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
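A hedged sketch of the bulking rule above (the constant and helper are
illustrative): UMR WQEs are posted only once at least a bulk's worth of
MPWQEs has been consumed, and while outstanding posts remain below the
allowed maximum.

#define UMR_POST_BULK_SKETCH	2	/* bulks of (at least) two */

static bool mlx5e_umr_can_post_sketch(u16 consumed_mpwqes, u8 umr_in_progress,
				      u8 max_outstanding)
{
	return consumed_mpwqes >= UMR_POST_BULK_SKETCH &&
	       umr_in_progress < max_outstanding;
}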
|
|
|
u16 actual_wq_head;
|
2017-07-02 10:02:05 -06:00
|
|
|
u8 log_stride_sz;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
2019-02-27 03:06:08 -07:00
|
|
|
u8 umr_in_progress;
|
|
|
|
u8 umr_last_bulk;
|
2019-06-26 08:35:31 -06:00
|
|
|
u8 umr_completed;
|
2016-09-21 03:19:43 -06:00
|
|
|
} mpwqe;
|
|
|
|
};
|
2016-09-21 03:19:42 -06:00
|
|
|
struct {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, meaning that two
RQs run simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO, to ensure XDP can
be re-enabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
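The split RQ-ID namespace described above can be sketched with two
hypothetical helpers:

/* With N channels, IDs [0, N) are regular RQs; [N, 2N) are XSK RQs */
static inline bool mlx5e_qid_is_xsk_sketch(u32 qid, u32 num_channels)
{
	return qid >= num_channels;
}

static inline u32 mlx5e_qid_to_channel_sketch(u32 qid, u32 num_channels)
{
	return qid < num_channels ? qid : qid - num_channels;
}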
|
|
|
u16 umem_headroom;
|
2017-02-13 09:41:30 -07:00
|
|
|
u16 headroom;
|
2016-09-21 03:19:48 -06:00
|
|
|
u8 map_dir; /* dma map direction */
|
2016-09-21 03:19:42 -06:00
|
|
|
} buff;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-07-17 03:27:26 -06:00
|
|
|
struct mlx5e_channel *channel;
|
2015-05-28 13:28:48 -06:00
|
|
|
struct device *pdev;
|
|
|
|
struct net_device *netdev;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats;
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_cq cq;
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp cqd;
|
net/mlx5e: Implement RX mapped page cache for page recycle
2016-09-15 07:08:38 -06:00
|
|
|
struct mlx5e_page_cache page_cache;
|
2017-08-15 04:46:04 -06:00
|
|
|
struct hwtstamp_config *tstamp;
|
|
|
|
struct mlx5_clock *clock;
|
net/mlx5e: Implement RX mapped page cache for page recycle
2016-09-15 07:08:38 -06:00
|
|
|
|
2016-04-20 13:02:12 -06:00
|
|
|
mlx5e_fp_handle_rx_cqe handle_rx_cqe;
|
2017-07-17 03:27:26 -06:00
|
|
|
mlx5e_fp_post_rx_wqes post_wqes;
|
2016-06-30 08:34:46 -06:00
|
|
|
mlx5e_fp_dealloc_wqe dealloc_wqe;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
unsigned long state;
|
|
|
|
int ix;
|
net/mlx5e: RX, verify received packet size in Linear Striding RQ
In the case of striding RQ, we use MPWRQ (Multi-Packet WQE RQ), which
means that a WQE (RX descriptor) can serve many packets, so the WQE is
much bigger than the MTU. In virtualization setups where the port MTU
can be larger than the VF MTU, a received packet bigger than the VF MTU
won't be dropped by HW for exceeding a too-small receive WQE. If we use
a linear SKB in striding RQ, since each stride has room for an MTU-sized
payload plus skb info, an oversized packet can lead to a crash by
crossing the allocated page boundary upon the call to build_skb. So the
driver needs to check the packet size and drop oversized packets.
Introduce a new SW rx counter, rx_oversize_pkts_sw_drop, which counts
the number of packets dropped by the driver for being too large.
As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 22:31:10 -06:00
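A hedged sketch of the resulting datapath guard; the helper and the
stats field name are assumptions based on the commit text:

static bool mlx5e_rx_oversize_drop_sketch(struct mlx5e_rq *rq, u32 cqe_bcnt)
{
	if (likely(cqe_bcnt <= rq->hw_mtu))
		return false;

	rq->stats->oversize_pkts_sw_drop++;	/* the new SW counter */
	return true;	/* caller drops instead of calling build_skb() */
}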
|
|
|
unsigned int hw_mtu;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-01-31 07:44:48 -07:00
|
|
|
struct dim dim; /* Dynamic Interrupt Moderation */
|
2017-03-24 15:52:08 -06:00
|
|
|
|
|
|
|
/* XDP */
|
net/mlx5e: XDP fast RX drop bpf programs support
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change channels RQs type to
MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and request from user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we update each
RQ's xdp program on the fly. For synchronization with the current
datapath RX activity of that RQ, we temporarily disable the RQ and
ensure the RX path is not running, then quickly update and re-enable
the RQ. For that we do:
- rq.state = disabled
- napi_synchronize
- xchg(rq->xdp_prg)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, comparing a TC drop action on the RX side
to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
RX Cores Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.3Mpps 5.3Mpps 16.5Mpps
2 10.2Mpps 10.2Mpps 31.3Mpps
4 20.5Mpps 19.9Mpps 36.3Mpps*
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that receive side can handle more.
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 03:19:46 -06:00
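The exchange sequence listed above, as a hedged sketch (the state flag
and channel back-pointer are assumed per the surrounding struct):

static void mlx5e_xdp_swap_prog_sketch(struct mlx5e_rq *rq,
				       struct bpf_prog *prog)
{
	clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = disabled */
	napi_synchronize(&rq->channel->napi);		/* RX path quiesced */
	xchg(&rq->xdp_prog, prog);			/* swap programs */
	set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);	/* rq.state = enabled */
	napi_schedule(&rq->channel->napi);		/* in case we missed an IRQ */
}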
|
|
|
struct bpf_prog *xdp_prog;
|
2019-06-26 08:35:33 -06:00
|
|
|
struct mlx5e_xdpsq *xdpsq;
|
2017-12-12 06:46:49 -07:00
|
|
|
DECLARE_BITMAP(flags, 8);
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page returning the page through the page allocator, and, at the
same time, have pages getting returned to the page_pool from the
ndo_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both
rhashtable ID-lookup and locked return into the (page_pool) ptr_ring.
(It is the plan to remove these per-frame operations in a followup
patchset.)
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap). The target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single-flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because
we hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for the xdp_return_frame API,
which is DMA-TX completion running on same-CPU or cross-CPU
free/return. Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This
is very close to our 13Mpps max target.
The reason the max target isn't reached in the cross-CPU test is
likely RX-ring DMA unmap/map overhead (which doesn't occur in
ixgbe-to-ixgbe testing). It is also planned to remove this unnecessary
DMA unmap in a later patchset.
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
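A minimal sketch of wiring an RQ to a page_pool; the parameter choices
are illustrative, and page_pool_create() returns ERR_PTR() on failure,
never NULL, as the commit notes:

#include <net/page_pool.h>

static struct page_pool *mlx5e_rq_page_pool_sketch(struct device *dev,
						   u32 pool_size)
{
	struct page_pool_params pp_params = {
		.order		= 0,		/* order-0 pages only */
		.pool_size	= pool_size,	/* sized to the RQ's demand */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp_params);	/* ERR_PTR() on failure */
}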
|
|
|
struct page_pool *page_pool;
|
2016-06-23 08:02:41 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
/* AF_XDP zero-copy */
|
|
|
|
struct zero_copy_allocator zca;
|
|
|
|
struct xdp_umem *umem;
|
|
|
|
|
2019-06-26 14:21:40 -06:00
|
|
|
struct work_struct recover_work;
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
/* control */
|
|
|
|
struct mlx5_wq_ctrl wq_ctrl;
|
2017-02-13 09:41:30 -07:00
|
|
|
__be32 mkey_be;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size; every received
packet is aligned to the beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve one
received packet of any size up to MTU, or 64KB in case device LRO is
enabled, making it very wasteful when dealing with small packets or
when device LRO is enabled.
Thanks to its flexibility, MPWQE allows better memory utilization
(implying improvements in CPU utilization and packet rate), as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
| num packets | packet loss before | packet loss after |
|          2K |               ~ 1K |                 0 |
|          8K |               ~ 6K |                 0 |
|         16K |               ~13K |                 0 |
|         32K |               ~28K |                 0 |
|         64K |               ~57K |              ~24K |
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
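The footprint figure quoted above, spelled out:

/* MPWQE default footprint per ring:
 *   num_wqes * strides_per_wqe * stride_size
 * = 16       * 2048            * 64 B
 * = 2,097,152 B (2 MB), vs. ~1.5 MB (1024 WQEs * 1500 B MTU) before.
 */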
|
|
|
u8 wq_type;
|
2015-05-28 13:28:48 -06:00
|
|
|
u32 rqn;
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev;
|
2016-11-30 08:59:39 -07:00
|
|
|
struct mlx5_core_mkey umr_mkey;
|
2018-01-03 03:25:18 -07:00
|
|
|
|
|
|
|
/* XDP read-mostly */
|
|
|
|
struct xdp_rxq_info xdp_rxq;
|
2015-05-28 13:28:48 -06:00
|
|
|
} ____cacheline_aligned_in_smp;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
enum mlx5e_channel_state {
|
|
|
|
MLX5E_CHANNEL_STATE_XSK,
|
|
|
|
MLX5E_CHANNEL_NUM_STATES
|
|
|
|
};
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_channel {
|
|
|
|
/* data path */
|
|
|
|
struct mlx5e_rq rq;
|
2019-06-26 08:35:33 -06:00
|
|
|
struct mlx5e_xdpsq rq_xdpsq;
|
2017-03-24 15:52:14 -06:00
|
|
|
struct mlx5e_txqsq sq[MLX5E_MAX_NUM_TC];
|
|
|
|
struct mlx5e_icosq icosq; /* internal control operations */
|
2016-09-21 03:19:48 -06:00
|
|
|
bool xdp;
|
2015-05-28 13:28:48 -06:00
|
|
|
struct napi_struct napi;
|
|
|
|
struct device *pdev;
|
|
|
|
struct net_device *netdev;
|
|
|
|
__be32 mkey_be;
|
|
|
|
u8 num_tc;
|
2019-08-07 08:46:15 -06:00
|
|
|
u8 lag_port;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2018-05-22 07:48:48 -06:00
|
|
|
/* XDP_REDIRECT */
|
|
|
|
struct mlx5e_xdpsq xdpsq;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
/* AF_XDP zero-copy */
|
|
|
|
struct mlx5e_rq xskrq;
|
|
|
|
struct mlx5e_xdpsq xsksq;
|
|
|
|
struct mlx5e_icosq xskicosq;
|
|
|
|
/* xskicosq can be accessed from any CPU - the spinlock protects it. */
|
|
|
|
spinlock_t xskicosq_lock;
|
|
|
|
|
2017-07-02 04:17:42 -06:00
|
|
|
/* data path - accessed per napi poll */
|
|
|
|
struct irq_desc *irq_desc;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_ch_stats *stats;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
/* control */
|
|
|
|
struct mlx5e_priv *priv;
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5_core_dev *mdev;
|
2017-08-15 04:46:04 -06:00
|
|
|
struct hwtstamp_config *tstamp;
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
|
2015-05-28 13:28:48 -06:00
|
|
|
int ix;
|
2017-11-09 23:59:52 -07:00
|
|
|
int cpu;
|
2018-10-25 06:03:38 -06:00
|
|
|
cpumask_var_t xps_cpumask;
|
2015-05-28 13:28:48 -06:00
|
|
|
};
|
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
struct mlx5e_channels {
|
|
|
|
struct mlx5e_channel **c;
|
|
|
|
unsigned int num;
|
2016-12-21 08:24:35 -07:00
|
|
|
struct mlx5e_params params;
|
2017-02-06 04:14:34 -07:00
|
|
|
};
|
|
|
|
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_channel_stats {
|
|
|
|
struct mlx5e_ch_stats ch;
|
|
|
|
struct mlx5e_sq_stats sq[MLX5E_MAX_NUM_TC];
|
|
|
|
struct mlx5e_rq_stats rq;
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_rq_stats xskrq;
|
2018-05-22 07:29:31 -06:00
|
|
|
struct mlx5e_xdpsq_stats rq_xdpsq;
|
2018-05-22 07:48:48 -06:00
|
|
|
struct mlx5e_xdpsq_stats xdpsq;
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xdpsq_stats xsksq;
|
2018-04-12 07:03:37 -06:00
|
|
|
} ____cacheline_aligned_in_smp;
|
|
|
|
|
2016-04-28 16:36:37 -06:00
|
|
|
enum {
|
|
|
|
MLX5E_STATE_OPENED,
|
|
|
|
MLX5E_STATE_DESTROYING,
|
2019-02-11 17:27:02 -07:00
|
|
|
MLX5E_STATE_XDP_TX_ENABLED,
|
2019-06-26 08:35:38 -06:00
|
|
|
MLX5E_STATE_XDP_OPEN,
|
2016-04-28 16:36:37 -06:00
|
|
|
};
|
|
|
|
|
2016-07-01 05:51:06 -06:00
|
|
|
struct mlx5e_rqt {
|
2016-04-28 16:36:32 -06:00
|
|
|
u32 rqtn;
|
2016-07-01 05:51:06 -06:00
|
|
|
bool enabled;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct mlx5e_tir {
|
|
|
|
u32 tirn;
|
|
|
|
struct mlx5e_rqt rqt;
|
|
|
|
struct list_head list;
|
2016-04-28 16:36:32 -06:00
|
|
|
};
|
|
|
|
|
2016-04-28 16:36:37 -06:00
|
|
|
enum {
|
|
|
|
MLX5E_TC_PRIO = 0,
|
|
|
|
MLX5E_NIC_PRIO
|
|
|
|
};
|
|
|
|
|
2018-11-06 12:05:29 -07:00
|
|
|
struct mlx5e_rss_params {
|
|
|
|
u32 indirection_rqt[MLX5E_INDIR_RQT_SIZE];
|
2018-10-23 07:03:33 -06:00
|
|
|
u32 rx_hash_fields[MLX5E_NUM_INDIR_TIRS];
|
2018-11-06 12:05:29 -07:00
|
|
|
u8 toeplitz_hash_key[40];
|
|
|
|
u8 hfunc;
|
|
|
|
};
|
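As an illustration of how the fields above are typically seeded, here
is a minimal standalone C sketch that fills an indirection table
round-robin over the active channels, the conventional default for
tables like indirection_rqt. The names and table size are invented
for the sketch (MODEL_INDIR_RQT_SIZE stands in for
MLX5E_INDIR_RQT_SIZE); the 40-byte toeplitz_hash_key above matches
the standard Toeplitz RSS key length and is usually filled with
random bytes, which is not modeled here.

#include <stdint.h>
#include <stdio.h>

#define MODEL_INDIR_RQT_SIZE 128 /* invented; stands in for MLX5E_INDIR_RQT_SIZE */

/* Map RSS hash buckets round-robin onto num_channels channels. */
static void model_fill_indir(uint32_t *tbl, int tbl_size, int num_channels)
{
	for (int i = 0; i < tbl_size; i++)
		tbl[i] = i % num_channels;
}

int main(void)
{
	uint32_t tbl[MODEL_INDIR_RQT_SIZE];

	model_fill_indir(tbl, MODEL_INDIR_RQT_SIZE, 6);
	printf("bucket 0 -> ch %u, bucket 7 -> ch %u\n",
	       (unsigned)tbl[0], (unsigned)tbl[7]);
	return 0;
}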
|
|
|
|
net/mlx5e: Add tx reporter support
Add an mlx5e TX reporter to the devlink health reporters. This
reporter is responsible for diagnosing, reporting, and recovering
from TX errors.
This patch declares the TX reporter operations and creates the
reporter using the devlink health API. Currently, this reporter only
supports reporting and recovering from send-error CQEs. In addition,
it adds diagnostic information for the open SQs.
For a local SQ recovery (triggered by a driver error report), a
failed SQ recovery marks the recovery operation as failed.
For a full TX recovery, the driver attempts to close and reopen the
channels. If this succeeds, the recovery is considered successful.
Recovering an SQ from an error CQE is not a new feature in the
driver; this patch reorganizes the functions and adapts them to the
devlink health API. For this purpose, code is moved from en_main.c to
a new file named reporter_tx.c.
Diagnose output:
$ devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": false
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$ devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: false
sqn: 142 HW state: 1 stopped: false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 02:36:40 -07:00
|
|
|
struct mlx5e_modify_sq_param {
|
|
|
|
int curr_state;
|
|
|
|
int next_state;
|
|
|
|
int rl_update;
|
|
|
|
int rl_index;
|
|
|
|
};
|
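The recovery policy described in the tx reporter commit message above
can be condensed into a standalone C model: a local recovery stands
or falls with the single SQ, while a full TX recovery stands or falls
with the channel close/reopen. This is a hedged sketch only; the SQ
numbers are borrowed from the quoted diagnose output, and the model_*
helpers are invented.

#include <stdbool.h>
#include <stdio.h>

static bool model_recover_sq(int sqn)       { return sqn != 142; } /* pretend SQ 142 is stuck */
static bool model_close_open_channels(void) { return true; }

/* Local recovery: the verdict is the SQ's own recovery result.
 * Full recovery: the verdict is the channel restart's result. */
static bool model_tx_recover(int sqn, bool full)
{
	if (!full)
		return model_recover_sq(sqn);
	return model_close_open_channels();
}

int main(void)
{
	printf("local sqn 138: %s\n", model_tx_recover(138, false) ? "recovered" : "failed");
	printf("local sqn 142: %s\n", model_tx_recover(142, false) ? "recovered" : "failed");
	printf("full tx      : %s\n", model_tx_recover(0, true) ? "recovered" : "failed");
	return 0;
}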
|
|
|
|
2019-08-21 23:06:00 -06:00
|
|
|
#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
|
|
|
|
struct mlx5e_hv_vhca_stats_agent {
|
|
|
|
struct mlx5_hv_vhca_agent *agent;
|
|
|
|
struct delayed_work work;
|
|
|
|
u16 delay;
|
|
|
|
void *buf;
|
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk {
|
|
|
|
/* UMEMs are stored separately from channels, because we don't want to
|
|
|
|
* lose them when channels are recreated. The kernel also stores UMEMs,
|
|
|
|
* but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
|
|
|
|
* so we rely on our own mechanism.
|
|
|
|
*/
|
|
|
|
struct xdp_umem **umems;
|
|
|
|
u16 refcnt;
|
|
|
|
bool ever_used;
|
|
|
|
};
|
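Here is a standalone C model of the bookkeeping this struct implies,
under the assumptions stated in the comment: only zero-copy UMEMs are
stored, presence of an entry is what later tells cleanup to close the
channel's XSK RQ/SQ, and state_lock (modeled by a pthread mutex)
guards the array. All model_* names are invented for the sketch.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_MAX_CHANNELS 64 /* invented; stands in for MLX5E_MAX_NUM_CHANNELS */

struct model_umem { int placeholder; };

struct model_xsk {
	pthread_mutex_t state_lock;
	struct model_umem *umems[MODEL_MAX_CHANNELS];
	uint16_t refcnt;
};

/* Record a zero-copy UMEM for channel ix under state_lock. */
static void model_xsk_add(struct model_xsk *x, int ix, struct model_umem *u)
{
	pthread_mutex_lock(&x->state_lock);
	x->umems[ix] = u;
	x->refcnt++;
	pthread_mutex_unlock(&x->state_lock);
}

/* Presence of the UMEM is what decides the XSK RQ/SQ's fate on cleanup. */
static int model_xsk_active(struct model_xsk *x, int ix)
{
	pthread_mutex_lock(&x->state_lock);
	int active = x->umems[ix] != NULL;
	pthread_mutex_unlock(&x->state_lock);
	return active;
}

int main(void)
{
	struct model_xsk x = { .state_lock = PTHREAD_MUTEX_INITIALIZER };
	struct model_umem u;

	model_xsk_add(&x, 3, &u);
	printf("channel 3 zero-copy: %d, refcnt: %u\n",
	       model_xsk_active(&x, 3), (unsigned)x.refcnt);
	return 0;
}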
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
struct mlx5e_priv {
|
|
|
|
/* priv data path fields - start */
|
2016-12-20 13:48:19 -07:00
|
|
|
struct mlx5e_txqsq *txq2sq[MLX5E_MAX_NUM_CHANNELS * MLX5E_MAX_NUM_TC];
|
|
|
|
int channel_tc2txq[MLX5E_MAX_NUM_CHANNELS][MLX5E_MAX_NUM_TC];
|
2017-07-18 15:23:36 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
struct mlx5e_dcbx_dp dcbx_dp;
|
|
|
|
#endif
|
2015-05-28 13:28:48 -06:00
|
|
|
/* priv data path fields - end */
|
|
|
|
|
2015-07-28 00:35:31 -06:00
|
|
|
u32 msglevel;
|
2015-05-28 13:28:48 -06:00
|
|
|
unsigned long state;
|
|
|
|
struct mutex state_lock; /* Protects Interface state */
|
2015-08-04 05:05:41 -06:00
|
|
|
struct mlx5e_rq drop_rq;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-02-06 04:14:34 -07:00
|
|
|
struct mlx5e_channels channels;
|
2019-08-07 08:46:15 -06:00
|
|
|
u32 tisn[MLX5_MAX_PORTS][MLX5E_MAX_NUM_TC];
|
2016-07-01 05:51:06 -06:00
|
|
|
struct mlx5e_rqt indir_rqt;
|
2016-07-01 05:51:05 -06:00
|
|
|
struct mlx5e_tir indir_tir[MLX5E_NUM_INDIR_TIRS];
|
2017-08-13 07:22:38 -06:00
|
|
|
struct mlx5e_tir inner_indir_tir[MLX5E_NUM_INDIR_TIRS];
|
2016-07-01 05:51:05 -06:00
|
|
|
struct mlx5e_tir direct_tir[MLX5E_MAX_NUM_CHANNELS];
|
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_tir xsk_tir[MLX5E_MAX_NUM_CHANNELS];
|
2018-11-06 12:05:29 -07:00
|
|
|
struct mlx5e_rss_params rss_params;
|
2016-06-23 08:02:38 -06:00
|
|
|
u32 tx_rates[MLX5E_MAX_NUM_SQS];
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-04-28 16:36:37 -06:00
|
|
|
struct mlx5e_flow_steering fs;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-05-01 13:59:56 -06:00
|
|
|
struct workqueue_struct *wq;
|
2015-05-28 13:28:48 -06:00
|
|
|
struct work_struct update_carrier_work;
|
|
|
|
struct work_struct set_rx_mode_work;
|
2016-06-30 08:34:45 -06:00
|
|
|
struct work_struct tx_timeout_work;
|
2018-09-12 00:45:33 -06:00
|
|
|
struct work_struct update_stats_work;
|
2018-10-20 07:18:00 -06:00
|
|
|
struct work_struct monitor_counters_work;
|
|
|
|
struct mlx5_nb monitor_counters_nb;
|
2015-05-28 13:28:48 -06:00
|
|
|
|
|
|
|
struct mlx5_core_dev *mdev;
|
|
|
|
struct net_device *netdev;
|
|
|
|
struct mlx5e_stats stats;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_channel_stats channel_stats[MLX5E_MAX_NUM_CHANNELS];
|
2019-07-14 02:43:43 -06:00
|
|
|
u16 max_nch;
|
2018-04-12 07:03:37 -06:00
|
|
|
u8 max_opened_tc;
|
2017-08-15 04:46:04 -06:00
|
|
|
struct hwtstamp_config tstamp;
|
2018-02-08 06:09:57 -07:00
|
|
|
u16 q_counter;
|
|
|
|
u16 drop_rq_q_counter;
|
2018-11-26 15:38:58 -07:00
|
|
|
struct notifier_block events_nb;
|
|
|
|
|
2016-11-27 08:02:04 -07:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
struct mlx5e_dcbx dcbx;
|
|
|
|
#endif
|
|
|
|
|
2016-07-01 05:51:07 -06:00
|
|
|
const struct mlx5e_profile *profile;
|
2016-07-01 05:51:08 -06:00
|
|
|
void *ppriv;
|
2017-04-18 07:04:28 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
struct mlx5e_ipsec *ipsec;
|
|
|
|
#endif
|
2018-04-30 01:16:21 -06:00
|
|
|
#ifdef CONFIG_MLX5_EN_TLS
|
|
|
|
struct mlx5e_tls *tls;
|
|
|
|
#endif
|
2019-02-07 02:36:40 -07:00
|
|
|
struct devlink_health_reporter *tx_reporter;
|
net/mlx5e: Add support to rx reporter diagnose
Add an RX reporter, which supports the diagnose callback. The
diagnostic output includes information common to all RQs (RQ type, RQ
size, RQ stride size, CQ size, and CQ stride size) and, in addition,
advertises per-RQ information together with the related ICOSQ and
attached CQ.
$ devlink health diagnose pci/0000:00:0b.0 reporter rx
Common config:
RQ:
type: 2 stride size: 2048 size: 8
CQ:
stride size: 64 size: 1024
RQs:
channel ix: 0 rqn: 4308 HW state: 1 SW state: 3 posted WQEs: 7 cc: 7 ICOSQ HW state: 1
CQ:
cqn: 1032 HW status: 0
channel ix: 1 rqn: 4313 HW state: 1 SW state: 3 posted WQEs: 7 cc: 7 ICOSQ HW state: 1
CQ:
cqn: 1036 HW status: 0
channel ix: 2 rqn: 4318 HW state: 1 SW state: 3 posted WQEs: 7 cc: 7 ICOSQ HW state: 1
CQ:
cqn: 1040 HW status: 0
channel ix: 3 rqn: 4323 HW state: 1 SW state: 3 posted WQEs: 7 cc: 7 ICOSQ HW state: 1
CQ:
cqn: 1044 HW status: 0
$ devlink health diagnose pci/0000:00:0b.0 reporter rx -jp
{
"Common config": {
"RQ": {
"type": 2,
"stride size": 2048,
"size": 8
},
"CQ": {
"stride size": 64,
"size": 1024
}
},
"RQs": [ {
"channel ix": 0,
"rqn": 4308,
"HW state": 1,
"SW state": 3,
"posted WQEs": 7,
"cc": 7,
"ICOSQ HW state": 1,
"CQ": {
"cqn": 1032,
"HW status": 0
}
},{
"channel ix": 1,
"rqn": 4313,
"HW state": 1,
"SW state": 3,
"posted WQEs": 7,
"cc": 7,
"ICOSQ HW state": 1,
"CQ": {
"cqn": 1036,
"HW status": 0
}
},{
"channel ix": 2,
"rqn": 4318,
"HW state": 1,
"SW state": 3,
"posted WQEs": 7,
"cc": 7,
"ICOSQ HW state": 1,
"CQ": {
"cqn": 1040,
"HW status": 0
}
},{
"channel ix": 3,
"rqn": 4323,
"HW state": 1,
"SW state": 3,
"posted WQEs": 7,
"cc": 7,
"ICOSQ HW state": 1,
"CQ": {
"cqn": 1044,
"HW status": 0
}
} ]
}
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-25 07:26:46 -06:00
|
|
|
struct devlink_health_reporter *rx_reporter;
|
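The per-RQ record that the rx reporter's diagnose callback emits
(quoted in the commit message above) can be modeled as a plain C
struct; printing it reproduces the non-JSON output format. The field
values are copied from the quoted example output, while the struct
itself is invented for illustration.

#include <stdio.h>

struct model_rq_diag {
	int channel_ix, rqn, hw_state, sw_state, posted_wqes, cc, icosq_hw_state;
	int cqn, cq_hw_status;
};

static void model_print_rq_diag(const struct model_rq_diag *d)
{
	printf("channel ix: %d rqn: %d HW state: %d SW state: %d posted WQEs: %d cc: %d ICOSQ HW state: %d\n",
	       d->channel_ix, d->rqn, d->hw_state, d->sw_state,
	       d->posted_wqes, d->cc, d->icosq_hw_state);
	printf("  CQ: cqn: %d HW status: %d\n", d->cqn, d->cq_hw_status);
}

int main(void)
{
	struct model_rq_diag d = { 0, 4308, 1, 3, 7, 7, 1, 1032, 0 };

	model_print_rq_diag(&d);
	return 0;
}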
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk xsk;
|
2019-08-21 23:06:00 -06:00
|
|
|
#if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
|
|
|
|
struct mlx5e_hv_vhca_stats_agent stats_agent;
|
|
|
|
#endif
|
2015-05-28 13:28:48 -06:00
|
|
|
};
|
|
|
|
|
2017-03-14 11:43:52 -06:00
|
|
|
struct mlx5e_profile {
|
2018-10-02 00:54:59 -06:00
|
|
|
int (*init)(struct mlx5_core_dev *mdev,
|
2017-03-14 11:43:52 -06:00
|
|
|
struct net_device *netdev,
|
|
|
|
const struct mlx5e_profile *profile, void *ppriv);
|
|
|
|
void (*cleanup)(struct mlx5e_priv *priv);
|
|
|
|
int (*init_rx)(struct mlx5e_priv *priv);
|
|
|
|
void (*cleanup_rx)(struct mlx5e_priv *priv);
|
|
|
|
int (*init_tx)(struct mlx5e_priv *priv);
|
|
|
|
void (*cleanup_tx)(struct mlx5e_priv *priv);
|
|
|
|
void (*enable)(struct mlx5e_priv *priv);
|
|
|
|
void (*disable)(struct mlx5e_priv *priv);
|
net/mlx5e: Don't refresh TIRs when updating representor SQs
Refreshing TIRs is done in order to update the TIRs with the current
state of SQs in the transport domain, so that the TIRs can filter out
undesired self-loopback packets based on the source SQ of the packet.
Representor TIRs will only receive packets that originate from their
associated vport, due to dedicated steering, and therefore will never
receive self-loopback packets, whose source vport will be the vport of
the E-Switch manager, and therefore not the vport associated with the
representor. As such, it is not necessary to refresh the representors'
TIRs, since self-loopback packets can't reach them.
Since representors only exist in switchdev mode, and there is no
scenario in which a representor will exist in the transport domain
alongside a non-representor, it is not necessary to refresh the
transport domain's TIRs upon changing the state of a representor's
queues. Therefore, do not refresh TIRs upon such a change. Achieve
this by adding an update_rx callback to the mlx5e_profile, which
refreshes TIRs for non-representors and does nothing for
representors, and by replacing calls to mlx5e_refresh_tirs() upon
changing the state of the queues with update_rx().
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-23 00:58:56 -06:00
|
|
|
int (*update_rx)(struct mlx5e_priv *priv);
|
2017-03-14 11:43:52 -06:00
|
|
|
void (*update_stats)(struct mlx5e_priv *priv);
|
2017-05-18 05:32:11 -06:00
|
|
|
void (*update_carrier)(struct mlx5e_priv *priv);
|
2017-04-12 21:37:03 -06:00
|
|
|
struct {
|
|
|
|
mlx5e_fp_handle_rx_cqe handle_rx_cqe;
|
|
|
|
mlx5e_fp_handle_rx_cqe handle_rx_cqe_mpwqe;
|
|
|
|
} rx_handlers;
|
2017-03-14 11:43:52 -06:00
|
|
|
int max_tc;
|
2019-07-14 02:43:43 -06:00
|
|
|
u8 rq_groups;
|
2017-03-14 11:43:52 -06:00
|
|
|
};
|
|
|
|
|
2016-06-23 08:02:45 -06:00
|
|
|
void mlx5e_build_ptys2ethtool_map(void);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
|
2019-03-20 04:02:06 -06:00
|
|
|
struct net_device *sb_dev);
|
2015-05-28 13:28:48 -06:00
|
|
|
netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
|
2018-04-30 01:16:20 -06:00
|
|
|
netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
|
2019-04-01 08:42:15 -06:00
|
|
|
struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2019-03-01 03:05:21 -07:00
|
|
|
void mlx5e_trigger_irq(struct mlx5e_icosq *sq);
|
2019-06-30 10:23:27 -06:00
|
|
|
void mlx5e_completion_event(struct mlx5_core_cq *mcq, struct mlx5_eqe *eqe);
|
2015-05-28 13:28:48 -06:00
|
|
|
void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
|
|
|
|
int mlx5e_napi_poll(struct napi_struct *napi, int budget);
|
2016-03-11 01:44:17 -07:00
|
|
|
bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
|
2015-11-18 07:30:56 -07:00
|
|
|
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
|
2017-03-24 15:52:14 -06:00
|
|
|
void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq);
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the multi-packet WQE (RX Work Queue Element) feature,
referred to as MPWQE or Striding RQ, in which WQEs are larger and
each serves multiple packets.
Every WQE consists of many strides of the same size; every received
packet is aligned to the beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each WQE is big enough to serve one received
packet of any size up to the MTU, or up to 64K when device LRO is
enabled, which is very wasteful for small packets or when device LRO
is enabled.
Thanks to its flexibility, MPWQE allows better memory utilization
(implying improvements in CPU utilization and packet rate), as
packets consume strides according to their size, preserving the rest
of the WQE for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint; hence we turn it on by default and get even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
This is expected, as the driver can now receive as many small packets
(<=64B) as the total number of strides in the ring (default =
2048 * 16), vs. 1024 (the default ring size regardless of packet
size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
|
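The footprint arithmetic from the commit message above checks out; a
trivial standalone C program reproduces it (the ~1.5MB figure assumes
MTU 1500 for the old 1024*mtu scheme):

#include <stdio.h>

int main(void)
{
	const long num_wqes = 16, strides_per_wqe = 2048, stride_size = 64;
	const long regular_ring_wqes = 1024, mtu = 1500;

	/* 16 * 2048 * 64 = 2097152 bytes = 2MB per ring */
	printf("striding RQ: %ld bytes\n", num_wqes * strides_per_wqe * stride_size);
	/* 1024 * 1500 = 1536000 bytes, i.e. ~1.5MB */
	printf("regular RQ : %ld bytes\n", regular_ring_wqes * mtu);
	/* capacity for small packets: total strides vs. ring size */
	printf("small pkts : %ld vs %ld\n",
	       num_wqes * strides_per_wqe, regular_ring_wqes);
	return 0;
}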
2019-06-25 07:26:46 -06:00
|
|
|
static inline u32 mlx5e_rqwq_get_size(struct mlx5e_rq *rq)
|
|
|
|
{
|
|
|
|
switch (rq->wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
|
|
|
return mlx5_wq_ll_get_size(&rq->mpwqe.wq);
|
|
|
|
default:
|
|
|
|
return mlx5_wq_cyc_get_size(&rq->wqe.wq);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u32 mlx5e_rqwq_get_cur_sz(struct mlx5e_rq *rq)
|
|
|
|
{
|
|
|
|
switch (rq->wq_type) {
|
|
|
|
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
|
|
|
|
return rq->mpwqe.wq.cur_sz;
|
|
|
|
default:
|
|
|
|
return rq->wqe.wq.cur_sz;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-02-07 05:51:45 -07:00
|
|
|
bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev);
|
|
|
|
bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params);
|
|
|
|
|
2018-07-15 01:28:44 -06:00
|
|
|
void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info);
|
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info,
|
|
|
|
bool recycle);
|
2016-04-20 13:02:12 -06:00
|
|
|
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
|
2016-04-20 13:02:13 -06:00
|
|
|
void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
|
2015-05-28 13:28:48 -06:00
|
|
|
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
|
2019-06-26 08:35:31 -06:00
|
|
|
void mlx5e_poll_ico_cq(struct mlx5e_cq *cq);
|
2017-07-17 03:27:26 -06:00
|
|
|
bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq);
|
2016-06-30 08:34:46 -06:00
|
|
|
void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
|
|
|
|
void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
|
net/mlx5e: Use linear SKB in Striding RQ
The current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides, which maximises memory
utilization. However, this prevents the use of build_skb() (which
requires headroom and tailroom) and requires a memcpy of the packet
headers into the SKB linear part.
In this patch, whenever a set of conditions holds, we apply an RQ
configuration that allows combining the use of a linear SKB with a
Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailroom.
This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes the default whenever the conditions hold.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
|
|
|
|
u16 cqe_bcnt, u32 head_offset, u32 page_idx);
|
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
|
|
|
|
u16 cqe_bcnt, u32 head_offset, u32 page_idx);
|
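The eligibility condition from the linear-SKB commit message above
(stride size = MTU + headroom + tailroom must fit in a page, with HW
LRO off) is easy to sketch as a standalone check. The
headroom/tailroom constants here are illustrative assumptions, not
the driver's exact values:

#include <stdbool.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_HEADROOM  256 /* assumed; e.g. XDP-style packet headroom */
#define MODEL_TAILROOM  320 /* assumed; room for shared info, rounded */

static bool model_linear_striding_rq_ok(int mtu, bool hw_lro)
{
	int stride_size = MODEL_HEADROOM + mtu + MODEL_TAILROOM;

	return !hw_lro && stride_size <= MODEL_PAGE_SIZE;
}

int main(void)
{
	printf("MTU 1500, LRO off: %d\n", model_linear_striding_rq_ok(1500, false));
	printf("MTU 9000, LRO off: %d\n", model_linear_striding_rq_ok(9000, false));
	printf("MTU 1500, LRO on : %d\n", model_linear_striding_rq_ok(1500, true));
	return 0;
}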
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied removing HW LRO support from the legacy RQ, as it would
require a large number of page allocations and scatter entries per
WQE on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good for performance reasons as
well, hence we extend the bulk beyond the minimal requirement above.
With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
|
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
|
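The sharing rule from the legacy-RQ commit message above (a struct
mlx5e_dma_info shared between WQE frags is allocated by the first one
and freed by the last one) is, in effect, a refcount. A standalone C
model, with invented names:

#include <stdbool.h>
#include <stdio.h>

struct model_dma_info { int refs; bool mapped; };

static void model_frag_post(struct model_dma_info *di)
{
	if (di->refs++ == 0)
		di->mapped = true; /* first user allocates/maps the page */
}

static void model_frag_complete(struct model_dma_info *di)
{
	if (--di->refs == 0)
		di->mapped = false; /* last user releases the page */
}

int main(void)
{
	struct model_dma_info di = { 0, false };

	model_frag_post(&di); /* frag A of a WQE */
	model_frag_post(&di); /* frag B shares the same page */
	model_frag_complete(&di);
	printf("after one completion : mapped=%d refs=%d\n", di.mapped, di.refs);
	model_frag_complete(&di);
	printf("after both completions: mapped=%d refs=%d\n", di.mapped, di.refs);
	return 0;
}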
2015-05-28 13:28:48 -06:00
|
|
|
|
2017-11-28 14:52:13 -07:00
|
|
|
void mlx5e_update_stats(struct mlx5e_priv *priv);
|
2018-02-13 06:48:30 -07:00
|
|
|
void mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats);
|
2018-12-12 02:42:30 -07:00
|
|
|
void mlx5e_fold_sw_stats64(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-04-28 16:36:38 -06:00
|
|
|
void mlx5e_init_l2_addr(struct mlx5e_priv *priv);
|
2016-11-27 08:02:09 -07:00
|
|
|
int mlx5e_self_test_num(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_self_test(struct net_device *ndev, struct ethtool_test *etest,
|
|
|
|
u64 *buf);
|
2015-05-28 13:28:48 -06:00
|
|
|
void mlx5e_set_rx_mode_work(struct work_struct *work);
|
|
|
|
|
2017-06-01 05:56:17 -06:00
|
|
|
int mlx5e_hwstamp_set(struct mlx5e_priv *priv, struct ifreq *ifr);
|
|
|
|
int mlx5e_hwstamp_get(struct mlx5e_priv *priv, struct ifreq *ifr);
|
2017-02-12 15:42:54 -07:00
|
|
|
int mlx5e_modify_rx_cqe_compression_locked(struct mlx5e_priv *priv, bool val);
|
2015-12-29 05:58:31 -07:00
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
int mlx5e_vlan_rx_add_vid(struct net_device *dev, __always_unused __be16 proto,
|
|
|
|
u16 vid);
|
|
|
|
int mlx5e_vlan_rx_kill_vid(struct net_device *dev, __always_unused __be16 proto,
|
|
|
|
u16 vid);
|
2018-01-08 01:01:04 -07:00
|
|
|
void mlx5e_timestamp_init(struct mlx5e_priv *priv);
|
2015-05-28 13:28:48 -06:00
|
|
|
|
2016-12-19 14:20:17 -07:00
|
|
|
struct mlx5e_redirect_rqt_param {
|
|
|
|
bool is_rss;
|
|
|
|
union {
|
|
|
|
u32 rqn; /* Direct RQN (Non-RSS) */
|
|
|
|
struct {
|
|
|
|
u8 hfunc;
|
|
|
|
struct mlx5e_channels *channels;
|
|
|
|
} rss; /* RSS data */
|
|
|
|
};
|
|
|
|
};
|
|
|
|
|
|
|
|
int mlx5e_redirect_rqt(struct mlx5e_priv *priv, u32 rqtn, int sz,
|
|
|
|
struct mlx5e_redirect_rqt_param rrp);
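A usage sketch for the declaration above (editorial; drop_rqn and chs are
placeholders): one entry point serves both the direct and the RSS case,
selected by is_rss.

static void example_redirect(struct mlx5e_priv *priv, u32 rqtn, int sz,
			     u32 drop_rqn, struct mlx5e_channels *chs)
{
	struct mlx5e_redirect_rqt_param direct_rrp = {
		.is_rss = false,
		.rqn = drop_rqn,	/* every RQT entry points at one RQ */
	};
	struct mlx5e_redirect_rqt_param rss_rrp = {
		.is_rss = true,
		.rss = {
			.hfunc = ETH_RSS_HASH_TOP,	/* Toeplitz */
			.channels = chs,
		},
	};

	mlx5e_redirect_rqt(priv, rqtn, sz, direct_rrp);
	mlx5e_redirect_rqt(priv, rqtn, sz, rss_rrp);
}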
|
2018-11-06 12:05:29 -07:00
|
|
|
void mlx5e_build_indir_tir_ctx_hash(struct mlx5e_rss_params *rss_params,
|
2018-10-28 08:22:57 -06:00
|
|
|
const struct mlx5e_tirc_config *ttconfig,
|
2017-08-13 07:22:38 -06:00
|
|
|
void *tirc, bool inner);
|
2018-10-23 01:02:08 -06:00
|
|
|
void mlx5e_modify_tirs_hash(struct mlx5e_priv *priv, void *in, int inlen);
|
2018-10-28 08:22:57 -06:00
|
|
|
struct mlx5e_tirc_config mlx5e_tirc_get_default_config(enum mlx5e_traffic_types tt);
|
2015-08-16 07:04:47 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, meaning that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break the mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used, to avoid losing
performance to retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently of the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. In that case, don't allow LRO, to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new ones, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
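The RQ ID namespace split described above can be sketched as a standalone
model (hypothetical helper names); it also shows why the channel count
must stay fixed while any zero-copy socket is active.

#include <assert.h>
#include <stdbool.h>

static unsigned int xsk_rq_id(unsigned int channel_ix, unsigned int nch)
{
	return nch + channel_ix;	/* upper half of the namespace */
}

static bool is_xsk_rq(unsigned int rq_id, unsigned int nch)
{
	return rq_id >= nch;
}

static unsigned int rq_id_to_channel(unsigned int rq_id, unsigned int nch)
{
	return is_xsk_rq(rq_id, nch) ? rq_id - nch : rq_id;
}

int main(void)
{
	unsigned int nch = 8;	/* changing nch would remap all XSK RQ IDs */

	assert(xsk_rq_id(3, nch) == 11);
	assert(rq_id_to_channel(11, nch) == 3);
	assert(!is_xsk_rq(3, nch) && is_xsk_rq(11, nch));
	return 0;
}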
|
|
|
struct mlx5e_xsk_param;
|
|
|
|
|
|
|
|
struct mlx5e_rq_param;
|
|
|
|
int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk,
|
|
|
|
struct xdp_umem *umem, struct mlx5e_rq *rq);
|
|
|
|
int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time);
|
|
|
|
void mlx5e_deactivate_rq(struct mlx5e_rq *rq);
|
|
|
|
void mlx5e_close_rq(struct mlx5e_rq *rq);
|
|
|
|
|
|
|
|
struct mlx5e_sq_param;
|
|
|
|
int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_sq_param *param, struct mlx5e_icosq *sq);
|
|
|
|
void mlx5e_close_icosq(struct mlx5e_icosq *sq);
|
|
|
|
int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params,
|
|
|
|
struct mlx5e_sq_param *param, struct xdp_umem *umem,
|
|
|
|
struct mlx5e_xdpsq *sq, bool is_redirect);
|
|
|
|
void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq);
|
|
|
|
|
|
|
|
struct mlx5e_cq_param;
|
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-07-03
The following pull-request contains BPF updates for your *net-next* tree.
There is a minor merge conflict in mlx5 due to 8960b38932be ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but the resolution seems not that bad ... getting current
bpf-next out now before more mlx5 changes arrive. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:
** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
static int mlx5e_open_cq(struct mlx5e_channel *c,
struct dim_cq_moder moder,
struct mlx5e_cq_param *param,
struct mlx5e_cq *cq)
=======
int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...
drivers/net/ethernet/mellanox/mlx5/core/en.h +977
... and in mlx5e_open_xsk() ...
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64
... needs the same rename from net_dim_cq_moder into dim_cq_moder.
** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:
<<<<<<< HEAD
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct dim_cq_moder icocq_moder = {0, 0};
struct net_device *netdev = priv->netdev;
struct mlx5e_channel *c;
unsigned int irq;
=======
struct net_dim_cq_moder icocq_moder = {0, 0};
>>>>>>> e5a3e259ef239f443951d401db10db7d426c9497
Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.
Let me know if you run into any issues. Anyway, the main changes are:
1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.
2) Addition of two new per-cgroup BPF hooks for getsockopt and
setsockopt along with a new sockopt program type which allows more
fine-grained pass/reject settings for containers. Also add a sock_ops
callback that can be selectively enabled on a per-socket basis and is
executed for every RTT to help tracking TCP statistics, both features
from Stanislav.
3) Follow-up fix for loops in precision tracking, which were not propagating
precision marks; as a result the verifier assumed that some branches were
not taken and wrongly removed them as dead code, from Alexei.
4) Fix BPF cgroup release synchronization race which could lead to a
double-free if a leaf's cgroup_bpf object is released and a new BPF
program is attached to one of the ancestor cgroups in parallel, from Roman.
5) Support for bulking XDP_TX on veth devices which improves performance
in some cases by around 9%, from Toshiaki.
6) Allow for lookups into BPF devmap and improve feedback when calling into
bpf_redirect_map() as lookup is now performed right away in the helper
itself, from Toke.
7) Add support for fq's Earliest Departure Time to the Host Bandwidth
Manager (HBM) sample BPF program, from Lawrence.
8) Various cleanups and minor fixes all over the place from many others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 13:48:21 -06:00
|
|
|
int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_cq_param *param, struct mlx5e_cq *cq);
|
|
|
|
void mlx5e_close_cq(struct mlx5e_cq *cq);
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
int mlx5e_open_locked(struct net_device *netdev);
|
|
|
|
int mlx5e_close_locked(struct net_device *netdev);
|
2016-12-27 05:57:03 -07:00
|
|
|
|
|
|
|
int mlx5e_open_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *chs);
|
|
|
|
void mlx5e_close_channels(struct mlx5e_channels *chs);
|
2017-02-12 16:19:14 -07:00
|
|
|
|
|
|
|
/* Function pointer to be used to modify HW settings while
|
|
|
|
* switching channels
|
|
|
|
*/
|
|
|
|
typedef int (*mlx5e_fp_hw_modify)(struct mlx5e_priv *priv);
|
2019-03-28 06:26:47 -06:00
|
|
|
int mlx5e_safe_reopen_channels(struct mlx5e_priv *priv);
|
2018-11-26 08:22:16 -07:00
|
|
|
int mlx5e_safe_switch_channels(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_channels *new_chs,
|
|
|
|
mlx5e_fp_hw_modify hw_modify);
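A short usage sketch (editorial; the callback is hypothetical): the switch
helper brings up the new channel set, runs the optional hw_modify hook,
and releases the old set.

static int example_hw_modify(struct mlx5e_priv *priv)
{
	/* reprogram a HW setting that must match the new channel set */
	return 0;
}

/* usage: mlx5e_safe_switch_channels(priv, &new_chs, example_hw_modify); */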
|
2017-04-12 21:36:59 -06:00
|
|
|
void mlx5e_activate_priv_channels(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv);
|
2016-12-27 05:57:03 -07:00
|
|
|
|
2017-06-07 04:55:34 -06:00
|
|
|
void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
|
2016-02-29 12:17:13 -07:00
|
|
|
int num_channels);
|
2017-09-26 07:20:43 -06:00
|
|
|
void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params,
|
|
|
|
u8 cq_period_mode);
|
2016-06-23 08:02:40 -06:00
|
|
|
void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
|
|
|
|
u8 cq_period_mode);
|
2018-02-07 05:51:45 -07:00
|
|
|
void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params);
|
2017-11-14 00:44:55 -07:00
|
|
|
void mlx5e_init_rq_type_params(struct mlx5_core_dev *mdev,
|
2018-02-18 02:37:06 -07:00
|
|
|
struct mlx5e_params *params);
|
2019-06-25 08:44:28 -06:00
|
|
|
int mlx5e_modify_rq_state(struct mlx5e_rq *rq, int curr_state, int next_state);
|
|
|
|
void mlx5e_activate_rq(struct mlx5e_rq *rq);
|
|
|
|
void mlx5e_deactivate_rq(struct mlx5e_rq *rq);
|
|
|
|
void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
|
|
|
|
void mlx5e_activate_icosq(struct mlx5e_icosq *icosq);
|
|
|
|
void mlx5e_deactivate_icosq(struct mlx5e_icosq *icosq);
|
2016-06-23 08:02:40 -06:00
|
|
|
|
net/mlx5e: Add tx reporter support
Add mlx5e tx reporter to devlink health reporters. This reporter will be
responsible for diagnosing, reporting, and recovering from tx errors.
This patch declares the TX reporter operations and creates it using the
devlink health API. Currently, this reporter supports reporting and
recovering from send error CQE only. In addition, it adds diagnose
information for the open SQs.
For a local SQ recover (due to a driver error report), a failure to
recover the SQ marks the recover operation as failed.
For a full tx recover, an attempt is made to close and reopen the
channels; if that passes, the recover is considered successful.
The SQ recover-from-error-CQE flow is not a new feature in the driver;
this patch re-organizes the functions and adapts them for the devlink
health API. For this purpose, code moves from en_main.c to a new file
named reporter_tx.c.
Diagnose output:
$devlink health diagnose pci/0000:00:09.0 reporter tx -j -p
{
"SQs": [ {
"sqn": 138,
"HW state": 1,
"stopped": false
},{
"sqn": 142,
"HW state": 1,
"stopped": false
} ]
}
$devlink health diagnose pci/0000:00:09.0 reporter tx
SQs:
sqn: 138 HW state: 1 stopped: false
sqn: 142 HW state: 1 stopped: false
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 02:36:40 -07:00
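The two recover levels described above can be modeled as a standalone
sketch (hypothetical names; the real flow lives in reporter_tx.c).

#include <stdbool.h>
#include <stdio.h>

static bool recover_sq(int sqn)
{
	printf("recovering sqn %d from an error CQE\n", sqn);
	return true;	/* pretend the SQ came back */
}

static bool reopen_channels(void)
{
	printf("full tx recover: close and reopen the channels\n");
	return true;
}

static bool tx_reporter_recover(int sqn, bool full)
{
	return full ? reopen_channels() : recover_sq(sqn);
}

int main(void)
{
	bool local_ok = tx_reporter_recover(138, false);
	bool full_ok = tx_reporter_recover(142, true);

	printf("local: %s, full: %s\n", local_ok ? "ok" : "failed",
	       full_ok ? "ok" : "failed");
	return 0;
}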
|
|
|
int mlx5e_modify_sq(struct mlx5_core_dev *mdev, u32 sqn,
|
|
|
|
struct mlx5e_modify_sq_param *p);
|
|
|
|
void mlx5e_activate_txqsq(struct mlx5e_txqsq *sq);
|
|
|
|
void mlx5e_tx_disable_queue(struct netdev_queue *txq);
|
|
|
|
|
2019-03-21 16:51:38 -06:00
|
|
|
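/* SWP (software parser) TX offload is usable only when the base capability
 * and both its checksum and LSO variants are present.
 */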
static inline bool mlx5_tx_swp_supported(struct mlx5_core_dev *mdev)
|
|
|
|
{
|
|
|
|
return MLX5_CAP_ETH(mdev, swp) &&
|
|
|
|
MLX5_CAP_ETH(mdev, swp_csum) && MLX5_CAP_ETH(mdev, swp_lso);
|
|
|
|
}
|
|
|
|
|
2015-05-28 13:28:48 -06:00
|
|
|
extern const struct ethtool_ops mlx5e_ethtool_ops;
|
2016-02-22 09:17:26 -07:00
|
|
|
#ifdef CONFIG_MLX5_CORE_EN_DCB
|
|
|
|
extern const struct dcbnl_rtnl_ops mlx5e_dcbnl_ops;
|
|
|
|
int mlx5e_dcbnl_ieee_setets_core(struct mlx5e_priv *priv, struct ieee_ets *ets);
|
2016-11-27 08:02:07 -07:00
|
|
|
void mlx5e_dcbnl_initialize(struct mlx5e_priv *priv);
|
2017-07-18 15:23:36 -06:00
|
|
|
void mlx5e_dcbnl_init_app(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_dcbnl_delete_app(struct mlx5e_priv *priv);
|
2016-02-22 09:17:26 -07:00
|
|
|
#endif
|
|
|
|
|
2016-07-01 05:51:05 -06:00
|
|
|
int mlx5e_create_tir(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_tir *tir, u32 *in, int inlen);
|
|
|
|
void mlx5e_destroy_tir(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_tir *tir);
|
2016-07-01 05:51:04 -06:00
|
|
|
int mlx5e_create_mdev_resources(struct mlx5_core_dev *mdev);
|
|
|
|
void mlx5e_destroy_mdev_resources(struct mlx5_core_dev *mdev);
|
2016-12-20 08:30:20 -07:00
|
|
|
int mlx5e_refresh_tirs(struct mlx5e_priv *priv, bool enable_uc_lb);
|
2016-02-22 09:17:31 -07:00
|
|
|
|
2017-04-12 21:36:57 -06:00
|
|
|
/* common netdev helpers */
|
2018-08-04 21:58:05 -06:00
|
|
|
void mlx5e_create_q_counters(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_destroy_q_counters(struct mlx5e_priv *priv);
|
|
|
|
int mlx5e_open_drop_rq(struct mlx5e_priv *priv,
|
|
|
|
struct mlx5e_rq *drop_rq);
|
|
|
|
void mlx5e_close_drop_rq(struct mlx5e_rq *drop_rq);
|
|
|
|
|
2017-04-12 21:36:56 -06:00
|
|
|
int mlx5e_create_indirect_rqt(struct mlx5e_priv *priv);
|
|
|
|
|
2018-08-28 11:53:55 -06:00
|
|
|
int mlx5e_create_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
|
|
|
|
void mlx5e_destroy_indirect_tirs(struct mlx5e_priv *priv, bool inner_ttc);
|
2017-04-12 21:36:56 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
int mlx5e_create_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
|
|
|
|
void mlx5e_destroy_direct_rqts(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
|
|
|
|
int mlx5e_create_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
|
|
|
|
void mlx5e_destroy_direct_tirs(struct mlx5e_priv *priv, struct mlx5e_tir *tirs);
|
2017-04-12 21:36:56 -06:00
|
|
|
void mlx5e_destroy_rqt(struct mlx5e_priv *priv, struct mlx5e_rqt *rqt);
|
|
|
|
|
2019-07-05 09:30:20 -06:00
|
|
|
int mlx5e_create_tis(struct mlx5_core_dev *mdev, void *in, u32 *tisn);
|
2017-04-12 21:36:58 -06:00
|
|
|
void mlx5e_destroy_tis(struct mlx5_core_dev *mdev, u32 tisn);
|
|
|
|
|
net/mlx5e: Introduce SRIOV VF representors
Implement the relevant profile functions to create mlx5e driver instance
serving as VF representor. When SRIOV offloads mode is enabled, each VF
will have a representor netdevice instance on the host.
To do that, we also export a set of shared service functions from en_main.c,
such that they can be used by both NIC and representor netdevs.
The newly created representor netdevice has a basic set of net_device_ops
which are the same ndo functions as the NIC netdevice and an ndo of its
own for phys port name.
The profiling infrastructure allows sharing code between the NIC and the
vport representor even though the representor has only a subset of the
NIC functionality.
The VF reps and the PF which is used in that mode to represent the uplink,
expose switchdev ops. Currently the only op supported is attr get for the
port parent ID which here serves to identify net-devices belonging to the
same HW E-Switch. Other than that, no offloading is implemented and hence
switching functionality is achieved if one sets SW switching rules, e.g.
using tc, bridge or ovs.
Port phys name (ndo_get_phys_port_name) is implemented to allow exporting
to user-space the VF vport number, which along with the switchdev port parent
id (phys_switch_id) enables a udev-based consistent naming scheme:
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
ATTR{phys_port_name}!="", NAME="$PF_NIC$attr{phys_port_name}"
where phys_switch_id is exposed by the PF (and VF reps) and $PF_NIC is
the name of the PF netdevice.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 05:51:09 -06:00
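A standalone sketch of the phys port name idea above (the exact format
string is hypothetical): the representor reports a stable per-VF name that
the quoted udev rule appends to the PF netdevice name.

#include <stdio.h>

static int rep_get_phys_port_name(int vf_vport, char *buf, size_t len)
{
	int ret = snprintf(buf, len, "%d", vf_vport);	/* hypothetical format */

	return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}

int main(void)
{
	char name[8];

	if (!rep_get_phys_port_name(2, name, sizeof(name)))
		printf("phys_port_name=%s\n", name);
	return 0;
}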
|
|
|
int mlx5e_create_tises(struct mlx5e_priv *priv);
|
2019-06-24 03:03:02 -06:00
|
|
|
void mlx5e_destroy_tises(struct mlx5e_priv *priv);
|
net/mlx5e: Don't refresh TIRs when updating representor SQs
Refreshing TIRs is done in order to update the TIRs with the current
state of SQs in the transport domain, so that the TIRs can filter out
undesired self-loopback packets based on the source SQ of the packet.
Representor TIRs will only receive packets that originate from their
associated vport, due to dedicated steering, and therefore will never
receive self-loopback packets, whose source vport will be the vport of
the E-Switch manager, and therefore not the vport associated with the
representor. As such, it is not necessary to refresh the representors'
TIRs, since self-loopback packets can't reach them.
Since representors only exist in switchdev mode, and there is no
scenario in which a representor will exist in the transport domain
alongside a non-representor, it is not necessary to refresh the
transport domain's TIRs upon changing the state of a representor's
queues. Therefore, do not refresh TIRs upon such a change. Achieve
this by adding an update_rx callback to the mlx5e_profile, which
refreshes TIRs for non-representors and does nothing for representors,
and replace instances of mlx5e_refresh_tirs() upon changing the state
of the queues with update_rx().
Signed-off-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-23 00:58:56 -06:00
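A standalone model of the update_rx split described above (hypothetical
types; the driver wires this through the mlx5e_profile callbacks).

#include <stdio.h>

struct priv;

struct profile {
	int (*update_rx)(struct priv *priv);
};

static int nic_update_rx(struct priv *priv)
{
	(void)priv;
	printf("refreshing TIRs for self-loopback filtering\n");
	return 0;
}

static int rep_update_rx(struct priv *priv)
{
	(void)priv;	/* nothing to refresh for a representor */
	return 0;
}

static const struct profile nic_profile = { .update_rx = nic_update_rx };
static const struct profile rep_profile = { .update_rx = rep_update_rx };

int main(void)
{
	nic_profile.update_rx(NULL);
	rep_profile.update_rx(NULL);
	return 0;
}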
|
|
|
int mlx5e_update_nic_rx(struct mlx5e_priv *priv);
|
2018-11-08 11:42:55 -07:00
|
|
|
void mlx5e_update_carrier(struct mlx5e_priv *priv);
|
net/mlx5e: Introduce SRIOV VF representors
2016-07-01 05:51:09 -06:00
|
|
|
int mlx5e_close(struct net_device *netdev);
|
|
|
|
int mlx5e_open(struct net_device *netdev);
|
2018-10-20 07:18:00 -06:00
|
|
|
void mlx5e_update_ndo_stats(struct mlx5e_priv *priv);
|
net/mlx5e: Introduce SRIOV VF representors
2016-07-01 05:51:09 -06:00
|
|
|
|
2018-09-12 00:45:33 -06:00
|
|
|
void mlx5e_queue_update_stats(struct mlx5e_priv *priv);
|
2017-11-26 11:39:12 -07:00
|
|
|
int mlx5e_bits_invert(unsigned long a, int size);
|
|
|
|
|
2018-04-01 07:54:27 -06:00
|
|
|
typedef int (*change_hw_mtu_cb)(struct mlx5e_priv *priv);
|
2018-02-13 06:48:30 -07:00
|
|
|
int mlx5e_set_dev_port_mtu(struct mlx5e_priv *priv);
|
2018-04-01 07:54:27 -06:00
|
|
|
int mlx5e_change_mtu(struct net_device *netdev, int new_mtu,
|
|
|
|
change_hw_mtu_cb set_mtu_cb);
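A standalone model of the callback-style MTU change above (bounds and
names are hypothetical): one generic helper validates, and the
caller-supplied callback programs the hardware; a caller that needs no HW
step can pass NULL.

#include <stdio.h>

typedef int (*set_mtu_cb)(int new_mtu);

static int set_hw_mtu(int new_mtu)
{
	printf("programming HW MTU %d\n", new_mtu);
	return 0;
}

static int change_mtu(int new_mtu, set_mtu_cb cb)
{
	if (new_mtu < 68 || new_mtu > 9978)	/* hypothetical bounds */
		return -1;
	return cb ? cb(new_mtu) : 0;	/* cb == NULL: no HW step needed */
}

int main(void)
{
	return change_mtu(1500, set_hw_mtu);
}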
|
|
|
|
|
2017-05-15 04:32:28 -06:00
|
|
|
/* ethtool helpers */
|
|
|
|
void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_drvinfo *drvinfo);
|
|
|
|
void mlx5e_ethtool_get_strings(struct mlx5e_priv *priv,
|
|
|
|
uint32_t stringset, uint8_t *data);
|
|
|
|
int mlx5e_ethtool_get_sset_count(struct mlx5e_priv *priv, int sset);
|
|
|
|
void mlx5e_ethtool_get_ethtool_stats(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_stats *stats, u64 *data);
|
|
|
|
void mlx5e_ethtool_get_ringparam(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_ringparam *param);
|
|
|
|
int mlx5e_ethtool_set_ringparam(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_ringparam *param);
|
|
|
|
void mlx5e_ethtool_get_channels(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_channels *ch);
|
|
|
|
int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_channels *ch);
|
|
|
|
int mlx5e_ethtool_get_coalesce(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_coalesce *coal);
|
|
|
|
int mlx5e_ethtool_set_coalesce(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_coalesce *coal);
|
2018-11-06 10:31:10 -07:00
|
|
|
int mlx5e_ethtool_get_link_ksettings(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_link_ksettings *link_ksettings);
|
|
|
|
int mlx5e_ethtool_set_link_ksettings(struct mlx5e_priv *priv,
|
|
|
|
const struct ethtool_link_ksettings *link_ksettings);
|
2018-08-26 03:53:51 -06:00
|
|
|
u32 mlx5e_ethtool_get_rxfh_key_size(struct mlx5e_priv *priv);
|
|
|
|
u32 mlx5e_ethtool_get_rxfh_indir_size(struct mlx5e_priv *priv);
|
2017-06-01 05:43:43 -06:00
|
|
|
int mlx5e_ethtool_get_ts_info(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_ts_info *info);
|
2019-08-01 05:27:30 -06:00
|
|
|
int mlx5e_ethtool_flash_device(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_flash *flash);
|
2018-11-06 10:31:10 -07:00
|
|
|
void mlx5e_ethtool_get_pauseparam(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_pauseparam *pauseparam);
|
|
|
|
int mlx5e_ethtool_set_pauseparam(struct mlx5e_priv *priv,
|
|
|
|
struct ethtool_pauseparam *pauseparam);
|
2017-05-15 04:32:28 -06:00
|
|
|
|
2017-04-12 21:36:54 -06:00
|
|
|
/* mlx5e generic netdev management API */
|
2018-09-12 16:02:05 -06:00
|
|
|
int mlx5e_netdev_init(struct net_device *netdev,
|
|
|
|
struct mlx5e_priv *priv,
|
|
|
|
struct mlx5_core_dev *mdev,
|
|
|
|
const struct mlx5e_profile *profile,
|
|
|
|
void *ppriv);
|
2018-10-02 00:54:59 -06:00
|
|
|
void mlx5e_netdev_cleanup(struct net_device *netdev, struct mlx5e_priv *priv);
|
2017-04-12 21:36:54 -06:00
|
|
|
struct net_device*
|
|
|
|
mlx5e_create_netdev(struct mlx5_core_dev *mdev, const struct mlx5e_profile *profile,
|
2018-09-06 05:56:56 -06:00
|
|
|
int nch, void *ppriv);
|
2017-04-12 21:36:54 -06:00
|
|
|
int mlx5e_attach_netdev(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_detach_netdev(struct mlx5e_priv *priv);
|
|
|
|
void mlx5e_destroy_netdev(struct mlx5e_priv *priv);
|
2019-01-22 04:42:10 -07:00
|
|
|
void mlx5e_set_netdev_mtu_boundaries(struct mlx5e_priv *priv);
|
2017-04-12 21:36:56 -06:00
|
|
|
void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
|
net/mlx5e: Add XSK zero-copy support
2019-06-26 08:35:38 -06:00
|
|
|
struct mlx5e_xsk *xsk,
|
2018-11-06 12:05:29 -07:00
|
|
|
struct mlx5e_rss_params *rss_params,
|
2017-04-12 21:36:56 -06:00
|
|
|
struct mlx5e_params *params,
|
2018-03-12 06:24:41 -06:00
|
|
|
u16 max_channels, u16 mtu);
|
2018-08-16 05:25:24 -06:00
|
|
|
void mlx5e_build_rq_params(struct mlx5_core_dev *mdev,
|
|
|
|
struct mlx5e_params *params);
|
2018-11-06 12:05:29 -07:00
|
|
|
void mlx5e_build_rss_params(struct mlx5e_rss_params *rss_params,
|
|
|
|
u16 num_channels);
|
2018-01-09 14:06:17 -07:00
|
|
|
void mlx5e_rx_dim_work(struct work_struct *work);
|
2018-04-24 04:36:03 -06:00
|
|
|
void mlx5e_tx_dim_work(struct work_struct *work);
|
2018-11-01 11:14:21 -06:00
|
|
|
|
|
|
|
void mlx5e_add_vxlan_port(struct net_device *netdev, struct udp_tunnel_info *ti);
|
|
|
|
void mlx5e_del_vxlan_port(struct net_device *netdev, struct udp_tunnel_info *ti);
|
|
|
|
netdev_features_t mlx5e_features_check(struct sk_buff *skb,
|
|
|
|
struct net_device *netdev,
|
|
|
|
netdev_features_t features);
|
2019-05-16 03:36:43 -06:00
|
|
|
int mlx5e_set_features(struct net_device *netdev, netdev_features_t features);
|
2018-11-01 11:14:21 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
|
|
|
int mlx5e_set_vf_mac(struct net_device *dev, int vf, u8 *mac);
|
|
|
|
int mlx5e_set_vf_rate(struct net_device *dev, int vf, int min_tx_rate, int max_tx_rate);
|
|
|
|
int mlx5e_get_vf_config(struct net_device *dev, int vf, struct ifla_vf_info *ivi);
|
|
|
|
int mlx5e_get_vf_stats(struct net_device *dev, int vf, struct ifla_vf_stats *vf_stats);
|
|
|
|
#endif
|
2016-02-22 09:17:31 -07:00
|
|
|
#endif /* __MLX5_EN_H__ */
|