2015-05-28 13:28:46 -06:00
|
|
|
/*
|
|
|
|
* Copyright (c) 2015, Mellanox Technologies. All rights reserved.
|
|
|
|
*
|
|
|
|
* This software is available to you under a choice of one of two
|
|
|
|
* licenses. You may choose to be licensed under the terms of the GNU
|
|
|
|
* General Public License (GPL) Version 2, available from the file
|
|
|
|
* COPYING in the main directory of this source tree, or the
|
|
|
|
* OpenIB.org BSD license below:
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or
|
|
|
|
* without modification, are permitted provided that the following
|
|
|
|
* conditions are met:
|
|
|
|
*
|
|
|
|
* - Redistributions of source code must retain the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer.
|
|
|
|
*
|
|
|
|
* - Redistributions in binary form must reproduce the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer in the documentation and/or other materials
|
|
|
|
* provided with the distribution.
|
|
|
|
*
|
|
|
|
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
|
|
|
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
|
|
|
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
|
|
|
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
|
|
|
|
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
|
|
|
|
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
|
|
|
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
|
|
* SOFTWARE.
|
|
|
|
*/
|
|
|
|
|
2017-02-22 08:20:11 -07:00
|
|
|
#include <linux/prefetch.h>
|
2015-05-28 13:28:46 -06:00
|
|
|
#include <linux/ip.h>
|
|
|
|
#include <linux/ipv6.h>
|
|
|
|
#include <linux/tcp.h>
|
2019-06-12 04:18:35 -06:00
|
|
|
#include <linux/indirect_call_wrapper.h>
|
2017-12-19 23:48:24 -07:00
|
|
|
#include <net/ip6_checksum.h>
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator. And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
|
|
|
#include <net/page_pool.h>
|
2018-03-20 09:15:31 -06:00
|
|
|
#include <net/inet_ecn.h>
|
2015-05-28 13:28:46 -06:00
|
|
|
#include "en.h"
|
2016-03-08 03:42:38 -07:00
|
|
|
#include "en_tc.h"
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
#include "eswitch.h"
|
2017-04-24 03:36:42 -06:00
|
|
|
#include "en_rep.h"
|
2017-06-18 08:13:44 -06:00
|
|
|
#include "ipoib/ipoib.h"
|
2017-06-19 05:04:36 -06:00
|
|
|
#include "en_accel/ipsec_rxtx.h"
|
2018-07-13 05:33:48 -06:00
|
|
|
#include "en_accel/tls_rxtx.h"
|
2017-08-15 04:46:04 -06:00
|
|
|
#include "lib/clock.h"
|
2018-07-15 01:28:44 -06:00
|
|
|
#include "en/xdp.h"
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
#include "en/xsk/rx.h"
|
2019-06-14 16:21:15 -06:00
|
|
|
#include "en/health.h"
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2017-08-15 04:46:04 -06:00
|
|
|
static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
|
2015-12-29 05:58:31 -07:00
|
|
|
{
|
2017-08-15 04:46:04 -06:00
|
|
|
return config->rx_filter == HWTSTAMP_FILTER_ALL;
|
2015-12-29 05:58:31 -07:00
|
|
|
}
|
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
|
|
|
|
u32 cqcc, void *data)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
u32 ci = mlx5_cqwq_ctr2ix(wq, cqcc);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
memcpy(data, mlx5_cqwq_get_wqe(wq, ci), sizeof(struct mlx5_cqe64));
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_read_title_slot(struct mlx5e_rq *rq,
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *wq,
|
|
|
|
u32 cqcc)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp *cqd = &rq->cqd;
|
|
|
|
struct mlx5_cqe64 *title = &cqd->title;
|
|
|
|
|
|
|
|
mlx5e_read_cqe_slot(wq, cqcc, title);
|
|
|
|
cqd->left = be32_to_cpu(title->byte_cnt);
|
|
|
|
cqd->wqe_counter = be16_to_cpu(title->wqe_counter);
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->cqe_compress_blks++;
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
static inline void mlx5e_read_mini_arr_slot(struct mlx5_cqwq *wq,
|
|
|
|
struct mlx5e_cq_decomp *cqd,
|
|
|
|
u32 cqcc)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5e_read_cqe_slot(wq, cqcc, cqd->mini_arr);
|
|
|
|
cqd->mini_arr_idx = 0;
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
static inline void mlx5e_cqes_update_owner(struct mlx5_cqwq *wq, int n)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
u32 cqcc = wq->cc;
|
2018-05-02 09:30:56 -06:00
|
|
|
u8 op_own = mlx5_cqwq_get_ctr_wrap_cnt(wq, cqcc) & 1;
|
|
|
|
u32 ci = mlx5_cqwq_ctr2ix(wq, cqcc);
|
|
|
|
u32 wq_sz = mlx5_cqwq_get_size(wq);
|
2016-05-10 15:29:14 -06:00
|
|
|
u32 ci_top = min_t(u32, wq_sz, ci + n);
|
|
|
|
|
|
|
|
for (; ci < ci_top; ci++, n--) {
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqe64 *cqe = mlx5_cqwq_get_wqe(wq, ci);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
|
|
|
cqe->op_own = op_own;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(ci == wq_sz)) {
|
|
|
|
op_own = !op_own;
|
|
|
|
for (ci = 0; ci < n; ci++) {
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqe64 *cqe = mlx5_cqwq_get_wqe(wq, ci);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
|
|
|
cqe->op_own = op_own;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_decompress_cqe(struct mlx5e_rq *rq,
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *wq,
|
|
|
|
u32 cqcc)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp *cqd = &rq->cqd;
|
|
|
|
struct mlx5_mini_cqe8 *mini_cqe = &cqd->mini_arr[cqd->mini_arr_idx];
|
|
|
|
struct mlx5_cqe64 *title = &cqd->title;
|
|
|
|
|
|
|
|
title->byte_cnt = mini_cqe->byte_cnt;
|
|
|
|
title->check_sum = mini_cqe->checksum;
|
|
|
|
title->op_own &= 0xf0;
|
|
|
|
title->op_own |= 0x01 & (cqcc >> wq->fbc.log_sz);
|
|
|
|
title->wqe_counter = cpu_to_be16(cqd->wqe_counter);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
2017-02-22 08:20:16 -07:00
|
|
|
if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
|
2018-09-12 09:32:49 -06:00
|
|
|
cqd->wqe_counter += mpwrq_get_cqe_consumed_strides(title);
|
2017-02-22 08:20:16 -07:00
|
|
|
else
|
2018-09-12 09:32:49 -06:00
|
|
|
cqd->wqe_counter =
|
|
|
|
mlx5_wq_cyc_ctr2ix(&rq->wqe.wq, cqd->wqe_counter + 1);
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *wq,
|
|
|
|
u32 cqcc)
|
2016-05-10 15:29:14 -06:00
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp *cqd = &rq->cqd;
|
|
|
|
|
|
|
|
mlx5e_decompress_cqe(rq, wq, cqcc);
|
|
|
|
cqd->title.rss_hash_type = 0;
|
|
|
|
cqd->title.rss_hash_result = 0;
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *wq,
|
2016-05-10 15:29:14 -06:00
|
|
|
int update_owner_only,
|
|
|
|
int budget_rem)
|
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp *cqd = &rq->cqd;
|
|
|
|
u32 cqcc = wq->cc + update_owner_only;
|
2016-05-10 15:29:14 -06:00
|
|
|
u32 cqe_count;
|
|
|
|
u32 i;
|
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
cqe_count = min_t(u32, cqd->left, budget_rem);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
|
|
|
for (i = update_owner_only; i < cqe_count;
|
2018-09-12 09:32:49 -06:00
|
|
|
i++, cqd->mini_arr_idx++, cqcc++) {
|
|
|
|
if (cqd->mini_arr_idx == MLX5_MINI_CQE_ARRAY_SIZE)
|
|
|
|
mlx5e_read_mini_arr_slot(wq, cqd, cqcc);
|
2016-05-10 15:29:14 -06:00
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5e_decompress_cqe_no_hash(rq, wq, cqcc);
|
|
|
|
rq->handle_rx_cqe(rq, &cqd->title);
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5e_cqes_update_owner(wq, cqcc - wq->cc);
|
|
|
|
wq->cc = cqcc;
|
|
|
|
cqd->left -= cqe_count;
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->cqe_compress_pkts += cqe_count;
|
2016-05-10 15:29:14 -06:00
|
|
|
|
|
|
|
return cqe_count;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *wq,
|
2016-05-10 15:29:14 -06:00
|
|
|
int budget_rem)
|
|
|
|
{
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5e_cq_decomp *cqd = &rq->cqd;
|
|
|
|
u32 cc = wq->cc;
|
2016-05-10 15:29:14 -06:00
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5e_read_title_slot(rq, wq, cc);
|
|
|
|
mlx5e_read_mini_arr_slot(wq, cqd, cc + 1);
|
|
|
|
mlx5e_decompress_cqe(rq, wq, cc);
|
|
|
|
rq->handle_rx_cqe(rq, &cqd->title);
|
|
|
|
cqd->mini_arr_idx++;
|
|
|
|
|
|
|
|
return mlx5e_decompress_cqes_cont(rq, wq, 1, budget_rem) - 1;
|
2016-05-10 15:29:14 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
static inline bool mlx5e_page_is_reserved(struct page *page)
|
|
|
|
{
|
2017-07-13 09:26:40 -06:00
|
|
|
return page_is_pfmemalloc(page) || page_to_nid(page) != numa_mem_id();
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info)
|
2015-05-28 13:28:46 -06:00
|
|
|
{
|
2016-09-21 03:19:42 -06:00
|
|
|
struct mlx5e_page_cache *cache = &rq->page_cache;
|
|
|
|
u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
if (tail_next == cache->head) {
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->cache_full++;
|
2016-09-21 03:19:42 -06:00
|
|
|
return false;
|
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2017-07-13 09:26:40 -06:00
|
|
|
if (unlikely(mlx5e_page_is_reserved(dma_info->page))) {
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->cache_waive++;
|
2017-01-19 00:03:08 -07:00
|
|
|
return false;
|
2017-07-13 09:26:40 -06:00
|
|
|
}
|
2017-01-19 00:03:08 -07:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
cache->page_cache[cache->tail] = *dma_info;
|
|
|
|
cache->tail = tail_next;
|
|
|
|
return true;
|
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info)
|
|
|
|
{
|
|
|
|
struct mlx5e_page_cache *cache = &rq->page_cache;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
2016-09-21 03:19:42 -06:00
|
|
|
|
|
|
|
if (unlikely(cache->head == cache->tail)) {
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->cache_empty++;
|
2016-09-21 03:19:42 -06:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->cache_busy++;
|
2016-09-21 03:19:42 -06:00
|
|
|
return false;
|
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
*dma_info = cache->page_cache[cache->head];
|
|
|
|
cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->cache_reuse++;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
dma_sync_single_for_device(rq->pdev, dma_info->addr,
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
PAGE_SIZE,
|
2016-09-21 03:19:42 -06:00
|
|
|
DMA_FROM_DEVICE);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
static inline int mlx5e_page_alloc_pool(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info)
|
2016-09-21 03:19:42 -06:00
|
|
|
{
|
|
|
|
if (mlx5e_rx_cache_get(rq, dma_info))
|
|
|
|
return 0;
|
|
|
|
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator. And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
|
|
|
dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
|
2017-10-15 08:30:59 -06:00
|
|
|
if (unlikely(!dma_info->page))
|
2016-09-21 03:19:42 -06:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2017-10-15 08:30:59 -06:00
|
|
|
dma_info->addr = dma_map_page(rq->pdev, dma_info->page, 0,
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
PAGE_SIZE, rq->buff.map_dir);
|
2016-09-21 03:19:42 -06:00
|
|
|
if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
|
2019-06-18 07:05:42 -06:00
|
|
|
page_pool_recycle_direct(rq->page_pool, dma_info->page);
|
2017-10-15 08:30:59 -06:00
|
|
|
dma_info->page = NULL;
|
2016-09-21 03:19:42 -06:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
|
|
|
return 0;
|
2016-09-21 03:19:42 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info)
|
|
|
|
{
|
|
|
|
if (rq->umem)
|
|
|
|
return mlx5e_xsk_page_alloc_umem(rq, dma_info);
|
|
|
|
else
|
|
|
|
return mlx5e_page_alloc_pool(rq, dma_info);
|
|
|
|
}
|
|
|
|
|
2018-07-15 01:28:44 -06:00
|
|
|
void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
|
2018-04-17 08:45:21 -06:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, rq->buff.map_dir);
|
2018-04-17 08:45:21 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info,
|
|
|
|
bool recycle)
|
2016-09-21 03:19:42 -06:00
|
|
|
{
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator. And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
|
|
|
if (likely(recycle)) {
|
|
|
|
if (mlx5e_rx_cache_put(rq, dma_info))
|
|
|
|
return;
|
|
|
|
|
|
|
|
mlx5e_page_dma_unmap(rq, dma_info);
|
|
|
|
page_pool_recycle_direct(rq->page_pool, dma_info->page);
|
|
|
|
} else {
|
|
|
|
mlx5e_page_dma_unmap(rq, dma_info);
|
2019-06-18 07:05:42 -06:00
|
|
|
page_pool_release_page(rq->page_pool, dma_info->page);
|
mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator. And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.
The performance improvement for XDP_REDIRECT in this patch is really
good. Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).
The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).
Before this patch for mlx5, XDP redirected frames were returned via
the page allocator. The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).
Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps. This is very
close to our 13Mpps max target.
The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing). It is also planned to remove this unnecessary DMA
unmap in a later patchset
V2: Adjustments requested by Tariq
- Changed page_pool_create return codes not return NULL, only
ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release
- Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V5: Updated patch desc
V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
- Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
- Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
- Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
V10: Req from Tariq
- Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:27 -06:00
|
|
|
put_page(dma_info->page);
|
|
|
|
}
|
2016-09-21 03:19:42 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
static inline void mlx5e_page_release(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_dma_info *dma_info,
|
|
|
|
bool recycle)
|
|
|
|
{
|
|
|
|
if (rq->umem)
|
|
|
|
/* The `recycle` parameter is ignored, and the page is always
|
|
|
|
* put into the Reuse Ring, because there is no way to return
|
|
|
|
* the page to the userspace when the interface goes down.
|
|
|
|
*/
|
|
|
|
mlx5e_xsk_page_release(rq, dma_info);
|
|
|
|
else
|
|
|
|
mlx5e_page_release_dynamic(rq, dma_info, recycle);
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5e_wqe_frag_info *frag)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
if (!frag->offset)
|
|
|
|
/* On first frag (offset == 0), replenish page (dma_info actually).
|
|
|
|
* Other frags that point to the same dma_info (with a different
|
|
|
|
* offset) should just use the new one without replenishing again
|
|
|
|
* by themselves.
|
|
|
|
*/
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_page_alloc(rq, frag->di);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_put_rx_frag(struct mlx5e_rq *rq,
|
2018-06-12 01:09:24 -06:00
|
|
|
struct mlx5e_wqe_frag_info *frag,
|
|
|
|
bool recycle)
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
if (frag->last_in_page)
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_page_release(rq, frag->di, recycle);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
static inline struct mlx5e_wqe_frag_info *get_frag(struct mlx5e_rq *rq, u16 ix)
|
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
return &rq->wqe.frags[ix << rq->wqe.info.log_num_frags];
|
2018-04-02 08:31:31 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
static int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe_cyc *wqe,
|
|
|
|
u16 ix)
|
2016-09-21 03:19:42 -06:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_wqe_frag_info *frag = get_frag(rq, ix);
|
|
|
|
int err;
|
|
|
|
int i;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
for (i = 0; i < rq->wqe.info.num_frags; i++, frag++) {
|
|
|
|
err = mlx5e_get_rx_frag(rq, frag);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto free_frags;
|
|
|
|
|
|
|
|
wqe->data[i].addr = cpu_to_be64(frag->di->addr +
|
|
|
|
frag->offset + rq->buff.headroom);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-21 03:19:42 -06:00
|
|
|
return 0;
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
|
|
|
|
free_frags:
|
|
|
|
while (--i >= 0)
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_put_rx_frag(rq, --frag, true);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
|
|
|
|
return err;
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
static inline void mlx5e_free_rx_wqe(struct mlx5e_rq *rq,
|
2018-06-12 01:09:24 -06:00
|
|
|
struct mlx5e_wqe_frag_info *wi,
|
|
|
|
bool recycle)
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < rq->wqe.info.num_frags; i++, wi++)
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_put_rx_frag(rq, wi, recycle);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_wqe_frag_info *wi = get_frag(rq, ix);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_wqe(rq, wi, false);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
|
2016-06-30 08:34:46 -06:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
|
|
|
int err;
|
|
|
|
int i;
|
2016-06-30 08:34:46 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
if (rq->umem) {
|
|
|
|
int pages_desired = wqe_bulk << rq->wqe.info.log_num_frags;
|
|
|
|
|
|
|
|
if (unlikely(!mlx5e_xsk_pages_enough_umem(rq, pages_desired)))
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
for (i = 0; i < wqe_bulk; i++) {
|
|
|
|
struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
|
|
|
|
|
|
|
|
err = mlx5e_alloc_rx_wqe(rq, wqe, ix + i);
|
|
|
|
if (unlikely(err))
|
|
|
|
goto free_wqes;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
free_wqes:
|
|
|
|
while (--i >= 0)
|
|
|
|
mlx5e_dealloc_rx_wqe(rq, ix + i);
|
|
|
|
|
|
|
|
return err;
|
2016-06-30 08:34:46 -06:00
|
|
|
}
|
|
|
|
|
2018-04-02 05:30:34 -06:00
|
|
|
static inline void
|
|
|
|
mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
|
|
|
|
struct mlx5e_dma_info *di, u32 frag_offset, u32 len,
|
|
|
|
unsigned int truesize)
|
2016-04-20 13:02:15 -06:00
|
|
|
{
|
2016-05-10 15:29:15 -06:00
|
|
|
dma_sync_single_for_cpu(rq->pdev,
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
di->addr + frag_offset,
|
2016-04-20 13:02:15 -06:00
|
|
|
len, DMA_FROM_DEVICE);
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
page_ref_inc(di->page);
|
2016-04-20 13:02:15 -06:00
|
|
|
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
di->page, frag_offset, len, truesize);
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
2018-04-16 08:11:07 -06:00
|
|
|
static inline void
|
|
|
|
mlx5e_copy_skb_header(struct device *pdev, struct sk_buff *skb,
|
|
|
|
struct mlx5e_dma_info *dma_info,
|
net/mlx5e: RX, Make sure packet header does not cross page boundary
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross page boundary. This requires special care in fast path
that costs LoC, additional runtime instructions and branches.
It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased form 64B to 256B.
Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.
Configuration:
Set Striding RQ and LRO ON (to enabled the non-linear SKB scheme).
GRO OFF, early drop by TC rule.
64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
branches tests.
192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.
256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-25 03:05:22 -06:00
|
|
|
int offset_from, u32 headlen)
|
2018-04-16 08:11:07 -06:00
|
|
|
{
|
|
|
|
const void *from = page_address(dma_info->page) + offset_from;
|
|
|
|
/* Aligning len to sizeof(long) optimizes memcpy performance */
|
|
|
|
unsigned int len = ALIGN(headlen, sizeof(long));
|
|
|
|
|
|
|
|
dma_sync_single_for_cpu(pdev, dma_info->addr + offset_from, len,
|
|
|
|
DMA_FROM_DEVICE);
|
net/mlx5e: RX, Make sure packet header does not cross page boundary
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross page boundary. This requires special care in fast path
that costs LoC, additional runtime instructions and branches.
It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased form 64B to 256B.
Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.
Configuration:
Set Striding RQ and LRO ON (to enabled the non-linear SKB scheme).
GRO OFF, early drop by TC rule.
64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
branches tests.
192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.
256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-25 03:05:22 -06:00
|
|
|
skb_copy_to_linear_data(skb, from, len);
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
2018-06-12 01:09:24 -06:00
|
|
|
static void
|
|
|
|
mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle)
|
2016-04-20 13:02:15 -06:00
|
|
|
{
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
bool no_xdp_xmit;
|
2018-02-07 05:46:36 -07:00
|
|
|
struct mlx5e_dma_info *dma_info = wi->umr.dma_info;
|
2017-07-18 03:07:06 -06:00
|
|
|
int i;
|
2016-04-20 13:02:15 -06:00
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
/* A common case for AF_XDP. */
|
|
|
|
if (bitmap_full(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE))
|
|
|
|
return;
|
|
|
|
|
|
|
|
no_xdp_xmit = bitmap_empty(wi->xdp_xmit_bitmap,
|
|
|
|
MLX5_MPWRQ_PAGES_PER_WQE);
|
|
|
|
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++)
|
2018-02-07 05:46:36 -07:00
|
|
|
if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_page_release(rq, &dma_info[i], recycle);
|
2017-07-18 03:07:06 -06:00
|
|
|
}
|
2016-04-20 13:02:15 -06:00
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
static void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq, u8 n)
|
2017-07-18 03:07:06 -06:00
|
|
|
{
|
2018-04-02 08:23:14 -06:00
|
|
|
struct mlx5_wq_ll *wq = &rq->mpwqe.wq;
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
anyway fallback to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme gives the possibility to use a page-cache
of a fair size, that doesn't inflate the memory footprint, which will
dramatically fix the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
do {
|
|
|
|
u16 next_wqe_index = mlx5_wq_ll_get_wqe_next_ix(wq, wq->head);
|
2017-07-18 03:07:06 -06:00
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
mlx5_wq_ll_push(wq, next_wqe_index);
|
|
|
|
} while (--n);
|
2017-07-18 03:07:06 -06:00
|
|
|
|
|
|
|
/* ensure wqes are visible to device before updating doorbell record */
|
|
|
|
dma_wmb();
|
|
|
|
|
|
|
|
mlx5_wq_ll_update_db_record(wq);
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
2018-04-04 03:54:23 -06:00
|
|
|
static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq,
|
|
|
|
struct mlx5_wq_cyc *wq,
|
net/mlx5: WQ, fixes for fragmented WQ buffers API
mlx5e netdevice used to calculate fragment edges by a call to
mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
indication for queues smaller than a PAGE_SIZE, (broken by default on
PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new
calls/API.
Since (TX/RX) Work Queues buffers are fragmented, here we introduce
changes to the API in core driver, so that it gets a stride index and
returns the index of last stride on same fragment, and an additional
wrapping function that returns the number of physically contiguous
strides that can be written contiguously to the work queue.
This obsoletes the following API functions, and their buggy
usage in EN driver:
* mlx5_wq_cyc_get_frag_size()
* mlx5_wq_cyc_ctr2fragix()
The new API improves modularity and hides the details of such
calculation for mlx5e netdevice and mlx5_ib rdma drivers.
New calculation is also more efficient, and improves performance
as follows:
Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.
Before: 16,477,619 pps
After: 17,085,793 pps
3.7% improvement
Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-08-21 05:41:41 -06:00
|
|
|
u16 pi, u16 nnops)
|
2018-03-21 08:31:08 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_sq_wqe_info *edge_wi, *wi = &sq->db.ico_wqe[pi];
|
|
|
|
|
|
|
|
edge_wi = wi + nnops;
|
|
|
|
|
2018-04-04 03:54:23 -06:00
|
|
|
/* fill sq frag edge with nops to avoid wqe wrapping two pages */
|
2018-03-21 08:31:08 -06:00
|
|
|
for (; wi < edge_wi; wi++) {
|
|
|
|
wi->opcode = MLX5_OPCODE_NOP;
|
|
|
|
mlx5e_post_nop(wq, sq->sqn, &sq->pc);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-07-18 03:07:06 -06:00
|
|
|
static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
|
2016-04-20 13:02:15 -06:00
|
|
|
{
|
2016-09-21 03:19:43 -06:00
|
|
|
struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
|
2017-06-25 07:28:46 -06:00
|
|
|
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[0];
|
2017-07-18 03:07:06 -06:00
|
|
|
struct mlx5e_icosq *sq = &rq->channel->icosq;
|
|
|
|
struct mlx5_wq_cyc *wq = &sq->wq;
|
2017-07-10 03:52:36 -06:00
|
|
|
struct mlx5e_umr_wqe *umr_wqe;
|
2017-12-20 02:56:35 -07:00
|
|
|
u16 xlt_offset = ix << (MLX5E_LOG_ALIGNED_MPWQE_PPW - 1);
|
net/mlx5: WQ, fixes for fragmented WQ buffers API
mlx5e netdevice used to calculate fragment edges by a call to
mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
indication for queues smaller than a PAGE_SIZE, (broken by default on
PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new
calls/API.
Since (TX/RX) Work Queues buffers are fragmented, here we introduce
changes to the API in core driver, so that it gets a stride index and
returns the index of last stride on same fragment, and an additional
wrapping function that returns the number of physically contiguous
strides that can be written contiguously to the work queue.
This obsoletes the following API functions, and their buggy
usage in EN driver:
* mlx5_wq_cyc_get_frag_size()
* mlx5_wq_cyc_ctr2fragix()
The new API improves modularity and hides the details of such
calculation for mlx5e netdevice and mlx5_ib rdma drivers.
New calculation is also more efficient, and improves performance
as follows:
Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.
Before: 16,477,619 pps
After: 17,085,793 pps
3.7% improvement
Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-08-21 05:41:41 -06:00
|
|
|
u16 pi, contig_wqebbs_room;
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
anyway fallback to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme gives the possibility to use a page-cache
of a fair size, that doesn't inflate the memory footprint, which will
dramatically fix the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
|
|
|
int err;
|
2016-04-20 13:02:15 -06:00
|
|
|
int i;
|
|
|
|
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
if (rq->umem &&
|
|
|
|
unlikely(!mlx5e_xsk_pages_enough_umem(rq, MLX5_MPWRQ_PAGES_PER_WQE))) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
2018-03-21 08:31:08 -06:00
|
|
|
pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
|
net/mlx5: WQ, fixes for fragmented WQ buffers API
mlx5e netdevice used to calculate fragment edges by a call to
mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
indication for queues smaller than a PAGE_SIZE, (broken by default on
PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new
calls/API.
Since (TX/RX) Work Queues buffers are fragmented, here we introduce
changes to the API in core driver, so that it gets a stride index and
returns the index of last stride on same fragment, and an additional
wrapping function that returns the number of physically contiguous
strides that can be written contiguously to the work queue.
This obsoletes the following API functions, and their buggy
usage in EN driver:
* mlx5_wq_cyc_get_frag_size()
* mlx5_wq_cyc_ctr2fragix()
The new API improves modularity and hides the details of such
calculation for mlx5e netdevice and mlx5_ib rdma drivers.
New calculation is also more efficient, and improves performance
as follows:
Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.
Before: 16,477,619 pps
After: 17,085,793 pps
3.7% improvement
Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-08-21 05:41:41 -06:00
|
|
|
contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
|
|
|
|
if (unlikely(contig_wqebbs_room < MLX5E_UMR_WQEBBS)) {
|
|
|
|
mlx5e_fill_icosq_frag_edge(sq, wq, pi, contig_wqebbs_room);
|
2018-03-21 08:31:08 -06:00
|
|
|
pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
|
2017-07-10 03:52:36 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi);
|
2019-06-26 08:35:31 -06:00
|
|
|
memcpy(umr_wqe, &rq->mpwqe.umr_wqe, offsetof(struct mlx5e_umr_wqe, inline_mtts));
|
2018-01-04 04:09:15 -07:00
|
|
|
|
2017-06-25 07:28:46 -06:00
|
|
|
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
err = mlx5e_page_alloc(rq, dma_info);
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
anyway fallback to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme gives the possibility to use a page-cache
of a fair size, that doesn't inflate the memory footprint, which will
dramatically fix the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
|
|
|
if (unlikely(err))
|
2016-04-20 13:02:15 -06:00
|
|
|
goto err_unmap;
|
2017-07-10 03:52:36 -06:00
|
|
|
umr_wqe->inline_mtts[i].ptag = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
2018-02-07 05:46:36 -07:00
|
|
|
bitmap_zero(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE);
|
2016-04-20 13:02:15 -06:00
|
|
|
wi->consumed_strides = 0;
|
|
|
|
|
2017-07-10 03:52:36 -06:00
|
|
|
umr_wqe->ctrl.opmod_idx_opcode =
|
2017-07-18 03:07:06 -06:00
|
|
|
cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
|
|
|
|
MLX5_OPCODE_UMR);
|
2017-12-20 02:56:35 -07:00
|
|
|
umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset);
|
2017-07-18 03:07:06 -06:00
|
|
|
|
|
|
|
sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
|
2019-06-26 08:35:31 -06:00
|
|
|
sq->db.ico_wqe[pi].umr.rq = rq;
|
2017-07-10 03:52:36 -06:00
|
|
|
sq->pc += MLX5E_UMR_WQEBBS;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
|
|
|
|
sq->doorbell_cseg = &umr_wqe->ctrl;
|
2017-07-18 03:07:06 -06:00
|
|
|
|
2016-04-20 13:02:15 -06:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
err_unmap:
|
|
|
|
while (--i >= 0) {
|
2017-06-25 07:28:46 -06:00
|
|
|
dma_info--;
|
net/mlx5e: Implement RX mapped page cache for page recycle
Instead of reallocating and mapping pages for RX data-path,
recycle already used pages in a per ring cache.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache
1% gain
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache
3.7% gain
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache
5.4% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:38 -06:00
|
|
|
mlx5e_page_release(rq, dma_info, true);
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
|
|
|
|
err:
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->buff_alloc_err++;
|
2016-04-20 13:02:15 -06:00
|
|
|
|
net/mlx5e: Single flow order-0 pages for Striding RQ
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
anyway fallback to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme gives the possibility to use a page-cache
of a fair size, that doesn't inflate the memory footprint, which will
dramatically fix the reduction and even give a performance gain.
Performance tests:
The following results were measured on a freshly booted system,
giving optimal baseline performance, as high-order pages are yet to
be fragmented and depleted.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 07:08:36 -06:00
|
|
|
return err;
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
2016-06-30 08:34:46 -06:00
|
|
|
void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
|
|
|
|
{
|
2016-09-21 03:19:43 -06:00
|
|
|
struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
|
2018-06-12 01:09:24 -06:00
|
|
|
/* Don't recycle, this function is called on rq/netdev close */
|
|
|
|
mlx5e_free_rx_mpwqe(rq, wi, false);
|
2016-06-30 08:34:46 -06:00
|
|
|
}
|
|
|
|
|
2015-05-28 13:28:46 -06:00
|
|
|
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
|
|
|
|
{
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
u8 wqe_bulk;
|
2017-06-19 09:11:30 -06:00
|
|
|
int err;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2018-01-23 03:23:26 -07:00
|
|
|
if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
|
2015-05-28 13:28:46 -06:00
|
|
|
return false;
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
wqe_bulk = rq->wqe.info.wqe_bulk;
|
|
|
|
|
|
|
|
if (mlx5_wq_cyc_missing(wq) < wqe_bulk)
|
2017-06-19 09:11:30 -06:00
|
|
|
return false;
|
|
|
|
|
|
|
|
do {
|
2018-04-02 08:31:31 -06:00
|
|
|
u16 head = mlx5_wq_cyc_get_head(wq);
|
2015-05-28 13:28:46 -06:00
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
err = mlx5e_alloc_rx_wqes(rq, head, wqe_bulk);
|
2016-04-20 13:02:19 -06:00
|
|
|
if (unlikely(err)) {
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->buff_alloc_err++;
|
2015-05-28 13:28:46 -06:00
|
|
|
break;
|
2016-04-20 13:02:19 -06:00
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
mlx5_wq_cyc_push_n(wq, wqe_bulk);
|
|
|
|
} while (mlx5_wq_cyc_missing(wq) >= wqe_bulk);
|
2015-05-28 13:28:46 -06:00
|
|
|
|
|
|
|
/* ensure wqes are visible to device before updating doorbell record */
|
|
|
|
dma_wmb();
|
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
mlx5_wq_cyc_update_db_record(wq);
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2017-06-19 09:11:30 -06:00
|
|
|
return !!err;
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
|
|
|
|
2019-06-26 08:35:31 -06:00
|
|
|
void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
|
2017-07-17 03:27:26 -06:00
|
|
|
{
|
|
|
|
struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq);
|
|
|
|
struct mlx5_cqe64 *cqe;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
u16 sqcc;
|
|
|
|
int i;
|
2017-07-17 03:27:26 -06:00
|
|
|
|
2018-01-23 03:23:26 -07:00
|
|
|
if (unlikely(!test_bit(MLX5E_SQ_STATE_ENABLED, &sq->state)))
|
2017-07-17 03:27:26 -06:00
|
|
|
return;
|
|
|
|
|
|
|
|
cqe = mlx5_cqwq_get_cqe(&cq->wq);
|
|
|
|
if (likely(!cqe))
|
|
|
|
return;
|
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
|
|
|
|
* otherwise a cq overrun may occur
|
|
|
|
*/
|
|
|
|
sqcc = sq->cc;
|
|
|
|
|
|
|
|
i = 0;
|
|
|
|
do {
|
|
|
|
u16 wqe_counter;
|
|
|
|
bool last_wqe;
|
|
|
|
|
|
|
|
mlx5_cqwq_pop(&cq->wq);
|
|
|
|
|
|
|
|
wqe_counter = be16_to_cpu(cqe->wqe_counter);
|
|
|
|
|
|
|
|
if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
|
|
|
|
netdev_WARN_ONCE(cq->channel->netdev,
|
|
|
|
"Bad OP in ICOSQ CQE: 0x%x\n", get_cqe_opcode(cqe));
|
2019-06-25 08:44:28 -06:00
|
|
|
if (!test_and_set_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state))
|
|
|
|
queue_work(cq->channel->priv->wq, &sq->recover_work);
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
do {
|
|
|
|
struct mlx5e_sq_wqe_info *wi;
|
|
|
|
u16 ci;
|
|
|
|
|
|
|
|
last_wqe = (sqcc == wqe_counter);
|
|
|
|
|
|
|
|
ci = mlx5_wq_cyc_ctr2ix(&sq->wq, sqcc);
|
|
|
|
wi = &sq->db.ico_wqe[ci];
|
|
|
|
|
|
|
|
if (likely(wi->opcode == MLX5_OPCODE_UMR)) {
|
|
|
|
sqcc += MLX5E_UMR_WQEBBS;
|
2019-06-26 08:35:31 -06:00
|
|
|
wi->umr.rq->mpwqe.umr_completed++;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
} else if (likely(wi->opcode == MLX5_OPCODE_NOP)) {
|
|
|
|
sqcc++;
|
|
|
|
} else {
|
|
|
|
netdev_WARN_ONCE(cq->channel->netdev,
|
|
|
|
"Bad OPCODE in ICOSQ WQE info: 0x%x\n",
|
|
|
|
wi->opcode);
|
|
|
|
}
|
|
|
|
|
|
|
|
} while (!last_wqe);
|
|
|
|
|
|
|
|
} while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq)));
|
|
|
|
|
|
|
|
sq->cc = sqcc;
|
2017-07-17 03:27:26 -06:00
|
|
|
|
|
|
|
mlx5_cqwq_update_db_record(&cq->wq);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
|
|
|
|
{
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
struct mlx5e_icosq *sq = &rq->channel->icosq;
|
2018-04-02 08:23:14 -06:00
|
|
|
struct mlx5_wq_ll *wq = &rq->mpwqe.wq;
|
2019-06-26 08:35:31 -06:00
|
|
|
u8 umr_completed = rq->mpwqe.umr_completed;
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
int alloc_err = 0;
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
u8 missing, i;
|
|
|
|
u16 head;
|
2017-07-17 03:27:26 -06:00
|
|
|
|
2018-01-23 03:23:26 -07:00
|
|
|
if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
|
2017-07-17 03:27:26 -06:00
|
|
|
return false;
|
|
|
|
|
2019-06-26 08:35:31 -06:00
|
|
|
if (umr_completed) {
|
|
|
|
mlx5e_post_rx_mpwqe(rq, umr_completed);
|
|
|
|
rq->mpwqe.umr_in_progress -= umr_completed;
|
|
|
|
rq->mpwqe.umr_completed = 0;
|
|
|
|
}
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
|
|
|
|
missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress;
|
2017-07-17 03:27:26 -06:00
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
if (unlikely(rq->mpwqe.umr_in_progress > rq->mpwqe.umr_last_bulk))
|
|
|
|
rq->stats->congst_umr++;
|
|
|
|
|
|
|
|
#define UMR_WQE_BULK (2)
|
|
|
|
if (likely(missing < UMR_WQE_BULK))
|
2017-07-17 03:27:26 -06:00
|
|
|
return false;
|
|
|
|
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
head = rq->mpwqe.actual_wq_head;
|
|
|
|
i = missing;
|
|
|
|
do {
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
alloc_err = mlx5e_alloc_rx_mpwqe(rq, head);
|
|
|
|
|
|
|
|
if (unlikely(alloc_err))
|
net/mlx5e: RX, Support multiple outstanding UMR posts
The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.
A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.
When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.
In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.
For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.
This is expected to improve packet rate in high CPU scale.
Performance test:
As expected, huge improvement in large-scale (48 cores).
xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Before: Unstable, 7 to 30 Mpps
After: Stable, at 70.5 Mpps
No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-02-27 03:06:08 -07:00
|
|
|
break;
|
|
|
|
head = mlx5_wq_ll_get_wqe_next_ix(wq, head);
|
|
|
|
} while (--i);
|
|
|
|
|
|
|
|
rq->mpwqe.umr_last_bulk = missing - i;
|
|
|
|
if (sq->doorbell_cseg) {
|
|
|
|
mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, sq->doorbell_cseg);
|
|
|
|
sq->doorbell_cseg = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
rq->mpwqe.umr_in_progress += rq->mpwqe.umr_last_bulk;
|
|
|
|
rq->mpwqe.actual_wq_head = head;
|
2017-07-17 03:27:26 -06:00
|
|
|
|
2019-08-14 01:27:23 -06:00
|
|
|
/* If XSK Fill Ring doesn't have enough frames, report the error, so
|
|
|
|
* that one of the actions can be performed:
|
|
|
|
* 1. If need_wakeup is used, signal that the application has to kick
|
|
|
|
* the driver when it refills the Fill Ring.
|
|
|
|
* 2. Otherwise, busy poll by rescheduling the NAPI poll.
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
*/
|
|
|
|
if (unlikely(alloc_err == -ENOMEM && rq->umem))
|
|
|
|
return true;
|
|
|
|
|
2018-01-07 05:15:02 -07:00
|
|
|
return false;
|
2017-07-17 03:27:26 -06:00
|
|
|
}
|
|
|
|
|
2017-12-19 23:48:24 -07:00
|
|
|
static void mlx5e_lro_update_tcp_hdr(struct mlx5_cqe64 *cqe, struct tcphdr *tcp)
|
|
|
|
{
|
|
|
|
u8 l4_hdr_type = get_cqe_l4_hdr_type(cqe);
|
|
|
|
u8 tcp_ack = (l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA) ||
|
|
|
|
(l4_hdr_type == CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA);
|
|
|
|
|
|
|
|
tcp->check = 0;
|
|
|
|
tcp->psh = get_cqe_lro_tcppsh(cqe);
|
|
|
|
|
|
|
|
if (tcp_ack) {
|
|
|
|
tcp->ack = 1;
|
|
|
|
tcp->ack_seq = cqe->lro_ack_seq_num;
|
|
|
|
tcp->window = cqe->lro_tcp_win;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
|
|
|
|
u32 cqe_bcnt)
|
2015-05-28 13:28:46 -06:00
|
|
|
{
|
2016-09-07 10:08:01 -06:00
|
|
|
struct ethhdr *eth = (struct ethhdr *)(skb->data);
|
2015-05-28 13:28:46 -06:00
|
|
|
struct tcphdr *tcp;
|
2016-09-07 10:08:01 -06:00
|
|
|
int network_depth = 0;
|
2017-12-19 23:48:24 -07:00
|
|
|
__wsum check;
|
2016-09-07 10:08:01 -06:00
|
|
|
__be16 proto;
|
|
|
|
u16 tot_len;
|
2017-06-05 02:17:20 -06:00
|
|
|
void *ip_p;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-07 10:08:01 -06:00
|
|
|
proto = __vlan_get_protocol(skb, eth->h_proto, &network_depth);
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2016-09-07 10:08:01 -06:00
|
|
|
tot_len = cqe_bcnt - network_depth;
|
2017-06-05 02:17:20 -06:00
|
|
|
ip_p = skb->data + network_depth;
|
2016-09-07 10:08:01 -06:00
|
|
|
|
|
|
|
if (proto == htons(ETH_P_IP)) {
|
2017-06-05 02:17:20 -06:00
|
|
|
struct iphdr *ipv4 = ip_p;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2017-06-05 02:17:20 -06:00
|
|
|
tcp = ip_p + sizeof(struct iphdr);
|
|
|
|
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
|
|
|
ipv4->ttl = cqe->lro_min_ttl;
|
|
|
|
ipv4->tot_len = cpu_to_be16(tot_len);
|
|
|
|
ipv4->check = 0;
|
|
|
|
ipv4->check = ip_fast_csum((unsigned char *)ipv4,
|
|
|
|
ipv4->ihl);
|
2017-12-19 23:48:24 -07:00
|
|
|
|
|
|
|
mlx5e_lro_update_tcp_hdr(cqe, tcp);
|
|
|
|
check = csum_partial(tcp, tcp->doff * 4,
|
|
|
|
csum_unfold((__force __sum16)cqe->check_sum));
|
|
|
|
/* Almost done, don't forget the pseudo header */
|
|
|
|
tcp->check = csum_tcpudp_magic(ipv4->saddr, ipv4->daddr,
|
|
|
|
tot_len - sizeof(struct iphdr),
|
|
|
|
IPPROTO_TCP, check);
|
2015-05-28 13:28:46 -06:00
|
|
|
} else {
|
2017-12-19 23:48:24 -07:00
|
|
|
u16 payload_len = tot_len - sizeof(struct ipv6hdr);
|
2017-06-05 02:17:20 -06:00
|
|
|
struct ipv6hdr *ipv6 = ip_p;
|
|
|
|
|
|
|
|
tcp = ip_p + sizeof(struct ipv6hdr);
|
|
|
|
skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
|
|
|
|
|
2015-05-28 13:28:46 -06:00
|
|
|
ipv6->hop_limit = cqe->lro_min_ttl;
|
2017-12-19 23:48:24 -07:00
|
|
|
ipv6->payload_len = cpu_to_be16(payload_len);
|
|
|
|
|
|
|
|
mlx5e_lro_update_tcp_hdr(cqe, tcp);
|
|
|
|
check = csum_partial(tcp, tcp->doff * 4,
|
|
|
|
csum_unfold((__force __sum16)cqe->check_sum));
|
|
|
|
/* Almost done, don't forget the pseudo header */
|
|
|
|
tcp->check = csum_ipv6_magic(&ipv6->saddr, &ipv6->daddr, payload_len,
|
|
|
|
IPPROTO_TCP, check);
|
2017-06-05 02:17:20 -06:00
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_skb_set_hash(struct mlx5_cqe64 *cqe,
|
|
|
|
struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
u8 cht = cqe->rss_hash_type;
|
|
|
|
int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
|
|
|
|
(cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
|
|
|
|
PKT_HASH_TYPE_NONE;
|
|
|
|
skb_set_hash(skb, be32_to_cpu(cqe->rss_hash_result), ht);
|
|
|
|
}
|
|
|
|
|
2018-03-20 09:15:31 -06:00
|
|
|
static inline bool is_last_ethertype_ip(struct sk_buff *skb, int *network_depth,
|
|
|
|
__be16 *proto)
|
2015-08-16 07:04:52 -06:00
|
|
|
{
|
2018-03-20 09:15:31 -06:00
|
|
|
*proto = ((struct ethhdr *)skb->data)->h_proto;
|
|
|
|
*proto = __vlan_get_protocol(skb, *proto, network_depth);
|
2019-03-25 23:10:59 -06:00
|
|
|
|
|
|
|
if (*proto == htons(ETH_P_IP))
|
|
|
|
return pskb_may_pull(skb, *network_depth + sizeof(struct iphdr));
|
|
|
|
|
|
|
|
if (*proto == htons(ETH_P_IPV6))
|
|
|
|
return pskb_may_pull(skb, *network_depth + sizeof(struct ipv6hdr));
|
|
|
|
|
|
|
|
return false;
|
2018-03-20 09:15:31 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
int network_depth = 0;
|
|
|
|
__be16 proto;
|
|
|
|
void *ip;
|
|
|
|
int rc;
|
2015-08-16 07:04:52 -06:00
|
|
|
|
2018-03-20 09:15:31 -06:00
|
|
|
if (unlikely(!is_last_ethertype_ip(skb, &network_depth, &proto)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
ip = skb->data + network_depth;
|
|
|
|
rc = ((proto == htons(ETH_P_IP)) ? IP_ECN_set_ce((struct iphdr *)ip) :
|
|
|
|
IP6_ECN_set_ce(skb, (struct ipv6hdr *)ip));
|
|
|
|
|
|
|
|
rq->stats->ecn_mark += !!rc;
|
2015-08-16 07:04:52 -06:00
|
|
|
}
|
|
|
|
|
2018-11-28 16:04:05 -07:00
|
|
|
static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
|
2018-07-10 04:34:06 -06:00
|
|
|
{
|
2018-11-28 16:04:05 -07:00
|
|
|
void *ip_p = skb->data + network_depth;
|
2018-07-10 04:34:06 -06:00
|
|
|
|
|
|
|
return (proto == htons(ETH_P_IP)) ? ((struct iphdr *)ip_p)->protocol :
|
|
|
|
((struct ipv6hdr *)ip_p)->nexthdr;
|
|
|
|
}
|
|
|
|
|
2018-12-03 23:14:04 -07:00
|
|
|
#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
|
|
|
|
|
2019-03-12 01:24:52 -06:00
|
|
|
#define MAX_PADDING 8
|
|
|
|
|
|
|
|
static void
|
|
|
|
tail_padding_csum_slow(struct sk_buff *skb, int offset, int len,
|
|
|
|
struct mlx5e_rq_stats *stats)
|
|
|
|
{
|
|
|
|
stats->csum_complete_tail_slow++;
|
|
|
|
skb->csum = csum_block_add(skb->csum,
|
|
|
|
skb_checksum(skb, offset, len, 0),
|
|
|
|
offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
tail_padding_csum(struct sk_buff *skb, int offset,
|
|
|
|
struct mlx5e_rq_stats *stats)
|
|
|
|
{
|
|
|
|
u8 tail_padding[MAX_PADDING];
|
|
|
|
int len = skb->len - offset;
|
|
|
|
void *tail;
|
|
|
|
|
|
|
|
if (unlikely(len > MAX_PADDING)) {
|
|
|
|
tail_padding_csum_slow(skb, offset, len, stats);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
tail = skb_header_pointer(skb, offset, len, tail_padding);
|
|
|
|
if (unlikely(!tail)) {
|
|
|
|
tail_padding_csum_slow(skb, offset, len, stats);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
stats->csum_complete_tail++;
|
|
|
|
skb->csum = csum_block_add(skb->csum, csum_partial(tail, len, 0), offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2019-05-03 16:12:46 -06:00
|
|
|
mlx5e_skb_csum_fixup(struct sk_buff *skb, int network_depth, __be16 proto,
|
|
|
|
struct mlx5e_rq_stats *stats)
|
2019-03-12 01:24:52 -06:00
|
|
|
{
|
|
|
|
struct ipv6hdr *ip6;
|
|
|
|
struct iphdr *ip4;
|
|
|
|
int pkt_len;
|
|
|
|
|
2019-05-03 16:12:46 -06:00
|
|
|
/* Fixup vlan headers, if any */
|
|
|
|
if (network_depth > ETH_HLEN)
|
|
|
|
/* CQE csum is calculated from the IP header and does
|
|
|
|
* not cover VLAN headers (if present). This will add
|
|
|
|
* the checksum manually.
|
|
|
|
*/
|
|
|
|
skb->csum = csum_partial(skb->data + ETH_HLEN,
|
|
|
|
network_depth - ETH_HLEN,
|
|
|
|
skb->csum);
|
|
|
|
|
|
|
|
/* Fixup tail padding, if any */
|
2019-03-12 01:24:52 -06:00
|
|
|
switch (proto) {
|
|
|
|
case htons(ETH_P_IP):
|
|
|
|
ip4 = (struct iphdr *)(skb->data + network_depth);
|
|
|
|
pkt_len = network_depth + ntohs(ip4->tot_len);
|
|
|
|
break;
|
|
|
|
case htons(ETH_P_IPV6):
|
|
|
|
ip6 = (struct ipv6hdr *)(skb->data + network_depth);
|
|
|
|
pkt_len = network_depth + sizeof(*ip6) + ntohs(ip6->payload_len);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (likely(pkt_len >= skb->len))
|
|
|
|
return;
|
|
|
|
|
|
|
|
tail_padding_csum(skb, pkt_len, stats);
|
|
|
|
}
|
|
|
|
|
2015-08-16 07:04:52 -06:00
|
|
|
static inline void mlx5e_handle_csum(struct net_device *netdev,
|
|
|
|
struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_rq *rq,
|
2016-02-22 09:17:30 -07:00
|
|
|
struct sk_buff *skb,
|
|
|
|
bool lro)
|
2015-08-16 07:04:52 -06:00
|
|
|
{
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets
When the VLAN tag is present in the packet buffer (i.e VLAN stripping disabled, QinQ)
the driver will currently report CHECKSUM_UNNECESSARY.
Instead of using CHECKSUM_COMPLETE offload for packets with first
ethertype of IPv4/6, use it for packets with last ethertype of IPv4/6 to
cover the former cases as well.
The checksum field present in the CQE is calculated from the IP header
until the end of the packet. When the first ethertype is different than
IPv4/6 (for ex. 802.1Q VLAN) a checksum of the VLAN header/s should be
added. The small header/s checksum calculation will allow us to use
CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY.
Testing bandwidth of one and 8 TCP streams to a single RQ,
LRO and VLAN stripping offloads disabled:
CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Before:
+--------------+--------------------+---------------------+----------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+----------------------+
| Untagged | 28,247.35 | 24,716.88 | CHECKSUM_COMPLETE |
| VLAN | 27,516.69 | 23,752.26 | CHECKSUM_UNNECESSARY |
| QinQ | 6,961.30 | 20,667.04 | CHECKSUM_UNNECESSARY |
+--------------+--------------------+---------------------+----------------------+
Now:
+--------------+--------------------+---------------------+-------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+-------------------+
| Untagged | 28,521.28 | 24,926.32 | CHECKSUM_COMPLETE |
| VLAN | 27,389.37 | 23,715.34 | CHECKSUM_COMPLETE |
| QinQ | 6,901.77 | 20,845.73 | CHECKSUM_COMPLETE |
+--------------+--------------------+---------------------+-------------------+
No performance degradation observed.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-30 06:12:45 -06:00
|
|
|
int network_depth = 0;
|
2018-03-20 09:15:31 -06:00
|
|
|
__be16 proto;
|
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets
When the VLAN tag is present in the packet buffer (i.e VLAN stripping disabled, QinQ)
the driver will currently report CHECKSUM_UNNECESSARY.
Instead of using CHECKSUM_COMPLETE offload for packets with first
ethertype of IPv4/6, use it for packets with last ethertype of IPv4/6 to
cover the former cases as well.
The checksum field present in the CQE is calculated from the IP header
until the end of the packet. When the first ethertype is different than
IPv4/6 (for ex. 802.1Q VLAN) a checksum of the VLAN header/s should be
added. The small header/s checksum calculation will allow us to use
CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY.
Testing bandwidth of one and 8 TCP streams to a single RQ,
LRO and VLAN stripping offloads disabled:
CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Before:
+--------------+--------------------+---------------------+----------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+----------------------+
| Untagged | 28,247.35 | 24,716.88 | CHECKSUM_COMPLETE |
| VLAN | 27,516.69 | 23,752.26 | CHECKSUM_UNNECESSARY |
| QinQ | 6,961.30 | 20,667.04 | CHECKSUM_UNNECESSARY |
+--------------+--------------------+---------------------+----------------------+
Now:
+--------------+--------------------+---------------------+-------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+-------------------+
| Untagged | 28,521.28 | 24,926.32 | CHECKSUM_COMPLETE |
| VLAN | 27,389.37 | 23,715.34 | CHECKSUM_COMPLETE |
| QinQ | 6,901.77 | 20,845.73 | CHECKSUM_COMPLETE |
+--------------+--------------------+---------------------+-------------------+
No performance degradation observed.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-30 06:12:45 -06:00
|
|
|
|
2015-08-16 07:04:52 -06:00
|
|
|
if (unlikely(!(netdev->features & NETIF_F_RXCSUM)))
|
|
|
|
goto csum_none;
|
|
|
|
|
2016-02-22 09:17:30 -07:00
|
|
|
if (lro) {
|
2015-08-16 07:04:52 -06:00
|
|
|
skb->ip_summed = CHECKSUM_UNNECESSARY;
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->csum_unnecessary++;
|
2016-04-24 13:51:56 -06:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2019-03-21 20:07:20 -06:00
|
|
|
/* True when explicitly set via priv flag, or XDP prog is loaded */
|
|
|
|
if (test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state))
|
2018-07-01 02:58:38 -06:00
|
|
|
goto csum_unnecessary;
|
|
|
|
|
2018-12-03 23:14:04 -07:00
|
|
|
/* CQE csum doesn't cover padding octets in short ethernet
|
|
|
|
* frames. And the pad field is appended prior to calculating
|
|
|
|
* and appending the FCS field.
|
|
|
|
*
|
|
|
|
* Detecting these padded frames requires to verify and parse
|
|
|
|
* IP headers, so we simply force all those small frames to be
|
|
|
|
* CHECKSUM_UNNECESSARY even if they are not padded.
|
|
|
|
*/
|
|
|
|
if (short_frame(skb->len))
|
|
|
|
goto csum_unnecessary;
|
|
|
|
|
2018-03-20 09:15:31 -06:00
|
|
|
if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
|
2018-11-28 16:04:05 -07:00
|
|
|
if (unlikely(get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP))
|
2018-07-10 04:34:06 -06:00
|
|
|
goto csum_unnecessary;
|
|
|
|
|
2019-05-03 14:14:59 -06:00
|
|
|
stats->csum_complete++;
|
2015-08-16 07:04:52 -06:00
|
|
|
skb->ip_summed = CHECKSUM_COMPLETE;
|
2015-08-17 22:22:26 -06:00
|
|
|
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
|
2019-05-03 14:14:59 -06:00
|
|
|
|
|
|
|
if (test_bit(MLX5E_RQ_STATE_CSUM_FULL, &rq->state))
|
|
|
|
return; /* CQE csum covers all received bytes */
|
|
|
|
|
|
|
|
/* csum might need some fixups ...*/
|
2019-05-03 16:12:46 -06:00
|
|
|
mlx5e_skb_csum_fixup(skb, network_depth, proto, stats);
|
2016-04-24 13:51:56 -06:00
|
|
|
return;
|
2015-08-16 07:04:52 -06:00
|
|
|
}
|
|
|
|
|
2018-07-10 04:34:06 -06:00
|
|
|
csum_unnecessary:
|
2016-04-24 13:51:56 -06:00
|
|
|
if (likely((cqe->hds_ip_ext & CQE_L3_OK) &&
|
2019-03-31 06:53:03 -06:00
|
|
|
(cqe->hds_ip_ext & CQE_L4_OK))) {
|
2016-04-24 13:51:56 -06:00
|
|
|
skb->ip_summed = CHECKSUM_UNNECESSARY;
|
|
|
|
if (cqe_is_tunneled(cqe)) {
|
|
|
|
skb->csum_level = 1;
|
|
|
|
skb->encapsulation = 1;
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->csum_unnecessary_inner++;
|
2017-09-13 06:37:50 -06:00
|
|
|
return;
|
2016-04-24 13:51:56 -06:00
|
|
|
}
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->csum_unnecessary++;
|
2016-04-24 13:51:56 -06:00
|
|
|
return;
|
|
|
|
}
|
2015-08-16 07:04:52 -06:00
|
|
|
csum_none:
|
|
|
|
skb->ip_summed = CHECKSUM_NONE;
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->csum_none++;
|
2015-08-16 07:04:52 -06:00
|
|
|
}
|
|
|
|
|
2018-03-20 09:15:31 -06:00
|
|
|
#define MLX5E_CE_BIT_MASK 0x80
|
|
|
|
|
2015-05-28 13:28:46 -06:00
|
|
|
static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
u32 cqe_bcnt,
|
2015-05-28 13:28:46 -06:00
|
|
|
struct mlx5e_rq *rq,
|
|
|
|
struct sk_buff *skb)
|
|
|
|
{
|
2017-11-23 05:29:25 -07:00
|
|
|
u8 lro_num_seg = be32_to_cpu(cqe->srqn) >> 24;
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
2015-05-28 13:28:46 -06:00
|
|
|
struct net_device *netdev = rq->netdev;
|
|
|
|
|
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets
When the VLAN tag is present in the packet buffer (i.e VLAN stripping disabled, QinQ)
the driver will currently report CHECKSUM_UNNECESSARY.
Instead of using CHECKSUM_COMPLETE offload for packets with first
ethertype of IPv4/6, use it for packets with last ethertype of IPv4/6 to
cover the former cases as well.
The checksum field present in the CQE is calculated from the IP header
until the end of the packet. When the first ethertype is different than
IPv4/6 (for ex. 802.1Q VLAN) a checksum of the VLAN header/s should be
added. The small header/s checksum calculation will allow us to use
CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY.
Testing bandwidth of one and 8 TCP streams to a single RQ,
LRO and VLAN stripping offloads disabled:
CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Before:
+--------------+--------------------+---------------------+----------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+----------------------+
| Untagged | 28,247.35 | 24,716.88 | CHECKSUM_COMPLETE |
| VLAN | 27,516.69 | 23,752.26 | CHECKSUM_UNNECESSARY |
| QinQ | 6,961.30 | 20,667.04 | CHECKSUM_UNNECESSARY |
+--------------+--------------------+---------------------+----------------------+
Now:
+--------------+--------------------+---------------------+-------------------+
| Traffic type | 1 Stream BW [Mbps] | 8 Streams BW [Mbps] | Checksum offload |
+--------------+--------------------+---------------------+-------------------+
| Untagged | 28,521.28 | 24,926.32 | CHECKSUM_COMPLETE |
| VLAN | 27,389.37 | 23,715.34 | CHECKSUM_COMPLETE |
| QinQ | 6,901.77 | 20,845.73 | CHECKSUM_COMPLETE |
+--------------+--------------------+---------------------+-------------------+
No performance degradation observed.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-30 06:12:45 -06:00
|
|
|
skb->mac_len = ETH_HLEN;
|
2018-07-13 05:33:48 -06:00
|
|
|
|
|
|
|
#ifdef CONFIG_MLX5_EN_TLS
|
|
|
|
mlx5e_tls_handle_rx_skb(netdev, skb, &cqe_bcnt);
|
|
|
|
#endif
|
|
|
|
|
2015-05-28 13:28:46 -06:00
|
|
|
if (lro_num_seg > 1) {
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
|
2015-08-16 07:04:49 -06:00
|
|
|
skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
|
2017-03-21 07:59:19 -06:00
|
|
|
/* Subtract one since we already counted this as one
|
|
|
|
* "regular" packet in mlx5e_complete_rx_cqe()
|
|
|
|
*/
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->packets += lro_num_seg - 1;
|
|
|
|
stats->lro_packets++;
|
|
|
|
stats->lro_bytes += cqe_bcnt;
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
|
|
|
|
2017-08-15 04:46:04 -06:00
|
|
|
if (unlikely(mlx5e_rx_hw_stamp(rq->tstamp)))
|
|
|
|
skb_hwtstamps(skb)->hwtstamp =
|
|
|
|
mlx5_timecounter_cyc2time(rq->clock, get_cqe_ts(cqe));
|
2015-12-29 05:58:31 -07:00
|
|
|
|
2015-05-28 13:28:46 -06:00
|
|
|
skb_record_rx_queue(skb, rq->ix);
|
|
|
|
|
|
|
|
if (likely(netdev->features & NETIF_F_RXHASH))
|
|
|
|
mlx5e_skb_set_hash(cqe, skb);
|
|
|
|
|
2017-09-10 04:49:59 -06:00
|
|
|
if (cqe_has_vlan(cqe)) {
|
2015-05-28 13:28:46 -06:00
|
|
|
__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
|
|
|
|
be16_to_cpu(cqe->vlan_info));
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->removed_vlan_packets++;
|
2017-09-10 04:49:59 -06:00
|
|
|
}
|
2016-03-08 03:42:38 -07:00
|
|
|
|
|
|
|
skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
|
2016-04-20 13:02:18 -06:00
|
|
|
|
|
|
|
mlx5e_handle_csum(netdev, cqe, rq, skb, !!lro_num_seg);
|
2018-03-20 09:15:31 -06:00
|
|
|
/* checking CE bit in cqe - MSB in ml_path field */
|
|
|
|
if (unlikely(cqe->ml_path & MLX5E_CE_BIT_MASK))
|
|
|
|
mlx5e_enable_ecn(rq, skb);
|
|
|
|
|
2016-04-20 13:02:18 -06:00
|
|
|
skb->protocol = eth_type_trans(skb, netdev);
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5_cqe64 *cqe,
|
|
|
|
u32 cqe_bcnt,
|
|
|
|
struct sk_buff *skb)
|
|
|
|
{
|
2018-04-12 07:03:37 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
|
|
|
|
|
|
|
stats->packets++;
|
|
|
|
stats->bytes += cqe_bcnt;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
mlx5e_build_rx_skb(cqe, cqe_bcnt, rq, skb);
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
static inline
|
|
|
|
struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
|
|
|
|
u32 frag_size, u16 headroom,
|
|
|
|
u32 cqe_bcnt)
|
|
|
|
{
|
|
|
|
struct sk_buff *skb = build_skb(va, frag_size);
|
|
|
|
|
|
|
|
if (unlikely(!skb)) {
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->buff_alloc_err++;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
skb_reserve(skb, headroom);
|
|
|
|
skb_put(skb, cqe_bcnt);
|
|
|
|
|
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt)
|
2016-04-20 13:02:12 -06:00
|
|
|
{
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct mlx5e_dma_info *di = wi->di;
|
2017-02-13 09:41:30 -07:00
|
|
|
u16 rx_headroom = rq->buff.headroom;
|
2016-09-21 03:19:42 -06:00
|
|
|
struct sk_buff *skb;
|
2016-09-21 03:19:48 -06:00
|
|
|
void *va, *data;
|
bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.
Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.
mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.
Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:16:06 -07:00
|
|
|
bool consumed;
|
2017-01-18 05:28:53 -07:00
|
|
|
u32 frag_size;
|
2016-04-20 13:02:12 -06:00
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
va = page_address(di->page) + wi->offset;
|
2017-01-17 23:06:07 -07:00
|
|
|
data = va + rx_headroom;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
|
2016-04-20 13:02:12 -06:00
|
|
|
|
2017-11-27 01:35:12 -07:00
|
|
|
dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
|
|
|
|
frag_size, DMA_FROM_DEVICE);
|
xdp: transition into using xdp_frame for return API
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 08:46:32 -06:00
|
|
|
prefetchw(va); /* xdp_frame data area */
|
2016-09-21 03:19:48 -06:00
|
|
|
prefetch(data);
|
2016-04-20 13:02:12 -06:00
|
|
|
|
bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.
Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.
mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.
Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:16:06 -07:00
|
|
|
rcu_read_lock();
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, false);
|
bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.
Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.
mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.
Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 14:16:06 -07:00
|
|
|
rcu_read_unlock();
|
|
|
|
if (consumed)
|
2016-09-22 11:01:46 -06:00
|
|
|
return NULL; /* page/packet was consumed by XDP */
|
net/mlx5e: XDP fast RX drop bpf programs support
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change channels RQs type to
MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and request from user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we will update
each RQ xdp program on the fly and for synchronization with current
data path RX activity of that RQ, we temporally disable that RQ and
ensure RX path is not running, quickly update and re-enable that RQ,
for that we do:
- rq.state = disabled
- napi_synnchronize
- xchg(rq->xdp_prg)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
Packet rate performance testing was done with pktgen 64B packets and on
TX side and, TC drop action on RX side compared to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
RX Cores Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.3Mpps 5.3Mpps 16.5Mpps
2 10.2Mpps 10.2Mpps 31.3Mpps
4 20.5Mpps 19.9Mpps 36.3Mpps*
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that receive side can handle more.
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 03:19:46 -06:00
|
|
|
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt);
|
|
|
|
if (unlikely(!skb))
|
2016-09-22 11:01:46 -06:00
|
|
|
return NULL;
|
2016-09-21 03:19:42 -06:00
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
/* queue up for recycling/reuse */
|
2016-09-21 03:19:42 -06:00
|
|
|
page_ref_inc(di->page);
|
|
|
|
|
2016-09-22 11:01:46 -06:00
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
|
|
|
|
struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt)
|
|
|
|
{
|
|
|
|
struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
|
|
|
|
struct mlx5e_wqe_frag_info *head_wi = wi;
|
|
|
|
u16 headlen = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
|
|
|
|
u16 frag_headlen = headlen;
|
|
|
|
u16 byte_cnt = cqe_bcnt - headlen;
|
|
|
|
struct sk_buff *skb;
|
|
|
|
|
|
|
|
/* XDP is not supported in this configuration, as incoming packets
|
|
|
|
* might spread among multiple pages.
|
|
|
|
*/
|
|
|
|
skb = napi_alloc_skb(rq->cq.napi,
|
|
|
|
ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
|
|
|
|
if (unlikely(!skb)) {
|
|
|
|
rq->stats->buff_alloc_err++;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
prefetchw(skb->data);
|
|
|
|
|
|
|
|
while (byte_cnt) {
|
|
|
|
u16 frag_consumed_bytes =
|
|
|
|
min_t(u16, frag_info->frag_size - frag_headlen, byte_cnt);
|
|
|
|
|
|
|
|
mlx5e_add_skb_frag(rq, skb, wi->di, wi->offset + frag_headlen,
|
|
|
|
frag_consumed_bytes, frag_info->frag_stride);
|
|
|
|
byte_cnt -= frag_consumed_bytes;
|
|
|
|
frag_headlen = 0;
|
|
|
|
frag_info++;
|
|
|
|
wi++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* copy header */
|
net/mlx5e: RX, Make sure packet header does not cross page boundary
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross page boundary. This requires special care in fast path
that costs LoC, additional runtime instructions and branches.
It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased form 64B to 256B.
Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.
Configuration:
Set Striding RQ and LRO ON (to enabled the non-linear SKB scheme).
GRO OFF, early drop by TC rule.
64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
branches tests.
192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.
256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-25 03:05:22 -06:00
|
|
|
mlx5e_copy_skb_header(rq->pdev, skb, head_wi->di, head_wi->offset, headlen);
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
/* skb linear part was allocated with headlen and aligned to long */
|
|
|
|
skb->tail += headlen;
|
|
|
|
skb->len += headlen;
|
|
|
|
|
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
2019-06-26 14:21:40 -06:00
|
|
|
static void trigger_report(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
|
|
|
struct mlx5_err_cqe *err_cqe = (struct mlx5_err_cqe *)cqe;
|
|
|
|
|
|
|
|
if (cqe_syndrome_needs_recover(err_cqe->syndrome) &&
|
|
|
|
!test_and_set_bit(MLX5E_RQ_STATE_RECOVERING, &rq->state))
|
|
|
|
queue_work(rq->channel->priv->wq, &rq->recover_work);
|
|
|
|
}
|
|
|
|
|
2016-09-22 11:01:46 -06:00
|
|
|
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
struct mlx5e_wqe_frag_info *wi;
|
2016-09-22 11:01:46 -06:00
|
|
|
struct sk_buff *skb;
|
|
|
|
u32 cqe_bcnt;
|
2018-04-02 08:31:31 -06:00
|
|
|
u16 ci;
|
2016-09-22 11:01:46 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
ci = mlx5_wq_cyc_ctr2ix(wq, be16_to_cpu(cqe->wqe_counter));
|
|
|
|
wi = get_frag(rq, ci);
|
|
|
|
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
|
2016-09-22 11:01:46 -06:00
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
|
2019-06-26 14:21:40 -06:00
|
|
|
trigger_report(rq, cqe);
|
2019-06-14 16:21:15 -06:00
|
|
|
rq->stats->wqe_err++;
|
|
|
|
goto free_wqe;
|
|
|
|
}
|
|
|
|
|
2019-06-12 04:18:35 -06:00
|
|
|
skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
|
|
|
|
mlx5e_skb_from_cqe_linear,
|
|
|
|
mlx5e_skb_from_cqe_nonlinear,
|
|
|
|
rq, cqe, wi, cqe_bcnt);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
if (!skb) {
|
|
|
|
/* probably for XDP */
|
2017-12-12 06:46:49 -07:00
|
|
|
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
/* do not return page to cache,
|
|
|
|
* it will be returned on XDP_TX completion.
|
|
|
|
*/
|
2018-04-02 08:31:31 -06:00
|
|
|
goto wq_cyc_pop;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
goto free_wqe;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
2016-09-22 11:01:46 -06:00
|
|
|
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
|
2016-09-22 11:01:46 -06:00
|
|
|
napi_gro_receive(rq->cq.napi, skb);
|
2016-04-20 13:02:12 -06:00
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
free_wqe:
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_wqe(rq, wi, true);
|
2018-04-02 08:31:31 -06:00
|
|
|
wq_cyc_pop:
|
|
|
|
mlx5_wq_cyc_pop(wq);
|
2016-04-20 13:02:12 -06:00
|
|
|
}
|
|
|
|
|
2017-06-05 06:17:12 -06:00
|
|
|
#ifdef CONFIG_MLX5_ESWITCH
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
|
|
|
struct net_device *netdev = rq->netdev;
|
|
|
|
struct mlx5e_priv *priv = netdev_priv(netdev);
|
2017-04-24 03:36:42 -06:00
|
|
|
struct mlx5e_rep_priv *rpriv = priv->ppriv;
|
|
|
|
struct mlx5_eswitch_rep *rep = rpriv->rep;
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
struct mlx5e_wqe_frag_info *wi;
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
struct sk_buff *skb;
|
|
|
|
u32 cqe_bcnt;
|
2018-04-02 08:31:31 -06:00
|
|
|
u16 ci;
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
ci = mlx5_wq_cyc_ctr2ix(wq, be16_to_cpu(cqe->wqe_counter));
|
|
|
|
wi = get_frag(rq, ci);
|
|
|
|
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
|
|
|
|
rq->stats->wqe_err++;
|
|
|
|
goto free_wqe;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
skb = rq->wqe.skb_from_cqe(rq, cqe, wi, cqe_bcnt);
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
if (!skb) {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
/* probably for XDP */
|
2017-12-12 06:46:49 -07:00
|
|
|
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
/* do not return page to cache,
|
|
|
|
* it will be returned on XDP_TX completion.
|
|
|
|
*/
|
2018-04-02 08:31:31 -06:00
|
|
|
goto wq_cyc_pop;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
goto free_wqe;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
}
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
|
|
|
|
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
|
|
|
|
|
|
|
|
if (rep->vlan && skb_vlan_tag_present(skb))
|
|
|
|
skb_vlan_pop(skb);
|
|
|
|
|
|
|
|
napi_gro_receive(rq->cq.napi, skb);
|
|
|
|
|
net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.
Whenever possible, prefer using a linear SKB, and build it
wrapping the WQE buffer.
Otherwise (for example, jumbo frames on x86), use non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
This implied to remove support of HW LRO in legacy RQ, as it would
require large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.
In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an oppurtunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info", and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
across different cycles of a WQ. This allows initializing
the mapping in the time of RQ creation, and not handle it
in datapath.
A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually good also for performance reasons,
hence we extend the bulk beyong the minimal requirement above.
With this memory scheme, the RQs memory footprint is reduce by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.
Performance tests:
ConnectX-4, single core, single RX ring, default MTU.
x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement
PowerPC:
CPU: POWER8 (raw), altivec supported
Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-02 09:23:58 -06:00
|
|
|
free_wqe:
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_wqe(rq, wi, true);
|
2018-04-02 08:31:31 -06:00
|
|
|
wq_cyc_pop:
|
|
|
|
mlx5_wq_cyc_pop(wq);
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
}
|
2017-06-05 06:17:12 -06:00
|
|
|
#endif
|
net/mlx5: E-Switch, Support VLAN actions in the offloads mode
Many virtualization systems use a policy under which a vlan tag is
pushed to packets sent by guests, and popped before the packet is
forwarded to the VM.
The current generation of the mlx5 HW doesn't fully support that on
a per flow level. As such, we are addressing the above common use
case with the SRIOV e-Switch abilities to push vlan into packets
sent by VFs and pop vlan from packets forwarded to VFs.
The HW can match on the correct vlan being present in packets
forwarded to VFs (eSwitch steering is done before stripping
the tag), so this part is offloaded as is.
A common practice for vlans is to avoid both push vlan and pop vlan
for inter-host VM/VM (east-west) communication because in this case,
push on egress cancels out with pop on ingress.
For supporting that, we use a global eswitch vlan pop policy, hence
allowing guest A to communicate with both remote VM B and local VM C.
This works since the HW pops the vlan only if it exists (e.g for
C --> A packets but not for B --> A packets).
On the slow path, when a VF vport has an offloaded flow which involves
pushing vlans, wheres another flow is not currently offloaded, the
packets from the 2nd flow seen by the VF representor on the host have
vlan. The VF rep driver removes such vlan before calling into the host
networking stack.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 11:01:47 -06:00
|
|
|
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
|
|
|
|
u16 cqe_bcnt, u32 head_offset, u32 page_idx)
|
2016-04-20 13:02:15 -06:00
|
|
|
{
|
2018-04-02 05:30:34 -06:00
|
|
|
u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
|
2016-04-20 13:02:15 -06:00
|
|
|
u32 frag_offset = head_offset + headlen;
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
u32 byte_cnt = cqe_bcnt - headlen;
|
|
|
|
struct mlx5e_dma_info *head_di = di;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
struct sk_buff *skb;
|
|
|
|
|
|
|
|
skb = napi_alloc_skb(rq->cq.napi,
|
2018-04-02 05:30:34 -06:00
|
|
|
ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
if (unlikely(!skb)) {
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->buff_alloc_err++;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
prefetchw(skb->data);
|
2016-04-20 13:02:15 -06:00
|
|
|
|
|
|
|
if (unlikely(frag_offset >= PAGE_SIZE)) {
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
di++;
|
2016-04-20 13:02:15 -06:00
|
|
|
frag_offset -= PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
while (byte_cnt) {
|
|
|
|
u32 pg_consumed_bytes =
|
|
|
|
min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
|
2018-04-02 05:30:34 -06:00
|
|
|
unsigned int truesize =
|
|
|
|
ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
|
2016-04-20 13:02:15 -06:00
|
|
|
|
2018-04-02 05:30:34 -06:00
|
|
|
mlx5e_add_skb_frag(rq, skb, di, frag_offset,
|
|
|
|
pg_consumed_bytes, truesize);
|
2016-04-20 13:02:15 -06:00
|
|
|
byte_cnt -= pg_consumed_bytes;
|
|
|
|
frag_offset = 0;
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
di++;
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
/* copy header */
|
net/mlx5e: RX, Make sure packet header does not cross page boundary
In the non-linear SKB memory scheme of Striding RQ, a packet header
could cross page boundary. This requires special care in fast path
that costs LoC, additional runtime instructions and branches.
It could happen when the header (up to 256B) does not fit in
a single stride. Avoid this by working with a stride size that fits
the maximum possible header. Stride is increased form 64B to 256B.
Performance:
Tested packet rate for UDP streams, single ring, on ConnectX-5.
Configuration:
Set Striding RQ and LRO ON (to enabled the non-linear SKB scheme).
GRO OFF, early drop by TC rule.
64B: 4x worse memory utilization, no page-crossers headers
- No degradation (5,887,305 pps).
- The reduction in memory utilization is compensated by the saving of
branches tests.
192B: 1.33x worse memory utilization, avoid page-crossers headers
- Before: 5,727,252. After: 5,777,037. ~1% gain.
256B: Same memory util, no page-crossers
- Before: 5,691,885. After: 5,748,007. ~1% gain.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-25 03:05:22 -06:00
|
|
|
mlx5e_copy_skb_header(rq->pdev, skb, head_di, head_offset, headlen);
|
2016-04-20 13:02:15 -06:00
|
|
|
/* skb linear part was allocated with headlen and aligned to long */
|
|
|
|
skb->tail += headlen;
|
|
|
|
skb->len += headlen;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
|
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct sk_buff *
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
|
|
|
|
u16 cqe_bcnt, u32 head_offset, u32 page_idx)
|
|
|
|
{
|
|
|
|
struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
|
|
|
|
u16 rx_headroom = rq->buff.headroom;
|
2018-02-07 05:46:36 -07:00
|
|
|
u32 cqe_bcnt32 = cqe_bcnt;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
struct sk_buff *skb;
|
|
|
|
void *va, *data;
|
|
|
|
u32 frag_size;
|
2018-02-07 05:46:36 -07:00
|
|
|
bool consumed;
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
|
net/mlx5e: RX, verify received packet size in Linear Striding RQ
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means
that WQE (RX descriptor) can be used for many packets and so the WQE is
much bigger than MTU. In virtualization setups where the port mtu can
be larger than the vf mtu, if received packet is bigger than MTU, it
won't be dropped by HW on too small receive WQE. If we use linear SKB in
striding RQ, since each stride has room for mtu size payload and skb
info, an oversized packet can lead to crash for crossing allocated page
boundary upon the call to build_skb. So driver needs to check packet
size and drop it.
Introduce new SW rx counter, rx_oversize_pkts_sw_drop, which counts the
number of packets dropped by the driver for being too large.
As a new field is added to the RQ struct, re-open the channels whenever
this field is being used in datapath (i.e., in the case of linear
Striding RQ).
Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 22:31:10 -06:00
|
|
|
/* Check packet size. Note LRO doesn't use linear SKB */
|
|
|
|
if (unlikely(cqe_bcnt > rq->hw_mtu)) {
|
|
|
|
rq->stats->oversize_pkts_sw_drop++;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
va = page_address(di->page) + head_offset;
|
|
|
|
data = va + rx_headroom;
|
2018-02-07 05:46:36 -07:00
|
|
|
frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt32);
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
|
|
|
|
dma_sync_single_range_for_cpu(rq->pdev, di->addr, head_offset,
|
|
|
|
frag_size, DMA_FROM_DEVICE);
|
2018-05-16 01:46:57 -06:00
|
|
|
prefetchw(va); /* xdp_frame data area */
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
prefetch(data);
|
2018-02-07 05:46:36 -07:00
|
|
|
|
|
|
|
rcu_read_lock();
|
net/mlx5e: Add XSK zero-copy support
This commit adds support for AF_XDP zero-copy RX and TX.
We create a dedicated XSK RQ inside the channel, it means that two
RQs are running simultaneously: one for non-XSK traffic and the other
for XSK traffic. The regular and XSK RQs use a single ID namespace split
into two halves: the lower half is regular RQs, and the upper half is
XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
of channels is not allowed, because it would break to mapping between
XSK RQ IDs and channels.
XSK requires different page allocation and release routines. Such
functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
generic enough to be used for both regular and XSK RQs, and they use the
mlx5e_page_{alloc,release} wrappers around the real allocation
functions. Function pointers are not used to avoid losing the
performance with retpolines. Wherever it's certain that the regular
(non-XSK) page release function should be used, it's called directly.
Only the stats that could be meaningful for XSK are exposed to the
userspace. Those that don't take part in the XSK flow are not
considered.
Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
because the newer xdpsock sample doesn't provide any Fill Ring entries
at the setup stage.
We create a dedicated XSK SQ in the channel. This separation has its
advantages:
1. When the UMEM is closed, the XSK SQ can also be closed and stop
receiving completions. If an existing SQ was used for XSK, it would
continue receiving completions for the packets of the closed socket. If
a new UMEM was opened at that point, it would start getting completions
that don't belong to it.
2. Calculating statistics separately.
When the userspace kicks the TX, the driver triggers a hardware
interrupt by posting a NOP to a dedicated XSK ICO (internal control
operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
ICO SQ is protected by a spinlock, as the userspace application may kick
the TX from any core.
Store the pointers to the UMEMs in the net device private context,
independently from the kernel. This way the driver can distinguish
between the zero-copy and non-zero-copy UMEMs. The kernel function
xdp_get_umem_from_qid does not care about this difference, but the
driver is only interested in zero-copy UMEMs, particularly, on the
cleanup it determines whether to close the XSK RQ and SQ or not by
looking at the presence of the UMEM. Use state_lock to protect the
access to this area of UMEM pointers.
LRO isn't compatible with XDP, but there may be active UMEMs while
XDP is off. If this is the case, don't allow LRO to ensure XDP can
be reenabled at any time.
The validation of XSK parameters typically happens when XSK queues
open. However, when the interface is down or the XDP program isn't
set, it's still possible to have active AF_XDP sockets and even to
open new, but the XSK queues will be closed. To cover these cases,
perform the validation also in these flows:
1. A new UMEM is registered, but the XSK queues aren't going to be
created due to missing XDP program or interface being down.
2. MTU changes while there are UMEMs registered.
Having this early check prevents mlx5e_open_channels from failing
at a later stage, where recovery is impossible and the application
has no chance to handle the error, because it got the successful
return value for an MTU change or XSK open operation.
The performance testing was performed on a machine with the following
configuration:
- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link
The results with retpoline disabled, single stream:
txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps
The results with retpoline enabled, single stream:
txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 08:35:38 -06:00
|
|
|
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, false);
|
2018-02-07 05:46:36 -07:00
|
|
|
rcu_read_unlock();
|
|
|
|
if (consumed) {
|
|
|
|
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
|
|
|
|
__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
|
|
|
|
return NULL; /* page/packet was consumed by XDP */
|
|
|
|
}
|
|
|
|
|
|
|
|
skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt32);
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
if (unlikely(!skb))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* queue up for recycling/reuse */
|
net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.
Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.
Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB,
Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, bulking method always executes 2 operations,
not more, but not less.
- In the default MTU configuration (1500, stride size is 2K),
the non-bulking method execute 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
executes no ops at all!
Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we here remove the page_ref bulking method.
Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
Single core packet rate (64 bytes).
Early drop in TC: no degradation.
XDP_DROP:
before: 14,270,188 pps
after: 20,503,603 pps, 43% improvement.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-11-28 02:03:37 -07:00
|
|
|
page_ref_inc(di->page);
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
|
|
|
|
return skb;
|
2016-04-20 13:02:15 -06:00
|
|
|
}
|
|
|
|
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
|
|
|
u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
|
|
|
|
u16 wqe_id = be16_to_cpu(cqe->wqe_id);
|
2016-09-21 03:19:43 -06:00
|
|
|
struct mlx5e_mpw_info *wi = &rq->mpwqe.info[wqe_id];
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
u16 stride_ix = mpwrq_get_cqe_stride_index(cqe);
|
|
|
|
u32 wqe_offset = stride_ix << rq->mpwqe.log_stride_sz;
|
|
|
|
u32 head_offset = wqe_offset & (PAGE_SIZE - 1);
|
|
|
|
u32 page_idx = wqe_offset >> PAGE_SHIFT;
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5e_rx_wqe_ll *wqe;
|
2018-04-02 08:23:14 -06:00
|
|
|
struct mlx5_wq_ll *wq;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
struct sk_buff *skb;
|
|
|
|
u16 cqe_bcnt;
|
|
|
|
|
|
|
|
wi->consumed_strides += cstrides;
|
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
|
2019-06-26 14:21:40 -06:00
|
|
|
trigger_report(rq, cqe);
|
2018-04-12 07:03:37 -06:00
|
|
|
rq->stats->wqe_err++;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
goto mpwrq_cqe_out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(mpwrq_is_filler_cqe(cqe))) {
|
2017-06-28 10:27:18 -06:00
|
|
|
struct mlx5e_rq_stats *stats = rq->stats;
|
|
|
|
|
|
|
|
stats->mpwqe_filler_cqes++;
|
|
|
|
stats->mpwqe_filler_strides += cstrides;
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
goto mpwrq_cqe_out;
|
|
|
|
}
|
|
|
|
|
|
|
|
cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
|
|
|
|
|
2019-06-12 04:18:35 -06:00
|
|
|
skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_linear,
|
|
|
|
mlx5e_skb_from_cqe_mpwrq_nonlinear,
|
|
|
|
rq, wi, cqe_bcnt, head_offset, page_idx);
|
2018-02-07 05:46:36 -07:00
|
|
|
if (!skb)
|
net/mlx5e: Use linear SKB in Striding RQ
Current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximises
the memory utilization.
This prevents the use of build_skb() (which requires headroom
and tailroom), and demands to memcpy the packets headers into
the skb linear part.
In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.
To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.
We can satisfy 1 and 2 by configuring:
stride size = MTU + headroom + tailoom.
This is possible only when:
a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in linear part, and not split to
small amount (linear part) and large amount (fragment).
This saves datapath cycles in driver and improves utilization
of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
assumption of equal-size fragments.
Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. does not keep headroom. To overcome this we make sure we can
extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0 as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.
This configuration now becomes default, whenever capable.
Of course, this implies turning LRO off.
Performance testing:
ConnectX-5, single core, single RX ring, default MTU.
UDP packet rate, early drop in TC layer:
--------------------------------------------
| pkt size | before | after | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
| 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
| 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------
TCP streams: ~20% gain
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-02-07 05:41:25 -07:00
|
|
|
goto mpwrq_cqe_out;
|
|
|
|
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
|
2016-09-22 11:01:46 -06:00
|
|
|
napi_gro_receive(rq->cq.napi, skb);
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
|
|
|
|
mpwrq_cqe_out:
|
2017-02-13 09:41:30 -07:00
|
|
|
if (likely(wi->consumed_strides < rq->mpwqe.num_strides))
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
return;
|
|
|
|
|
2018-04-02 08:23:14 -06:00
|
|
|
wq = &rq->mpwqe.wq;
|
|
|
|
wqe = mlx5_wq_ll_get_wqe(wq, wqe_id);
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_mpwqe(rq, wi, true);
|
2018-04-02 08:23:14 -06:00
|
|
|
mlx5_wq_ll_pop(wq, cqe->wqe_id, &wqe->next.next_wqe_index);
|
net/mlx5e: Support RX multi-packet WQE (Striding RQ)
Introduce the feature of multi-packet WQE (RX Work Queue Element)
referred to as (MPWQE or Striding RQ), in which WQEs are larger
and serve multiple packets each.
Every WQE consists of many strides of the same size, every received
packet is aligned to a beginning of a stride and is written to
consecutive strides within a WQE.
In the regular approach, each regular WQE is big enough to be capable
of serving one received packet of any size up to MTU or 64K in case of
device LRO is enabled, making it very wasteful when dealing with
small packets or device LRO is enabled.
For its flexibility, MPWQE allows a better memory utilization
(implying improvements in CPU utilization and packet rate) as packets
consume strides according to their size, preserving the rest of
the WQE to be available for other packets.
MPWQE default configuration:
Num of WQEs = 16
Strides Per WQE = 2048
Stride Size = 64 byte
The default WQEs memory footprint went from 1024*mtu (~1.5MB) to
16 * 2048 * 64 = 2MB per ring.
However, HW LRO can now be supported at no additional cost in memory
footprint, and hence we turn it on by default and get an even better
performance.
Performance tested on ConnectX4-Lx 50G.
To isolate the feature under test, the numbers below were measured with
HW LRO turned off. We verified that the performance just improves when
LRO is turned back on.
* Netperf single TCP stream:
- BW raised by 10-15% for representative packet sizes:
default, 64B, 1024B, 1478B, 65536B.
* Netperf multi TCP stream:
- No degradation, line rate reached.
* Pktgen: packet rate raised by 2-10% for traffic of different message
sizes: 64B, 128B, 256B, 1024B, and 1500B.
* Pktgen: packet loss in bursts of small messages (64byte),
single stream:
- | num packets | packets loss before | packets loss after
| 2K | ~ 1K | 0
| 8K | ~ 6K | 0
| 16K | ~13K | 0
| 32K | ~28K | 0
| 64K | ~57K | ~24K
As expected as the driver can receive as many small packets (<=64B) as
the number of total strides in the ring (default = 2048 * 16) vs. 1024
(default ring size regardless of packets size) before this feature.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20 13:02:13 -06:00
|
|
|
}
|
|
|
|
|
2015-11-18 07:30:56 -07:00
|
|
|
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
|
2015-05-28 13:28:46 -06:00
|
|
|
{
|
2015-06-23 08:14:20 -06:00
|
|
|
struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
|
2018-09-12 09:32:49 -06:00
|
|
|
struct mlx5_cqwq *cqwq = &cq->wq;
|
2017-06-19 09:11:30 -06:00
|
|
|
struct mlx5_cqe64 *cqe;
|
2016-05-10 15:29:14 -06:00
|
|
|
int work_done = 0;
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2018-01-23 03:23:26 -07:00
|
|
|
if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
|
2016-06-30 08:34:46 -06:00
|
|
|
return 0;
|
|
|
|
|
net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget
When CQE compression is enabled, compressed CQEs use the following
structure: a title is followed by one or many blocks, each containing 8
mini CQEs (except the last, which may contain fewer mini CQEs).
Due to NAPI budget restriction, a complete structure is not always
parsed in one NAPI run, and some blocks with mini CQEs may be deferred
to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call
in the beginning of mlx5e_poll_rx_cq. However, if the budget is
extremely low, some blocks may be left even after that, but the code
that follows the mlx5e_decompress_cqes_cont call doesn't check it and
assumes that a new CQE begins, which may not be the case. In such cases,
random memory corruptions occur.
An extremely low NAPI budget of 8 is used when busy_poll or busy_read is
active.
This commit adds a check to make sure that the previous compressed CQE
has been completely parsed after mlx5e_decompress_cqes_cont, otherwise
it prevents a new CQE from being fetched in the middle of a compressed
CQE.
This commit fixes random crashes in __build_skb, __page_pool_put_page
and other not-related-directly places, that used to happen when both CQE
compression and busy_poll/busy_read were enabled.
Fixes: 7219ab34f184 ("net/mlx5e: CQE compression")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-09-16 05:54:20 -06:00
|
|
|
if (rq->cqd.left) {
|
2018-09-12 09:32:49 -06:00
|
|
|
work_done += mlx5e_decompress_cqes_cont(rq, cqwq, 0, budget);
|
net/mlx5e: Fix handling of compressed CQEs in case of low NAPI budget
When CQE compression is enabled, compressed CQEs use the following
structure: a title is followed by one or many blocks, each containing 8
mini CQEs (except the last, which may contain fewer mini CQEs).
Due to NAPI budget restriction, a complete structure is not always
parsed in one NAPI run, and some blocks with mini CQEs may be deferred
to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call
in the beginning of mlx5e_poll_rx_cq. However, if the budget is
extremely low, some blocks may be left even after that, but the code
that follows the mlx5e_decompress_cqes_cont call doesn't check it and
assumes that a new CQE begins, which may not be the case. In such cases,
random memory corruptions occur.
An extremely low NAPI budget of 8 is used when busy_poll or busy_read is
active.
This commit adds a check to make sure that the previous compressed CQE
has been completely parsed after mlx5e_decompress_cqes_cont, otherwise
it prevents a new CQE from being fetched in the middle of a compressed
CQE.
This commit fixes random crashes in __build_skb, __page_pool_put_page
and other not-related-directly places, that used to happen when both CQE
compression and busy_poll/busy_read were enabled.
Fixes: 7219ab34f184 ("net/mlx5e: CQE compression")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-09-16 05:54:20 -06:00
|
|
|
if (rq->cqd.left || work_done >= budget)
|
|
|
|
goto out;
|
|
|
|
}
|
2016-05-10 15:29:14 -06:00
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
cqe = mlx5_cqwq_get_cqe(cqwq);
|
2018-12-02 06:45:53 -07:00
|
|
|
if (!cqe) {
|
|
|
|
if (unlikely(work_done))
|
|
|
|
goto out;
|
2017-06-19 09:11:30 -06:00
|
|
|
return 0;
|
2018-12-02 06:45:53 -07:00
|
|
|
}
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2017-06-19 09:11:30 -06:00
|
|
|
do {
|
2016-05-10 15:29:14 -06:00
|
|
|
if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
|
|
|
|
work_done +=
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5e_decompress_cqes_start(rq, cqwq,
|
2016-05-10 15:29:14 -06:00
|
|
|
budget - work_done);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5_cqwq_pop(cqwq);
|
2015-06-23 08:14:21 -06:00
|
|
|
|
2019-06-12 04:18:36 -06:00
|
|
|
INDIRECT_CALL_2(rq->handle_rx_cqe, mlx5e_handle_rx_cqe_mpwrq,
|
|
|
|
mlx5e_handle_rx_cqe, rq, cqe);
|
2018-09-12 09:32:49 -06:00
|
|
|
} while ((++work_done < budget) && (cqe = mlx5_cqwq_get_cqe(cqwq)));
|
2015-05-28 13:28:46 -06:00
|
|
|
|
2018-12-02 06:45:53 -07:00
|
|
|
out:
|
2018-11-21 04:56:17 -07:00
|
|
|
if (rq->xdp_prog)
|
|
|
|
mlx5e_xdp_rx_poll_complete(rq);
|
2018-04-17 08:45:21 -06:00
|
|
|
|
2018-09-12 09:32:49 -06:00
|
|
|
mlx5_cqwq_update_db_record(cqwq);
|
2015-05-28 13:28:46 -06:00
|
|
|
|
|
|
|
/* ensure cq space is freed before enabling more cqes */
|
|
|
|
wmb();
|
|
|
|
|
2015-11-18 07:30:56 -07:00
|
|
|
return work_done;
|
2015-05-28 13:28:46 -06:00
|
|
|
}
|
2017-03-24 15:52:06 -06:00
|
|
|
|
2017-04-12 21:37:04 -06:00
|
|
|
#ifdef CONFIG_MLX5_CORE_IPOIB
|
|
|
|
|
|
|
|
#define MLX5_IB_GRH_DGID_OFFSET 24
|
|
|
|
#define MLX5_GID_SIZE 16
|
|
|
|
|
|
|
|
static inline void mlx5i_complete_rx_cqe(struct mlx5e_rq *rq,
|
|
|
|
struct mlx5_cqe64 *cqe,
|
|
|
|
u32 cqe_bcnt,
|
|
|
|
struct sk_buff *skb)
|
|
|
|
{
|
2017-10-31 06:57:27 -06:00
|
|
|
struct hwtstamp_config *tstamp;
|
2018-09-02 13:12:10 -06:00
|
|
|
struct mlx5e_rq_stats *stats;
|
2017-09-14 01:27:25 -06:00
|
|
|
struct net_device *netdev;
|
2017-10-31 06:57:27 -06:00
|
|
|
struct mlx5e_priv *priv;
|
2017-04-27 08:59:00 -06:00
|
|
|
char *pseudo_header;
|
2017-09-14 01:27:25 -06:00
|
|
|
u32 qpn;
|
2017-04-12 21:37:04 -06:00
|
|
|
u8 *dgid;
|
|
|
|
u8 g;
|
|
|
|
|
2017-09-14 01:27:25 -06:00
|
|
|
qpn = be32_to_cpu(cqe->sop_drop_qpn) & 0xffffff;
|
|
|
|
netdev = mlx5i_pkey_get_netdev(rq->netdev, qpn);
|
|
|
|
|
|
|
|
/* No mapping present, cannot process SKB. This might happen if a child
|
|
|
|
* interface is going down while having unprocessed CQEs on parent RQ
|
|
|
|
*/
|
|
|
|
if (unlikely(!netdev)) {
|
|
|
|
/* TODO: add drop counters support */
|
|
|
|
skb->dev = NULL;
|
|
|
|
pr_warn_once("Unable to map QPN %u to dev - dropping skb\n", qpn);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-10-31 06:57:27 -06:00
|
|
|
priv = mlx5i_epriv(netdev);
|
|
|
|
tstamp = &priv->tstamp;
|
2018-09-02 13:12:10 -06:00
|
|
|
stats = &priv->channel_stats[rq->ix].rq;
|
2017-10-31 06:57:27 -06:00
|
|
|
|
2017-04-12 21:37:04 -06:00
|
|
|
g = (be32_to_cpu(cqe->flags_rqpn) >> 28) & 3;
|
|
|
|
dgid = skb->data + MLX5_IB_GRH_DGID_OFFSET;
|
|
|
|
if ((!g) || dgid[0] != 0xff)
|
|
|
|
skb->pkt_type = PACKET_HOST;
|
|
|
|
else if (memcmp(dgid, netdev->broadcast + 4, MLX5_GID_SIZE) == 0)
|
|
|
|
skb->pkt_type = PACKET_BROADCAST;
|
|
|
|
else
|
|
|
|
skb->pkt_type = PACKET_MULTICAST;
|
|
|
|
|
|
|
|
/* TODO: IB/ipoib: Allow mcast packets from other VFs
|
|
|
|
* 68996a6e760e5c74654723eeb57bf65628ae87f4
|
|
|
|
*/
|
|
|
|
|
|
|
|
skb_pull(skb, MLX5_IB_GRH_BYTES);
|
|
|
|
|
|
|
|
skb->protocol = *((__be16 *)(skb->data));
|
|
|
|
|
2019-01-14 08:34:28 -07:00
|
|
|
if (netdev->features & NETIF_F_RXCSUM) {
|
|
|
|
skb->ip_summed = CHECKSUM_COMPLETE;
|
|
|
|
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
|
|
|
|
stats->csum_complete++;
|
|
|
|
} else {
|
|
|
|
skb->ip_summed = CHECKSUM_NONE;
|
|
|
|
stats->csum_none++;
|
|
|
|
}
|
2017-04-12 21:37:04 -06:00
|
|
|
|
2017-10-31 06:57:27 -06:00
|
|
|
if (unlikely(mlx5e_rx_hw_stamp(tstamp)))
|
2017-08-15 04:46:04 -06:00
|
|
|
skb_hwtstamps(skb)->hwtstamp =
|
|
|
|
mlx5_timecounter_cyc2time(rq->clock, get_cqe_ts(cqe));
|
2017-06-01 05:43:43 -06:00
|
|
|
|
2017-04-12 21:37:04 -06:00
|
|
|
skb_record_rx_queue(skb, rq->ix);
|
|
|
|
|
|
|
|
if (likely(netdev->features & NETIF_F_RXHASH))
|
|
|
|
mlx5e_skb_set_hash(cqe, skb);
|
|
|
|
|
2017-04-27 08:59:00 -06:00
|
|
|
/* 20 bytes of ipoib header and 4 for encap existing */
|
|
|
|
pseudo_header = skb_push(skb, MLX5_IPOIB_PSEUDO_LEN);
|
|
|
|
memset(pseudo_header, 0, MLX5_IPOIB_PSEUDO_LEN);
|
2017-04-12 21:37:04 -06:00
|
|
|
skb_reset_mac_header(skb);
|
2017-04-27 08:59:00 -06:00
|
|
|
skb_pull(skb, MLX5_IPOIB_HARD_LEN);
|
2017-04-12 21:37:04 -06:00
|
|
|
|
|
|
|
skb->dev = netdev;
|
|
|
|
|
2018-04-12 07:03:37 -06:00
|
|
|
stats->packets++;
|
|
|
|
stats->bytes += cqe_bcnt;
|
2017-04-12 21:37:04 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
struct mlx5e_wqe_frag_info *wi;
|
2017-04-12 21:37:04 -06:00
|
|
|
struct sk_buff *skb;
|
|
|
|
u32 cqe_bcnt;
|
2018-04-02 08:31:31 -06:00
|
|
|
u16 ci;
|
2017-04-12 21:37:04 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
ci = mlx5_wq_cyc_ctr2ix(wq, be16_to_cpu(cqe->wqe_counter));
|
|
|
|
wi = get_frag(rq, ci);
|
|
|
|
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
|
2017-04-12 21:37:04 -06:00
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
|
|
|
|
rq->stats->wqe_err++;
|
|
|
|
goto wq_free_wqe;
|
|
|
|
}
|
|
|
|
|
2019-06-12 04:18:35 -06:00
|
|
|
skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
|
|
|
|
mlx5e_skb_from_cqe_linear,
|
|
|
|
mlx5e_skb_from_cqe_nonlinear,
|
|
|
|
rq, cqe, wi, cqe_bcnt);
|
2017-04-12 21:37:04 -06:00
|
|
|
if (!skb)
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
goto wq_free_wqe;
|
2017-04-12 21:37:04 -06:00
|
|
|
|
|
|
|
mlx5i_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
|
2017-09-14 01:27:25 -06:00
|
|
|
if (unlikely(!skb->dev)) {
|
|
|
|
dev_kfree_skb_any(skb);
|
|
|
|
goto wq_free_wqe;
|
|
|
|
}
|
2017-04-12 21:37:04 -06:00
|
|
|
napi_gro_receive(rq->cq.napi, skb);
|
|
|
|
|
net/mlx5e: Introduce RX Page-Reuse
Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.
A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.
In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed. In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.
Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).
Performance tests:
iperf tcp tests show huge gain:
--------------------------------------------
num streams | BW before | BW after | ratio |
1 | 22.2 | 30.9 | 1.39x |
8 | 64.2 | 93.6 | 1.46x |
64 | 56.7 | 91.4 | 1.61x |
--------------------------------------------
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-01-29 08:42:26 -07:00
|
|
|
wq_free_wqe:
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_wqe(rq, wi, true);
|
2018-04-02 08:31:31 -06:00
|
|
|
mlx5_wq_cyc_pop(wq);
|
2017-04-12 21:37:04 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_MLX5_CORE_IPOIB */
|
2017-06-19 05:04:36 -06:00
|
|
|
|
|
|
|
#ifdef CONFIG_MLX5_EN_IPSEC
|
|
|
|
|
|
|
|
void mlx5e_ipsec_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
|
|
|
|
{
|
2018-04-02 08:31:31 -06:00
|
|
|
struct mlx5_wq_cyc *wq = &rq->wqe.wq;
|
2017-06-19 05:04:36 -06:00
|
|
|
struct mlx5e_wqe_frag_info *wi;
|
|
|
|
struct sk_buff *skb;
|
|
|
|
u32 cqe_bcnt;
|
2018-04-02 08:31:31 -06:00
|
|
|
u16 ci;
|
2017-06-19 05:04:36 -06:00
|
|
|
|
2018-04-02 08:31:31 -06:00
|
|
|
ci = mlx5_wq_cyc_ctr2ix(wq, be16_to_cpu(cqe->wqe_counter));
|
|
|
|
wi = get_frag(rq, ci);
|
|
|
|
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
|
2017-06-19 05:04:36 -06:00
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
|
|
|
|
rq->stats->wqe_err++;
|
|
|
|
goto wq_free_wqe;
|
|
|
|
}
|
|
|
|
|
2019-06-12 04:18:35 -06:00
|
|
|
skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
|
|
|
|
mlx5e_skb_from_cqe_linear,
|
|
|
|
mlx5e_skb_from_cqe_nonlinear,
|
|
|
|
rq, cqe, wi, cqe_bcnt);
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(!skb)) /* a DROP, save the page-reuse checks */
|
|
|
|
goto wq_free_wqe;
|
|
|
|
|
2018-07-13 05:33:52 -06:00
|
|
|
skb = mlx5e_ipsec_handle_rx_skb(rq->netdev, skb, &cqe_bcnt);
|
2019-06-14 16:21:15 -06:00
|
|
|
if (unlikely(!skb))
|
|
|
|
goto wq_free_wqe;
|
2017-06-19 05:04:36 -06:00
|
|
|
|
|
|
|
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
|
|
|
|
napi_gro_receive(rq->cq.napi, skb);
|
|
|
|
|
2019-06-14 16:21:15 -06:00
|
|
|
wq_free_wqe:
|
2018-06-12 01:09:24 -06:00
|
|
|
mlx5e_free_rx_wqe(rq, wi, true);
|
2018-04-02 08:31:31 -06:00
|
|
|
mlx5_wq_cyc_pop(wq);
|
2017-06-19 05:04:36 -06:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_MLX5_EN_IPSEC */
|