Commit Graph

162 Commits (2e2d6f0342be7f73a34526077fa96f42f0e8c661)

Author SHA1 Message Date
David S. Miller 2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Tariq Toukan 37fdffb217 net/mlx5: WQ, fixes for fragmented WQ buffers API
The mlx5e netdevice used to calculate fragment edges by a call to
mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct
indication for queues smaller than a PAGE_SIZE (broken by default on
PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new
calls/API.

Since (TX/RX) Work Queue buffers are fragmented, we introduce changes
to the core driver API: a function that takes a stride index and
returns the index of the last stride on the same fragment, and an
additional wrapping function that returns the number of physically
contiguous strides that can be written contiguously to the work queue.
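
As an illustrative sketch only (the contiguity helper name is assumed
from the description above, and the NOP-fill helper is hypothetical),
a producer guarding against fragment edges might look like:

    /* Sketch: before writing a multi-stride WQE, check how many
     * contiguous strides remain on the current fragment, and NOP-fill
     * to the edge when the WQE would not fit.
     */
    static void post_wqe_sketch(struct mlx5_wq_cyc *wq, u16 pi, u16 n_wqebbs)
    {
            u16 contig = mlx5_wq_cyc_get_contig_wqebbs(wq, pi); /* assumed API */

            if (unlikely(contig < n_wqebbs))
                    fill_nops_to_frag_edge(wq, pi, contig); /* hypothetical */
            /* ...write the n_wqebbs-sized WQE at the resulting position... */
    }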

This obsoletes the following API functions, and their buggy
usage in the EN driver:
* mlx5_wq_cyc_get_frag_size()
* mlx5_wq_cyc_ctr2fragix()

The new API improves modularity and hides the details of this
calculation from the mlx5e netdevice and mlx5_ib RDMA drivers.

The new calculation is also more efficient, and improves performance
as follows:

Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size.

Before: 16,477,619 pps
After:  17,085,793 pps

3.7% improvement

Fixes: 3a2f703312 ("net/mlx5: Use order-0 allocations for all WQ types")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-10 18:26:16 -07:00
Or Gerlitz b856df28f9 net/mlx5e: Allow reporting of checksum unnecessary
Currently we practically never report checksum unnecessary, because
for all IP packets we take the checksum complete path.

Enable non-default runs that report checksum unnecessary, using
an ethtool private flag. This can be useful for performance
evaluations and other explorations.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 11:32:47 -07:00
Or Gerlitz b820e6fb09 net/mlx5e: Enable reporting checksum unnecessary also for L3 packets
We can also report checksum unnecessary when the L3 checksum
flag on the CQE is set and there is no L4 header.
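
As an illustration only (the CQE accessors here are hypothetical; the
real layout lives in the mlx5 headers), the check described above
amounts to:

    /* Sketch: HW validated L3 and there is no L4 header, so
     * CHECKSUM_UNNECESSARY is a safe report.
     */
    if (cqe_l3_ok(cqe) && !cqe_has_l4_hdr(cqe)) {  /* assumed accessors */
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            skb->csum_level = 0;  /* only one level (L3) was verified */
    }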

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01 11:32:46 -07:00
Alaa Hleihel fe1dc06999 net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets
CHECKSUM_COMPLETE is not applicable to SCTP protocol.
Setting it for SCTP packets leads to CRC32c validation failure.
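
A minimal sketch of the fix's shape (the label and the ip_proto
variable are assumed, not the driver's exact code):

    /* Sketch: SCTP uses CRC32c, which cannot be derived from the
     * ones'-complement sum in CHECKSUM_COMPLETE, so bail out for SCTP.
     */
    if (unlikely(ip_proto == IPPROTO_SCTP))
            goto csum_unnecessary;  /* skip the CHECKSUM_COMPLETE path */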

Fixes: bbceefce9a ("net/mlx5e: Support RX CHECKSUM_COMPLETE")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-05 21:14:57 -07:00
Natali Shechtman f007c13d4a net/mlx5e: Set ECN for received packets using CQE indication
In the multi-host (MH) NIC scheme, a single HW port serves multiple
hosts, or multiple sockets on the same host.
The HW uses a mechanism in the PCIe buffer which monitors
the amount of consumed PCIe buffers per host.
In a certain configuration, under congestion,
the HW emulates a switch doing ECN marking on packets, using the ECN
indication on the completion descriptor (CQE).

The driver needs to set the ECN bits on the packet SKB,
so that the network stack can react to that; this commit does that.
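
A hedged sketch of the SKB-side action (the CQE flag parameter is
assumed; INET_ECN_set_ce() is the stack's standard helper for marking
CE in the IP header):

    #include <net/inet_ecn.h>

    /* Sketch: mirror the HW's ECN indication into the packet headers. */
    static void rx_apply_cqe_ecn(struct sk_buff *skb, bool cqe_ecn_ce /* assumed */)
    {
            if (cqe_ecn_ce)
                    INET_ECN_set_ce(skb); /* sets CE on the IPv4/IPv6 header */
    }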

Signed-off-by: Natali Shechtman <natali@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-09-05 21:14:57 -07:00
Feras Daoud 19052a3b77 net/mlx5e: IPoIB, Use priv stats in completion rx flow
Since the RQs are shared between all pkey interfaces, the stats
should be taken from where the per-ring stats are stored instead
of the parent RQ.

Fixes: 4c6c615e3f ("net/mlx5e: IPoIB, Add PKEY child interface nic profile")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 16:22:42 -07:00
Tariq Toukan d3398a4f18 net/mlx5e: RX, Prefetch the xdp_frame data area
A loaded XDP program might write to the xdp_frame data area,
so prefetchw() it to avoid a potential cache miss.
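
In sketch form (assuming the frame's virtual address is in scope as
'va'):

    #include <linux/prefetch.h>

    /* Sketch: request the cache line in exclusive state up front,
     * since the XDP program may write to the frame data.
     */
    prefetchw(va); /* 'va' = virtual address of the frame data (assumed) */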

Performance tests:
ConnectX-5, XDP_TX packet rate, single ring.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Before: 13,172,976 pps
After:  13,456,248 pps
2% gain.

Fixes: 22f4539881 ("net/mlx5e: Support XDP over Striding RQ")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:57 -07:00
Tariq Toukan dac0d15fff net/mlx5e: Re-order fields of struct mlx5e_xdpsq
In the downstream patch that adds support for XDP_REDIRECT-out,
the XDP xmit frame function doesn't share the same run context as
the NAPI that polls the XDP-SQ completion queue.

Hence, we need to re-order the XDP-SQ fields to avoid cacheline
false-sharing.

Take redirect_flush and doorbell out of DB, into separate
cachelines.

Add a cacheline breaker within the stats struct.
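
A sketch of the layout idea (the field set here is illustrative, not
the actual driver struct):

    /* Sketch: keep fields dirtied by different contexts on separate
     * cachelines to avoid false sharing.
     */
    struct xdpsq_layout_sketch {
            /* dirtied by the xmit (XDP_REDIRECT-out) context */
            u16 pc ____cacheline_aligned_in_smp;
            /* dirtied by the NAPI completion-polling context */
            u16 cc ____cacheline_aligned_in_smp;
            /* flags consumed at the end of NAPI, on their own line */
            bool redirect_flush ____cacheline_aligned_in_smp;
            bool doorbell;
    };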

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:56 -07:00
Tariq Toukan 159d213134 net/mlx5e: Move XDP related code into new XDP files
Take XDP code out of the general EN header and RX file into
new XDP files.

Currently, the XDP-SQ resides only within an RQ and is used from a
single flow (XDP_TX) triggered upon RX completions.
In a downstream patch, an additional type of XDP-SQ instance will be
introduced and used for the XDP_REDIRECT flow, totally unrelated to
the RX context.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:53 -07:00
Tariq Toukan cb5189d173 net/mlx5e: Do not recycle RX pages in interface down flow
Keep all page-pool recycle calls within NAPI context.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:51 -07:00
Tariq Toukan afab995e06 net/mlx5e: Replace call to MPWQE free with dealloc in interface down flow
No need to expose the MPWQE free function to the control path.
The dealloc function is already exposed; use it.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26 15:23:51 -07:00
Boris Pismenny b3ccf97813 net/mlx5e: IPsec, fix byte count in CQE
This patch fixes the byte count indication in CQE for processed IPsec
packets that contain a metadata header.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:28 -07:00
Boris Pismenny 00aebab27c net/mlx5e: TLS, add Innova TLS rx data path
Implement the TLS rx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

A special metadata ethertype is used to pass information to
the hardware.

When the hardware loses synchronization, a special resync request
metadata message is used to request a resync.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Tariq Toukan b71ba6b46f net/mlx5e: Add counter for MPWQE filler strides
Add an ethtool counter to indicate the number of strides consumed
by filler CQEs.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:19 -07:00
Tariq Toukan dc983f0e2b net/mlx5e: Add a counter for congested UMRs
Add per-ring and global ethtool counters for congested UMR requests.
These events indicate congestion in UMR handlers in HW.

Such an event is counted when there is an outstanding UMR post,
yet the SW has consumed at least two additional MPWQEs in the meanwhile.
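
Stated as pseudo-logic (the state fields are assumed for illustration;
only the counter name follows the commit's wording):

    /* Sketch: a UMR is still outstanding while the SW has already
     * consumed two or more further MPWQEs -> count a congestion event.
     */
    if (rq->umr_in_progress && mpwqes_consumed_since_post >= 2) /* assumed */
            stats->congst_umr++;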

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:18 -07:00
Tariq Toukan cbe73aaeec net/mlx5e: Add XDP_TX completions statistics
Add per-ring and global ethtool counters for XDP_TX completions.
This helps us monitor and analyze XDP_TX flow performance.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:17 -07:00
Tariq Toukan c400028371 net/mlx5e: RX, Use existing WQ local variable
Local variable 'wq' already points to &sq->wq; use it.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28 14:44:16 -07:00
Tariq Toukan 069d11465a net/mlx5e: RX, Enhance legacy Receive Queue memory scheme
Enhance the memory scheme of the legacy RQ, such that
only order-0 pages are used.

Whenever possible, prefer using a linear SKB, and build it by
wrapping the WQE buffer.

Otherwise (for example, jumbo frames on x86), use a non-linear SKB,
with as many frags as needed. In this case, multiple WQE
scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.

This implies removing support for HW LRO in legacy RQ, as it would
require a large number of page allocations and scatter entries per WQE
on archs with PAGE_SIZE = 4KB, yielding bad performance.

In earlier patches, we guaranteed that all completions are in-order,
and that we use a cyclic WQ.
This creates an opportunity for a performance optimization:
The mapping between a "struct mlx5e_dma_info" and the
WQEs (struct mlx5e_wqe_frag_info) pointing to it is constant
across different cycles of a WQ. This allows initializing
the mapping at RQ creation time, rather than handling it
in the datapath.

A struct mlx5e_dma_info that is shared between different WQEs
is allocated by the first WQE, and freed by the last one.
This implies an important requirement: WQEs that share the same
struct mlx5e_dma_info must be posted within the same NAPI.
Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
point to the new struct mlx5e_dma_info, not the one that was posted
(and the HW wrote to).
This bulking requirement is actually also good for performance reasons,
hence we extend the bulk beyond the minimal requirement above.
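
A sketch of how such a shared-page scheme can be expressed (the struct
shape is assumed from the description above, not the driver's exact
definition):

    /* Sketch: several consecutive WQE frags point at one page; only
     * the first frag allocates it and only the last one releases it.
     */
    struct wqe_frag_sketch {
            struct mlx5e_dma_info *di; /* shared page + dma address */
            u32 offset;                /* frag's offset within the page */
            bool last_in_page;         /* free the page when this frag completes */
    };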

With this memory scheme, the RQ's memory footprint is reduced by a
factor of 2 on x86, and by a factor of 32 on PowerPC.
Same factors apply for the number of pages in a GRO session.

Performance tests:
ConnectX-4, single core, single RX ring, default MTU.

x86:
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Packet rate (early drop in TC): no degradation
TCP streams: ~5% improvement

PowerPC:
CPU: POWER8 (raw), altivec supported

Packet rate (early drop in TC): 20% gain
TCP streams: 25% gain

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:15 -07:00
Tariq Toukan 99cbfa93a6 net/mlx5e: RX, Use cyclic WQ in legacy RQ
Now that LRO is not supported for Legacy RQ, there is no source of
out-of-order completions in the WQ, and we can use a cyclic one.
This has multiple advantages:
- reduces the WQE size (smaller PCI transactions).
- lower overhead in datapath (no handling of 'next' pointers).
- no reserved WQE for the WQ head (was needed in the linked-list).
- allows using a constant map between frag and dma_info struct, in a downstream patch.
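
The pointer-free advance mentioned above reduces to a mask, as in this
sketch (the helper name is illustrative):

    /* Sketch: with a power-of-two cyclic WQ, the entry index is a
     * masked running counter -- no 'next' pointer chasing.
     */
    static inline u16 wq_ctr2ix_sketch(u16 ctr, u16 sz_m1)
    {
            return ctr & sz_m1; /* sz_m1 == wq_size - 1 */
    }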

Performance tests:
ConnectX-4, single core, single RX ring.
Major gain in packet rate of single ring XDP drop.
The bottleneck is shifted from HW (at 16Mpps) to SW (at 20Mpps).

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:15 -07:00
Tariq Toukan 422d4c401e net/mlx5e: RX, Split WQ objects for different RQ types
Replace the common RQ WQ object with two separate ones for the
different RQ types.
This is in preparation for switching to using a cyclic WQ type
in Legacy RQ.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:15 -07:00
Tariq Toukan 386471f16b net/mlx5e: RX, Dedicate a function for copying SKB header
Extract the logic of copying the packet header into the SKB linear
part into a generic function. The function handles copy length
alignment and dma buffer sync.

It is currently called only within the MPWQE flow.
In a downstream patch, it will be called within the legacy RQ flow
as well.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:14 -07:00
Tariq Toukan fa698366b7 net/mlx5e: RX, Generalise function of SKB frag addition
Rename it and pass truesize as an extra argument, as it will also be
used in Legacy RQ in a downstream patch.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:14 -07:00
Tariq Toukan 75aa889fb9 net/mlx5e: RX, Generalise name of non-linear SKB head size
Make the name more generic by dropping MPWRQ from it, as it will
also be used in Legacy RQ in a downstream patch.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01 16:48:14 -07:00
David S. Miller 874fcf1de6 mlx5e-updates-2018-05-25
Merge tag 'mlx5e-updates-2018-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5e-updates-2018-05-25

This series includes updates for mlx5e netdev driver.

1) Allow flow based VF vport mirroring under the sriov switchdev scheme;
added support for offloading the TC mirred mirror sub-action, from
Chris Mi.

=================
From: Or Gerlitz <ogerlitz@mellanox.com>

The user will typically set the actions order such that the mirror
port (mirror VF) sees packets as the original port (VF under
mirroring) sent them or as it will receive them. In the general case,
it means that packets are potentially sent to the mirror port before
or after some actions were applied on them.

To properly do that, we follow the exact action order as set for
the flow and make sure this will also be the case when we program the
HW offload.

If all the actions should apply before forwarding to the mirror and dest port,
mirroring is just multicasting to the two vports. Otherwise, we split
the TC flow into two HW rules, where the 1st applies only the actions
needed up to the mirror (if there are such) and the 2nd applies the
rest of the actions plus the forwarding to the dest vport.
=================

2) Move to order-0 only allocations (using fragmented work queues) for all
work queues used by the driver, RX and TX descriptor rings
(RQs, SQs and Completion Queues (CQs)), from Tariq Toukan.

3) Avoid resetting netdevice statistics on netdevice
state changes, from Eran Ben Elisha.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-29 09:45:13 -04:00
David S. Miller 5b79c2af66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of easy overlapping changes in the conflict
resolutions here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26 19:46:15 -04:00
Eran Ben Elisha 05909babce net/mlx5e: Avoid reset netdev stats on configuration changes
Move all RQ, SQ and channel counters from the channel objects into the
priv structure.  With this change, counters will not be reset upon
channel configuration changes.

Channel statistics for SQs associated with TCs higher than zero will
be presented in ethtool -S only for SQs which were opened at least
once since the module was loaded (regardless of their current
open/close status).  This is done in order to decrease the total
amount of statistics presented and calculated for the common
out-of-box use case (no QoS).

mlx5e_channel_stats is a compound of CH, RQ, and SQ stats, in order
to create locality for the NAPI when handling TX and RX of the same
channel.

Align the new per-ring statistics struct to avoid several channels
updating the same cache line at the same time.
Packet rate was tested; no degradation was sensed.
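
A sketch of the grouping and alignment (member structs are
placeholders, and the per-TC bound is assumed):

    /* Sketch: one stats block per channel, cacheline-aligned so that
     * different channels never write to the same line.
     */
    struct rq_stats_sketch { u64 packets, bytes; };
    struct sq_stats_sketch { u64 packets, bytes; };
    struct ch_stats_sketch { u64 events; };

    struct channel_stats_sketch {
            struct rq_stats_sketch rq;
            struct sq_stats_sketch sq[8]; /* one per TC, bound assumed */
            struct ch_stats_sketch ch;
    } ____cacheline_aligned_in_smp;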

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
CC: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25 16:14:28 -07:00
Tariq Toukan 3a2f703312 net/mlx5: Use order-0 allocations for all WQ types
Complete the transition of all WQ types to use fragmented
order-0 coherent memory instead of high-order allocations.

CQ-WQ already uses order-0.
Here we do the same for cyclic and linked-list WQs.

This allows the driver to load cleanly on systems with highly
fragmented coherent memory.

Performance tests:
ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate of 64B packets, single transmit ring, size 8K.

No degradation is sensed.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25 14:11:00 -07:00
Tariq Toukan 043dc78ecf net/mlx5e: TX, Use actual WQE size for SQ edge fill
We fill the SQ edge with NOPs to avoid WQE wraps.
Here, instead of doing that in advance for the maximum possible
WQE size, we do it on-demand, using the actual WQE size.
We re-order some parts of mlx5e_sq_xmit to finish the calculation
of the WQE size (ds_cnt) before doing any writes to the WQE buffer.

When the SQ work queue is fragmented (introduced in a downstream
patch), dealing with WQE wraps becomes more frequent. This change
drastically reduces the overhead in that case.

Performance tests:
ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Packet rate of 64B packets, single transmit ring, size 8K.

Before: 14.9 Mpps
After:  15.8 Mpps

Improvement of 6%.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25 14:11:00 -07:00
Tariq Toukan ddf385e31f net/mlx5e: Use WQ API functions instead of direct fields access
Use the WQ API to get the WQ size, and to map a counter
into a WQ entry index.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25 14:11:00 -07:00
Eran Ben Elisha 902a545904 net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
When the RXFCS feature is enabled, the HW does not strip the FCS data;
however, the FCS is not included in the checksum calculated by the HW.

Fix that by manually calculating the FCS checksum and adding it to the SKB
checksum field.

Add a helper function to find the FCS data for all SKB forms (linear,
one fragment or more).
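
In sketch form (the FCS-gathering helper is hypothetical;
csum_block_add() and ETH_FCS_LEN are standard kernel facilities):

    #include <linux/if_ether.h>
    #include <net/checksum.h>

    /* Sketch: fold the 4 FCS bytes into the HW's CHECKSUM_COMPLETE
     * value, which does not cover them.
     */
    u32 fcs = get_fcs_bytes_sketch(skb); /* hypothetical helper */
    skb->csum = csum_block_add(skb->csum, (__force __wsum)fcs,
                               skb->len - ETH_FCS_LEN);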

Fixes: 102722fc68 ("net/mlx5e: Add support for RXFCS feature flag")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24 14:40:39 -07:00
Gal Pressman 0e5c04f6b5 net/mlx5e: Remove MLX5E_TEST_BIT macro
The MLX5E_TEST_BIT macro is the same as the already existing test_bit;
remove it and replace all usages.
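
For instance, a former MLX5E_TEST_BIT(state, flag) call becomes the
generic bitop (the state bit shown is one the driver defines):

    #include <linux/bitops.h>

    if (!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state))
            return false; /* ring not active, nothing to poll */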

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14 15:10:21 -07:00
Tariq Toukan efb6d7a20c net/mlx5e: Use bool as return type for mlx5e_xdp_handle
The function returns boolean values; use bool instead of int.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14 15:10:21 -07:00
Tariq Toukan bd206fd52e net/mlx5e: Use u8 instead of int for LRO number of segments
The range of the LRO number of segments fits in a u8.
Also, bring the declaration and initialization together to
save code.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14 15:10:21 -07:00
Jesper Dangaard Brouer 039930945a xdp: transition into using xdp_frame for return API
Changing the API xdp_return_frame() to take struct xdp_frame as its
argument seems like a natural choice. But there are some subtle
performance details here that need extra care; the handling chosen is
deliberate.

When xdp_frame is de-referenced on a remote CPU during DMA-TX
completion, the cache-line changes to the "Shared" state. Later, when
the page is reused for RX, this xdp_frame cache-line is written, which
changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference the xdp_frame.

It is only the ixgbe driver that had an optimization in which it can
avoid the de-reference of xdp_frame.  The driver already has a
TX-ring queue, which (in case of remote DMA-TX completion) has to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used to tell the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate for the ixgbe driver.

V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer 60bbf7eeef mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page returning the page through the page allocator, and, at the
same time, have pages getting returned to the page_pool from the
ndo_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both an
rhashtable ID-lookup and a locked return into the (page_pool) ptr_ring.
(The plan is to remove these per-frame operations in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out of ixgbe,
with xdp_redirect_map (using devmap). The target/maximum capability
of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP-redirected frames were returned via
the page allocator.  The single-flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because
we hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered for the xdp_return_frame API:
DMA-TX completion running on the same CPU, and cross-CPU free/return.
Results were same-CPU=10Mpps and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason the max target isn't reached in the cross-CPU test is
likely the RX-ring DMA unmap/map overhead (which doesn't occur in
ixgbe-to-ixgbe testing).  It is also planned to remove this
unnecessary DMA unmap in a later patchset.

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   ERR_PTR, as this simplifies error handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

V8: Adjust for b0cedc844c ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e892754 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78a3 ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V10: Req from Tariq
 - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:29 -04:00
Jesper Dangaard Brouer 5168d73201 mlx5: basic XDP_REDIRECT forward support
This implements basic XDP redirect support in the mlx5 driver.

Notice that ndo_xdp_xmit() is NOT implemented, because that API
needs some changes that this patchset is working towards.

The main purpose of this patch is to have different drivers doing
XDP_REDIRECT, to show how different memory models behave in a
cross-driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap the page before
xdp_do_redirect, as the return API does not exist yet to keep it
mapped.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing;
introduce the xdpsq.db.redirect_flush boolean.

V9: Adjust for commit 121e892754 ("net/mlx5e: Refactor RQ XDP_TX indication")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 10:50:27 -04:00
Tariq Toukan ab966d7e4f net/mlx5e: RX, Recycle buffer of UMR WQEs
Upon a new UMR post, check whether the WQE buffer contains
a previous UMR WQE. If so, modify only the dynamic fields
instead of overwriting the whole WQE. This saves a memcpy.

In the current setting, after 2 WQ cycles (12 UMR posts),
this will always be the case.

No degradation sensed.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:55:07 -07:00
Tariq Toukan b8a98a4cf3 net/mlx5e: Keep single pre-initialized UMR WQE per RQ
All UMR WQEs of an RQ share many common fields. We use
pre-initialized structures to save calculations in the datapath.
One field (xlt_offset) was the only reason we kept a pre-initialized
copy per WQE index.
Here we remove its initialization (moving its calculation to the
datapath), and reduce the number of copies to one per RQ.

A very small datapath calculation is added; it occurs once per MPWQE
(i.e. once every 256KB), but it reduces memory consumption and gives
better cache utilization.

Performance testing:
Tested packet rate, no degradation sensed.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:55:07 -07:00
Tariq Toukan 9f9e9cd50e net/mlx5e: Remove page_ref bulking in Striding RQ
When many packets reside on the same page, the bulking of
page_ref modifications reduces the total number of atomic
operations executed.

Besides the necessary 2 operations on page alloc/free, we
have the following extra ops per page:
- one on WQE allocation (bump refcnt to maximum possible),
- zero ops for SKBs,
- one on WQE free,
a constant of two operations in total, no matter how many
packets/SKBs actually populate the page.

Without this bulking, we have:
- no ops on WQE allocation or free,
- one op per SKB.

Comparing the two methods when PAGE_SIZE is 4K:
- As mentioned above, the bulking method always executes 2 operations,
  no more and no less.
- In the default MTU configuration (1500, stride size is 2K),
  the non-bulking method executes 2 ops as well.
- For larger MTUs with stride size of 4K, non-bulking method
  executes only a single op.
- For XDP (stride size of 4K, no SKBs), non-bulking method
  executes no ops at all!

Hence, to optimize the flows with linear SKB and XDP over Striding RQ,
we remove the page_ref bulking method here.
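
So, as a sketch, the per-SKB path becomes a plain single-reference
bump instead of a bulk bump at WQE allocation ('di' is assumed to be
the page descriptor in scope):

    #include <linux/page_ref.h>

    /* Sketch: one refcount op per SKB/fragment consuming the page;
     * linear-SKB and XDP flows then need no extra ops at all.
     */
    page_ref_inc(di->page);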

Performance testing:
ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.

Single core packet rate (64 bytes).

Early drop in TC: no degradation.

XDP_DROP:
before: 14,270,188 pps
after:  20,503,603 pps, 43% improvement.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:55:06 -07:00
Tariq Toukan 22f4539881 net/mlx5e: Support XDP over Striding RQ
Add XDP support over Striding RQ.
Now that linear SKB is supported over Striding RQ,
we can support XDP by setting stride size to PAGE_SIZE
and headroom to XDP_PACKET_HEADROOM.
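
In sketch form, the two parameters that make a stride XDP-capable
(the variable names are illustrative):

    #include <uapi/linux/bpf.h> /* XDP_PACKET_HEADROOM */

    /* Sketch: one packet per page-sized stride, with the standard XDP
     * headroom reserved in front of the data.
     */
    stride_size = PAGE_SIZE;
    rq_headroom = XDP_PACKET_HEADROOM;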

Upon an MPWQE free, do not release pages that are in flight for
XDP xmit; they will be released upon completion.

Striding RQ is capable of a higher packet-rate than
conventional RQ.
A performance gain is expected for all cases that had
a HW packet-rate bottleneck. This is the case whenever
using many flows that distribute to many cores.

Performance testing:
ConnectX-5, 24 rings, default MTU.
CQE compression ON (to reduce completions BW in PCI).

XDP_DROP packet rate:
--------------------------------------------------
| pkt size | XDP rate   | 100GbE linerate | pct% |
--------------------------------------------------
|   64byte | 126.2 Mpps |      148.0 Mpps |  85% |
|  128byte |  80.0 Mpps |       84.8 Mpps |  94% |
|  256byte |  42.7 Mpps |       42.7 Mpps | 100% |
|  512byte |  23.4 Mpps |       23.4 Mpps | 100% |
--------------------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:55:06 -07:00
Tariq Toukan 121e892754 net/mlx5e: Refactor RQ XDP_TX indication
Make the xdp_xmit indication available for Striding RQ
by taking it out of the type-specific union.
This refactor is a preparation for a downstream patch that
adds XDP support over Striding RQ.
In addition, use a bitmap instead of a boolean for possible
future flags.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:55:05 -07:00
Tariq Toukan 619a8f2a42 net/mlx5e: Use linear SKB in Striding RQ
The current Striding RQ HW feature utilizes the RX buffers so that
there is no wasted room between the strides. This maximizes
memory utilization, but prevents the use of build_skb() (which
requires headroom and tailroom) and demands memcpying the packet
headers into the skb linear part.

In this patch, whenever a set of conditions holds, we apply
an RQ configuration that allows combining the use of linear SKB
on top of a Striding RQ.

To use build_skb() with Striding RQ, the following must hold:
1. packet does not cross a page boundary.
2. there is enough headroom and tailroom surrounding the packet.

We can satisfy 1 and 2 by configuring:
	stride size = MTU + headroom + tailroom.

This is possible only when:
a. (MTU + headroom + tailroom) does not exceed PAGE_SIZE.
b. HW LRO is turned off.
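
These conditions boil down to a single check, sketched here (the
variable names are assumed):

    /* Sketch: a whole stride (headroom + MTU + tailroom) must fit in
     * one page, and HW LRO must be off.
     */
    bool linear_ok = !lro_en &&
                     (headroom + mtu + tailroom) <= PAGE_SIZE;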

Using linear SKB has many advantages:
- Saves a memcpy of the headers.
- No page-boundary checks in datapath.
- No filler CQEs.
- Significantly smaller CQ.
- SKB data continuously resides in the linear part, not split into a
  small amount (linear part) and a large amount (fragment).
  This saves datapath cycles in the driver and improves utilization
  of SKB fragments in GRO.
- The fragments of a resulting GRO SKB follow the IP forwarding
  assumption of equal-size fragments.

Some implementation details:
HW writes the packets to the beginning of a stride,
i.e. it does not keep headroom. To overcome this, we make sure we
can extend backwards and use the last bytes of stride i-1.
Extra care is needed for stride 0, as it has no preceding stride.
We make sure headroom bytes are available by shifting the buffer
pointer passed to HW by headroom bytes.

This configuration now becomes the default, whenever capable.
Of course, this implies turning LRO off.

Performance testing:
ConnectX-5, single core, single RX ring, default MTU.

UDP packet rate, early drop in TC layer:

--------------------------------------------
| pkt size | before    | after     | ratio |
--------------------------------------------
| 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x |
|  500byte | 5.23 Mpps | 5.97 Mpps | 1.14x |
|   64byte | 5.94 Mpps | 5.96 Mpps | 1.00x |
--------------------------------------------

TCP streams: ~20% gain

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:54:49 -07:00
Tariq Toukan ea3886cab7 net/mlx5e: Use inline MTTs in UMR WQEs
When modifying the page mapping of a HW memory region
(via a UMR post), post the new values inlined in the WQE,
instead of using a data pointer.

This is a micro-optimization: inline UMR WQEs of different
rings scale better in HW.

In addition, this obsoletes a few control flows and helps
delete ~50 LOC.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:16:17 -07:00
Tariq Toukan e4d86a4a58 net/mlx5e: Do not busy-wait for UMR completion in Striding RQ
Do not busy-wait on a pending UMR completion. Under high HW load,
busy-waiting on a delayed completion would fully utilize the CPU core
and mistakenly indicate a SW bottleneck.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:16:17 -07:00
Tariq Toukan 18187fb2c3 net/mlx5e: Code movements in RX UMR WQE post
Gather the process of a UMR WQE post into one function,
in preparation for a downstream patch that inlines
the WQE data.
No functional change here.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:16:17 -07:00
Tariq Toukan 472a1e44b3 net/mlx5e: Save MTU in channels params
Knowing the MTU is required for the RQ creation flow.
By our design, the channels creation flow is totally isolated
from priv/netdev, and can be completed with access to the
channels params and mdev alone.
Adding the MTU to the channels params helps preserve that.
In addition, we save it in the RQ to make its access faster in
datapath checks.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-30 16:16:17 -07:00
Tariq Toukan 24fd07abfa net/mlx5e: Use no-offset function in skb header copy
In copying the skb header to skb->data, replace the call to
skb_copy_to_linear_data_offset() with a zero offset with a call
to the no-offset function skb_copy_to_linear_data().

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-27 17:17:27 -07:00
Tariq Toukan bd658dda42 net/mlx5e: Separate dma base address and offset in dma_sync call
Pass the base dma address and offset to dma_sync_single_range_for_cpu(),
instead of doing the pre-calculation.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-03-27 17:17:27 -07:00
David S. Miller f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00