redonkable/alistair23-linux

Author	SHA1	Message	Date
Jakub Kicinski	2295cddf99	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Minor conflicts in net/mptcp/protocol.h and tools/testing/selftests/net/Makefile. In both cases code was added on both sides in the same place so just keep both. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-15 12:43:21 -07:00
Jonathan Lemon	b2b8a92733	mlx4: handle non-napi callers to napi_poll netcons calls napi_poll with a budget of 0 to transmit packets. Handle this by: - skipping RX processing - do not try to recycle TX packets to the RX cache Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-10-12 14:02:32 -07:00
Tariq Toukan	aed4d4c663	net/mlx4_en: RX, Add a prefetch command for small L1_CACHE_BYTES A single cacheline might not contain the packet header for small L1_CACHE_BYTES values. Use net_prefetch() as it issues an additional prefetch in this case. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-08-26 15:55:53 -07:00
Gustavo A. R. Silva	5e619d73e6	net/mlx4: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-27 13:14:10 -07:00
Jesper Dangaard Brouer	d201ea9ebc	mlx4: Add XDP frame size and adjust max XDP MTU The mlx4 drivers size of memory backing the RX packet is stored in frag_stride. For XDP mode this will be PAGE_SIZE (normally 4096). For normal mode frag_stride is 2048. Also adjust MLX4_EN_MAX_XDP_MTU to take tailroom into account. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Link: https://lore.kernel.org/bpf/158945341893.97035.2688142527052329942.stgit@firesoul	2020-05-14 21:21:55 -07:00
Eric Dumazet	cf4058dbaa	net/mlx4_en: use napi_complete_done() in TX completion In order to benefit from the new napi_defer_hard_irqs feature, we need to use napi_complete_done() variant in this driver. RX path is already using it, this patch implements TX completion side. mlx4_en_process_tx_cq() now returns the amount of retired packets, instead of a boolean, so that mlx4_en_poll_tx_cq() can pass this value to napi_complete_done(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-04-23 12:43:20 -07:00
Wenwen Wang	48ec7014c5	net/mlx4_en: fix a memory leak bug In mlx4_en_config_rss_steer(), 'rss_map->indir_qp' is allocated through kzalloc(). After that, mlx4_qp_alloc() is invoked to configure RSS indirection. However, if mlx4_qp_alloc() fails, the allocated 'rss_map->indir_qp' is not deallocated, leading to a memory leak bug. To fix the above issue, add the 'qp_alloc_err' label to free 'rss_map->indir_qp'. Fixes: `4931c6ef04` ("net/mlx4_en: Optimized single ring steering") Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>	2019-08-13 19:42:43 -07:00
Saeed Mahameed	29dded89e8	net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames When an ethernet frame is padded to meet the minimum ethernet frame size, the padding octets are not covered by the hardware checksum. Fortunately the padding octets are usually zero's, which don't affect checksum. However, it is not guaranteed. For example, switches might choose to make other use of these octets. This repeatedly causes kernel hardware checksum fault. Prior to the cited commit below, skb checksum was forced to be CHECKSUM_NONE when padding is detected. After it, we need to keep skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP headers, it does not worth the effort as the packets are so small that CHECKSUM_COMPLETE has no significant advantage. Future work: when reporting checksum complete is not an option for IP non-TCP/UDP packets, we can actually fallback to report checksum unnecessary, by looking at cqe IPOK bit. Fixes: `88078d98d1` ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-02-12 12:26:21 -05:00
Eric Dumazet	4beaacc6fe	net/mlx4_en: remove fallback after kzalloc_node() kzalloc_node(..., GFP_KERNEL, node) will attempt to allocate memory as close as possible to the node. There is no need to fallback to kzalloc() if this has failed. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-12-17 11:18:02 -08:00
Michał Mirosław	4b17f9fe48	mlx4: use __vlan_hwaccel helpers Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-11-08 20:45:04 -08:00
Eric Dumazet	3aa8029e1a	net/mlx4_en: add a missing <net/ip.h> include Abdul Haleem reported a build error on ppc : drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct iphdr` declared inside parameter list [enabled by default] struct iphdr *iph) ^ drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function get_fixed_ipv4_csum: drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing pointer to incomplete type __u8 ipproto = iph->protocol; ^ Fixes: `55469bc6b5` ("drivers: net: remove <net/busy_poll.h> inclusion when not needed") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-30 11:18:59 -07:00
Eric Dumazet	55469bc6b5	drivers: net: remove <net/busy_poll.h> inclusion when not needed Drivers using generic NAPI interface no longer need to include <net/busy_poll.h>, since busy polling was moved to core networking stack long ago. See commit `79e7fff47b` ("net: remove support for per driver ndo_busy_poll()") for reference. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-10-25 16:20:02 -07:00
Gustavo A. R. Silva	c8581f2bb5	net/mlx4/en_rx: Mark expected switch fall-throughs In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 114794 ("Missing break in switch") Addresses-Coverity-ID: 114795 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-08-07 17:54:20 -07:00
Saeed Mahameed	432e629e56	net/mlx4_en: Don't reuse RX page when XDP is set When a new rx packet arrives, the rx path will decide whether to reuse the remainder of the page or not according to one of the below conditions: 1. frag_info->frag_stride == PAGE_SIZE / 2 2. frags->page_offset + frag_info->frag_size > PAGE_SIZE; The first condition is no met for when XDP is set. For XDP, page_offset is always set to priv->rx_headroom which is XDP_PACKET_HEADROOM and frag_info->frag_size is around mtu size + some padding, still the 2nd release condition will hold since XDP_PACKET_HEADROOM + 1536 < PAGE_SIZE, as a result the page will not be released and will be _wrongly_ reused for next free rx descriptor. In XDP there is an assumption to have a page per packet and reuse can break such assumption and might cause packet data corruptions. Fix this by adding an extra condition (!priv->rx_headroom) to the 2nd case to avoid page reuse when XDP is set, since rx_headroom is set to 0 for non XDP setup and set to XDP_PACKET_HEADROOM for XDP setup. No additional cache line is required for the new condition. Fixes: `34db548bfb` ("mlx4: add page recycling in receive path") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Suggested-by: Martin KaFai Lau <kafai@fb.com> CC: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-16 14:05:25 -07:00
Eric Dumazet	2d943adf86	net/mlx4_en: optimizes get_fixed_ipv6_csum() While trying to support CHECKSUM_COMPLETE for IPV6 fragments, I had to experiments various hacks in get_fixed_ipv6_csum(). I must admit I could not find how to implement this :/ However, get_fixed_ipv6_csum() does a lot of redundant operations, calling csum_partial() twice. First csum_partial() computes the checksum of saddr and daddr, put in @csum_pseudo_hdr. Undone later in the second csum_partial() computed on whole ipv6 header. Then nexthdr is added once, added a second time, then substracted. payload_len is added once, then substracted. Really all this can be reduced to two add_csum(), to add back 6 bytes that were removed by mlx4 when providing hw_checksum in RX descriptor. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-05-04 11:59:19 -04:00
Nikita V. Shirokov	e5e0a59b7c	bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as well (only "decrease" of pointer's location is going to be supported). changing of this pointer will change packet's size. for mlx4 driver we will just calculate packet's length unconditionally (the same way as it's already being done in mlx5) Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2018-04-18 23:34:16 +02:00
Eric Dumazet	ce5a453c44	net/mlx4_en: CHECKSUM_COMPLETE support for fragments Refine the RX check summing handling to propagate the hardware provided checksum so that we do not have to compute it later in software. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 13:56:37 -04:00
Eric Dumazet	d8c13f2271	net/mlx4_en: try to use high order pages for RX rings RX rings can fit most of the time in a contiguous piece of memory, so lets use kvzalloc_node/kvfree instead of vzalloc_node/vfree Note that kvzalloc_node() automatically falls back to another node, there is no need to do the fallback ourselves. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-07 13:35:23 -05:00
Tariq Toukan	a970d8dba5	net/mlx4_en: RX csum, pre-define enabled protocols for IP status masking Pre-define a mask for IP status of a completion, that tests the MLX4_CQE_STATUS_IPV6 only in case CONFIG_IPV6 is enabled. Use it for IP status testing upon completion, instead of separating the datapath into two flows. This takes common code structures (such as closing parenthesis) back to their original place, and makes code more readable. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-02-27 14:53:26 -05:00
Tariq Toukan	1cb8b1216c	net/mlx4_en: Combine checks of end-cases in RX completion function Combine two end-cases in the same if statement with a single return value. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-02-27 14:53:26 -05:00
Jesper Dangaard Brouer	ae75415de1	mlx4: setup xdp_rxq_info Driver hook points for xdp_rxq_info: * reg : mlx4_en_create_rx_ring * unreg: mlx4_en_destroy_rx_ring Tested on actual hardware. Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2018-01-05 15:21:21 -08:00
Tariq Toukan	dc484851ed	net/mlx4_en: RX csum, reorder branches Use early goto commands, and save else branches. This uses less indentations and brackets, making the code more readable. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 12:24:05 -05:00
Tariq Toukan	345ef18c24	net/mlx4_en: RX csum, remove redundant branches and checks Do not check IPv6 bit in cqe status if CONFIG_IPV6 is not enabled. Function check_csum() is reached only with IPv4 or IPv6 set (if enabled), if IPv6 is not set (or is not enabled) it is redundant to test the IPv4 bit. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-12-28 12:24:05 -05:00
Linus Torvalds	7c225c69f8	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - a few misc bits - ocfs2 updates - almost all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits) memory hotplug: fix comments when adding section mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP mm: simplify nodemask printing mm,oom_reaper: remove pointless kthread_run() error check mm/page_ext.c: check if page_ext is not prepared writeback: remove unused function parameter mm: do not rely on preempt_count in print_vma_addr mm, sparse: do not swamp log with huge vmemmap allocation failures mm/hmm: remove redundant variable align_end mm/list_lru.c: mark expected switch fall-through mm/shmem.c: mark expected switch fall-through mm/page_alloc.c: broken deferred calculation mm: don't warn about allocations which stall for too long fs: fuse: account fuse_inode slab memory as reclaimable mm, page_alloc: fix potential false positive in __zone_watermark_ok mm: mlock: remove lru_add_drain_all() mm, sysctl: make NUMA stats configurable shmem: convert shmem_init_inodecache() to void Unify migrate_pages and move_pages access checks mm, pagevec: rename pagevec drained field ...	2017-11-15 19:42:40 -08:00
Mel Gorman	453f85d43f	mm: remove __GFP_COLD As the page free path makes no distinction between cache hot and cold pages, there is no real useful ordering of pages in the free list that allocation requests can take advantage of. Juding from the users of __GFP_COLD, it is likely that a number of them are the result of copying other sites instead of actually measuring the impact. Remove the __GFP_COLD parameter which simplifies a number of paths in the page allocator. This is potentially controversial but bear in mind that the size of the per-cpu pagelists versus modern cache sizes means that the whole per-cpu list can often fit in the L3 cache. Hence, there is only a potential benefit for microbenchmarks that alloc/free pages in a tight loop. It's even worse when THP is taken into account which has little or no chance of getting a cache-hot page as the per-cpu list is bypassed and the zeroing of multiple pages will thrash the cache anyway. The truncate microbenchmarks are not shown as this patch affects the allocation path and not the free path. A page fault microbenchmark was tested but it showed no sigificant difference which is not surprising given that the __GFP_COLD branches are a miniscule percentage of the fault path. Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Tariq Toukan	5dad61b838	net/mlx4_en: Replace netdev parameter with priv in XDP xmit function The struct net_device parameter was passed only to extract struct mlx4_en_priv out of it. Here we pass the priv parameter directly. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-11 20:21:23 -07:00
Inbar Karmy	80a8dc75ee	net/mlx4_en: Increase number of default RX rings Remove limitation of netif_get_num_default_rss_queues() from logic of RX rings default number. Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-10-10 13:11:22 -07:00
Daniel Borkmann	de8f3a83b0	bpf: add meta pointer for direct access This work enables generic transfer of metadata from XDP into skb. The basic idea is that we can make use of the fact that the resulting skb must be linear and already comes with a larger headroom for supporting bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work on a similar principle and introduce a small helper bpf_xdp_adjust_meta() for adjusting a new pointer called xdp->data_meta. Thus, the packet has a flexible and programmable room for meta data, followed by the actual packet data. struct xdp_buff is therefore laid out that we first point to data_hard_start, then data_meta directly prepended to data followed by data_end marking the end of packet. bpf_xdp_adjust_head() takes into account whether we have meta data already prepended and if so, memmove()s this along with the given offset provided there's enough room. xdp->data_meta is optional and programs are not required to use it. The rationale is that when we process the packet in XDP (e.g. as DoS filter), we can push further meta data along with it for the XDP_PASS case, and give the guarantee that a clsact ingress BPF program on the same device can pick this up for further post-processing. Since we work with skb there, we can also set skb->mark, skb->priority or other skb meta data out of BPF, thus having this scratch space generic and programmable allows for more flexibility than defining a direct 1:1 transfer of potentially new XDP members into skb (it's also more efficient as we don't need to initialize/handle each of such new members). The facility also works together with GRO aggregation. The scratch space at the head of the packet can be multiple of 4 byte up to 32 byte large. Drivers not yet supporting xdp->data_meta can simply be set up with xdp->data_meta as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out, such that the subsequent match against xdp->data for later access is guaranteed to fail. The verifier treats xdp->data_meta/xdp->data the same way as we treat xdp->data/xdp->data_end pointer comparisons. The requirement for doing the compare against xdp->data is that it hasn't been modified from it's original address we got from ctx access. It may have a range marking already from prior successful xdp->data/xdp->data_end pointer comparisons though. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-09-26 13:36:44 -07:00
Linus Torvalds	aae3dbb477	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) Support ipv6 checksum offload in sunvnet driver, from Shannon Nelson. 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric Dumazet. 3) Allow generic XDP to work on virtual devices, from John Fastabend. 4) Add bpf device maps and XDP_REDIRECT, which can be used to build arbitrary switching frameworks using XDP. From John Fastabend. 5) Remove UFO offloads from the tree, gave us little other than bugs. 6) Remove the IPSEC flow cache, from Florian Westphal. 7) Support ipv6 route offload in mlxsw driver. 8) Support VF representors in bnxt_en, from Sathya Perla. 9) Add support for forward error correction modes to ethtool, from Vidya Sagar Ravipati. 10) Add time filter for packet scheduler action dumping, from Jamal Hadi Salim. 11) Extend the zerocopy sendmsg() used by virtio and tap to regular sockets via MSG_ZEROCOPY. From Willem de Bruijn. 12) Significantly rework value tracking in the BPF verifier, from Edward Cree. 13) Add new jump instructions to eBPF, from Daniel Borkmann. 14) Rework rtnetlink plumbing so that operations can be run without taking the RTNL semaphore. From Florian Westphal. 15) Support XDP in tap driver, from Jason Wang. 16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal. 17) Add Huawei hinic ethernet driver. 18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan Delalande. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits) i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq i40e: avoid NVM acquire deadlock during NVM update drivers: net: xgene: Remove return statement from void function drivers: net: xgene: Configure tx/rx delay for ACPI drivers: net: xgene: Read tx/rx delay for ACPI rocker: fix kcalloc parameter order rds: Fix non-atomic operation on shared flag variable net: sched: don't use GFP_KERNEL under spin lock vhost_net: correctly check tx avail during rx busy polling net: mdio-mux: add mdio_mux parameter to mdio_mux_init() rxrpc: Make service connection lookup always check for retry net: stmmac: Delete dead code for MDIO registration gianfar: Fix Tx flow control deactivation cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6 cxgb4: Fix pause frame count in t4_get_port_stats cxgb4: fix memory leak tun: rename generic_xdp to skb_xdp tun: reserve extra headroom only when XDP is set net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping net: dsa: bcm_sf2: Advertise number of egress queues ...	2017-09-06 14:45:08 -07:00
Linus Torvalds	aa9d4648c2	Updates for 4.14 kernel merge window - Lots of hfi1 driver updates (mixed with a few qib and core updates as well) - rxe updates - various mlx updates - Set default roce type to RoCEv2 - Several larger fixes for bnxt_re that were too big for -rc - Several larger fixes for qedr that, likewise, were too big for -rc - Misc core changes - Make the hns_roce driver compilable on arches other than aarch64 so we can more easily debug build issues related to it - Add rdma-netlink infrastructure updates - Add automatic IRQ affinity infrastructure - Add 32bit lid support - Lots of misc fixes across the subsystem from random people - Autoloading of RDMA netlink modules - PCI pool cleanups from Romain Perier - mlx5 driver feature additions and fixes - Hardware tag matchine feature - Fix sleeping in atomic when resolving roce ah - Add experimental ioctl interface as posted to linux-api@ -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZqBDtAAoJELgmozMOVy/dNlcQAJhYNRGaNUBx0L6+8t2xwUrt 7ndP6qlMar30DJY9FjTQCzRBw0CRMWkXdJD8rYlyaHy07pjWDKG8LZtxEXu1FLdZ oNRvQX6ZJh8Bz7db2SQFBCTF2uWGZZFqWQCrSbQwjj9xxjMDs59u/knmwHVY9dKk egjPG4IQBDmcTeNY7h1otG2hXpx7QPIOilQW2EFN5SWAuBAazdF2JKxjjxqhnUfp gD2pSdgsm3VSMoo0zpMa6qOP+9GcOu8J97fYFhasRYWCavPdWHyq+XNu9S/eicRd xbv+seCYM+9jPb2dsNdjEKll7w3yyWdu7h6tSCMPYv54eN9sDDiO1w2L2ZnESMZa JRnSfB+HXru1r4RyHOTPO8peaNhYlR1V4u8bTS5G2dffbHis9BajkWoAR/oSiUcB AIjIIDcdJFVGfpF9KIt/pEl+adHNgESibSijzOUYkyw6RNbPqDmdd7YakPHcQhKN clE3zQfIsPRLWsToP/nkBE0tUd3tQocRuLy7ote7hXQK+0p7TBz0a6Kkj87MvX33 8dVbUI+q6WRlEY90l71y0ZdXy/AvkxkFxAc4Y7FQZyJxhEArTaKgfa5fmpRwVxBm yi9baoYCspHNRNv6AO4IL86ZCJqmWBuch8CBY1n2X3h8IGfKYEZUAZ+T/mnTTeUq A4joXduz94ZD4w23leD1 =2ntC -----END PGP SIGNATURE----- Merge tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is a big pull request. Of note is that I'm sending you the new ioctl API for the rdma subsystem. We put it up on linux-api@, but didn't get much response. The API is complex, but it solves two different problems in one go: 1) The bi-directional nature of the RDMA file write calls, which created the security hole we had to handle (and for which the fix is now causing problems for systems in production, we were a bit over zealous in the fix and the ability to open a device, then fork, then create new queue pairs on the device and use them is broken). 2) The bloat caused by different vendors implementing extensions to the base verbs API. Each vendor's hardware is slightly different, and the hardware might be suitable for one extension but not another. By the time we add generic extensions for all the different ways that the different hardware can offload things, the API becomes bloated. Things like our completion structs have started to exceed a cache line in size because of all the elements needed to support this. That in turn shows up heavily in the performance graphs with a noticable drop in performance on 100Gigabit links as our completion structs go from occupying one cache line to 1+. This API makes things like the completion structs modular in a very similar way to netlink so that your structs can only include the items needed for the offloads/features you are actually using on a given queue pair. In that way we support everything, but only use what we need, and our structs stay smaller. The ioctl API is better explained by the posting on linux-api@ than I can explain it here, so I'll just leave it at that. The rest of the pull request is typical stuff. Updates for 4.14 kernel merge window - Lots of hfi1 driver updates (mixed with a few qib and core updates as well) - rxe updates - various mlx updates - Set default roce type to RoCEv2 - Several larger fixes for bnxt_re that were too big for -rc - Several larger fixes for qedr that, likewise, were too big for -rc - Misc core changes - Make the hns_roce driver compilable on arches other than aarch64 so we can more easily debug build issues related to it - Add rdma-netlink infrastructure updates - Add automatic IRQ affinity infrastructure - Add 32bit lid support - Lots of misc fixes across the subsystem from random people - Autoloading of RDMA netlink modules - PCI pool cleanups from Romain Perier - mlx5 driver feature additions and fixes - Hardware tag matchine feature - Fix sleeping in atomic when resolving roce ah - Add experimental ioctl interface as posted to linux-api@" * tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits) IB/core: Expose ioctl interface through experimental Kconfig IB/core: Assign root to all drivers IB/core: Add completion queue (cq) object actions IB/core: Add legacy driver's user-data IB/core: Export ioctl enum types to user-space IB/core: Explicitly destroy an object while keeping uobject IB/core: Add macros for declaring methods and attributes IB/core: Add uverbs merge trees functionality IB/core: Add DEVICE object and root tree structure IB/core: Declare an object instead of declaring only type attributes IB/core: Add new ioctl interface RDMA/vmw_pvrdma: Fix a signedness RDMA/vmw_pvrdma: Report network header type in WC IB/core: Add might_sleep() annotation to ib_init_ah_from_wc() IB/cm: Fix sleeping in atomic when RoCE is used IB/core: Add support to finalize objects in one transaction IB/core: Add a generic way to execute an operation on a uobject Documentation: Hardware tag matching IB/mlx5: Support IB_SRQT_TM net/mlx5: Add XRQ support ...	2017-09-03 17:49:17 -07:00
stephen hemminger	31975e27a4	mlx4: sizeof style usage The kernel coding style is to treat sizeof as a function (ie. with parenthesis) not as an operator. Also use kcalloc and kmalloc_array Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-16 11:01:57 -07:00
Davide Caratti	e718fe450e	net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets if the NIC fails to validate the checksum on TCP/UDP, and validation of IP checksum is successful, the driver subtracts the pseudo-header checksum from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails. V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set Reported-by: Shuang Li <shuali@redhat.com> Fixes: `f8c6455bb0` ("net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-08 17:59:57 -07:00
Moshe Shemesh	f330187016	(IB, net)/mlx4: Add resource utilization support Adding visibility of resource usage of QPs, CQs and counters used by virtual functions. This feature will be used to give the PF administrator more data while debugging VF status. Usage info was added to ALLOC_RES command, to notify the PF if the resource which is being reserved or allocated for the VF will be used by kernel driver or by user verbs. Updated reservation and allocation functions of QP, CQ and counter with additional usage parameter. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-07-24 10:41:35 -04:00
Leon Romanovsky	8900b894e7	{net, IB}/mlx4: Remove gfp flags argument The caller to the driver marks GFP_NOIO allocations with help of memalloc_noio-* calls now. This makes redundant to pass down to the driver gfp flags, which can be GFP_KERNEL only. The patch removes the gfp flags argument and updates all driver paths. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>	2017-07-17 21:21:24 -04:00
Tariq Toukan	6c78511b05	net/mlx4_en: Poll XDP TX completion queue in RX NAPI Instead of having their own NAPIs, XDP TX completion queues get polled within the corresponding RX NAPI. This prevents any possible race on TX ring prod/cons indices, between the context that issues the transmits (RX NAPI) and the context that handles the completions (was previously done in a separate NAPI). This also improves performance, as it decreases the number of NAPIs running on a CPU, saving the overhead of syncing and switching between the contexts. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Single queue no-RSS optimization ON. XDP_TX packet rate: ------------------------------------- \| Before \| After \| Gain \| IPv4 \| 12.0 Mpps \| 13.8 Mpps \| 15% \| IPv6 \| 12.0 Mpps \| 13.8 Mpps \| 15% \| ------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	36ea796498	net/mlx4_en: Improve XDP xmit function Several performance improvements in XDP TX datapath, including: - Ring a single doorbell for XDP TX ring per NAPI budget, instead of doing it per a lower threshold (was 8). This includes removing the flow of immediate doorbell ringing in case of a full TX ring. - Compiler branch predictor hints. - Calculate values in compile time rather than in runtime. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Single queue no-RSS optimization ON. XDP_TX packet rate: ------------------------------------- \| Before \| After \| Gain \| IPv4 \| 10.3 Mpps \| 12.0 Mpps \| 17% \| IPv6 \| 10.3 Mpps \| 12.0 Mpps \| 17% \| ------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Tariq Toukan	9bcee89ac4	net/mlx4_en: Improve receive data-path Several small performance improvements in RX datapath, including: - Compiler branch predictor hints. - Replace a multiplication with a shift operation. - Minimize variables scope. - Write-prefetch for packet header. - Avoid trinary-operator ("?") when value can be preset in a matching branch. - Save a branch by updating RX ring doorbell within mlx4_en_refill_rx_buffers(), which now returns void. Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Single queue no-RSS optimization ON (enable by ethtool -L <interface> rx 1). XDP_DROP packet rate: Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%). Drop packets in TC: ------------------------------------- \| Before \| After \| Gain \| IPv4 \| 4.14 Mpps \| 4.18 Mpps \| 1% \| ------------------------------------- XDP_TX packet rate: ------------------------------------- \| Before \| After \| Gain \| IPv4 \| 10.1 Mpps \| 10.3 Mpps \| 2% \| IPv6 \| 10.1 Mpps \| 10.3 Mpps \| 2% \| ------------------------------------- Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:23 -04:00
Saeed Mahameed	4931c6ef04	net/mlx4_en: Optimized single ring steering Avoid touching RX QP RSS context when loading with only one RX ring, to allow optimized A0 RX steering. Enable by: - loading mlx4_core with module param: log_num_mgm_entry_size = -6. - then: ethtool -L <interface> rx 1 Performance tests: Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz XDP_DROP packet rate: ------------------------------------- \| Before \| After \| Gain \| IPv4 \| 20.5 Mpps \| 28.1 Mpps \| 37% \| IPv6 \| 18.4 Mpps \| 28.1 Mpps \| 53% \| ------------------------------------- Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-15 22:53:22 -04:00
Kamal Heib	505a9249c2	net/mlx4_en: Change the error print to debug print The error print within mlx4_en_calc_rx_buf() should be a debug print. Fixes: `51151a16a6` ('mlx4: allow order-0 memory allocations in RX path') Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-09 11:22:46 -04:00
Eric Dumazet	68b8df4644	mlx4: remove duplicate code in mlx4_en_process_rx_cq() We should keep one way to build skbs, regardless of GRO being on or off. Note that I made sure to defer as much as possible the point we need to pull data from the frame, so that future prefetch() we might add are more effective. These skb attributes derive from the CQE or ring : ip_summed, csum hash vlan offload hwtstamps queue_mapping As a bonus, this patch removes mlx4 dependency on eth_get_headlen() which is very often broken enough to give us headaches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	6969cf0fdb	mlx4: make validate_loopback() more generic Testing a boolean in fast path is not worth duplicating the code allocating packets, when GRO is on or off. If this proves to be a problem, we might later use a jump label. Next patch will remove this duplicated code and ease code review. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	02e6fd3e55	mlx4: factorize page_address() calls We need to compute the frame virtual address at different points. Do it once. Following patch will use the new va address for validate_loopback() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	9e8c0395a7	mlx4: do not access rx_desc from mlx4_en_process_rx_cq() Instead of fetching dma address from rx_desc->data[0].addr, prefer using frags[0].dma + frags[0].page_offset to avoid a potential cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	7d7bfc6a3f	mlx4: add rx_alloc_pages counter in ethtool -S This new counter tracks number of pages that we allocated for one port. lpaa24:~# ethtool -S eth0 \| egrep 'rx_alloc_pages\|rx_packets' rx_packets: 306755183 rx_alloc_pages: 932897 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	34db548bfb	mlx4: add page recycling in receive path Same technique than some Intel drivers, for arches where PAGE_SIZE = 4096 In most cases, pages are reused because they were consumed before we could loop around the RX ring. This brings back performance, and is even better, a single TCP flow reaches 30Gbit on my hosts. v2: added full memset() in mlx4_en_free_frag(), as Tariq found it was needed if we switch to large MTU, as priv->log_rx_info can dynamically be changed. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	b5a54d9a31	mlx4: use order-0 pages for RX Use of order-3 pages is problematic in some cases. This patch might add three kinds of regression : 1) a CPU performance regression, but we will add later page recycling and performance should be back. 2) TCP receiver could grow its receive window slightly slower, because skb->len/skb->truesize ratio will decrease. This is mostly ok, we prefer being conservative to not risk OOM, and eventually tune TCP better in the future. This is consistent with other drivers using 2048 per ethernet frame. 3) Because we allocate one page per RX slot, we consume more memory for the ring buffers. XDP already had this constraint anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	60c7f5ae54	mlx4: removal of frag_sizes[] We will soon use order-0 pages, and frag truesize will more precisely match real sizes. In the new model, we prefer to use <= 2048 bytes fragments, so that we can use page-recycle technique on PAGE_SIZE=4096 arches. We will still pack as much frames as possible on arches with big pages, like PowerPC. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	acd7628de0	mlx4: reduce rx ring page_cache size We only need to store the page and dma address. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	d85f6c14e9	mlx4: rx_headroom is a per port attribute No need to duplicate it per RX queue / frags. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00
Eric Dumazet	aaca121dd6	mlx4: get rid of frag_prefix_size Using per frag storage for frag_prefix_size is really silly. mlx4_en_complete_rx_desc() has all needed info already. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-09 09:54:46 -08:00

1 2 3 4

177 commits