636545 Commits (dedecb6d429bd3311bb24ea1379b47c8471c88b9)

Author SHA1 Message Date
Joe Perches dedecb6d42 i40evf: Move some i40evf_reset_task code to separate function
The i40evf_reset_task function is a couple hundred lines and it has
a separable block that disables VF.  Move that block to a new
i40evf_disable_vf function to shorten i40evf_reset_task a bit.

Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 23:08:48 -08:00
Tushar Dave 2f7679ee2e i40e: fix panic on SPARC while changing num of desc
On SPARC, writel() should not be used to write directly to a memory
address but only to a memory mapped I/O address; otherwise it causes a
data access exception.

Commit 147e81ec75 ("i40e: Test memory before ethtool alloc
succeeds") introduced code that uses a memory address to fake the HW
tail address. Attempting to write to that address using writel()
causes a kernel panic on SPARC. The issue is reproduced while changing
the number of descriptors using ethtool.

This change resolves the panic by using a HW read-only memory mapped
I/O register to fake the HW tail address instead of a memory address.

e.g.
> ethtool -G eth2 tx 2048 rx 2048
i40e 0000:03:00.2 eth2: Changing Tx descriptor count from 512 to 2048.
i40e 0000:03:00.2 eth2: Changing Rx descriptor count from 512 to 2048
sun4v_data_access_exception: ADDR[fff8001f9734a000] CTX[0000]
TYPE[0004], going.
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
ethtool(3273): Dax [#1]
CPU: 9 PID: 3273 Comm: ethtool Tainted: G            E
4.8.0-linux-net_temp+ #7
task: fff8001f96d7a660 task.stack: fff8001f97348000
TSTATE: 0000009911001601 TPC: 00000000103189e4 TNPC: 00000000103189e8 Y:
00000000    Tainted: G            E
TPC: <i40e_alloc_rx_buffers+0x124/0x260 [i40e]>
g0: fff8001f4eb64000 g1: 00000000000007ff g2: fff8001f9734b92c g3:
00203e0000000000
g4: fff8001f96d7a660 g5: fff8001fa6704000 g6: fff8001f97348000 g7:
0000000000000001
o0: 0006000046706928 o1: 00000000db3e2000 o2: fff8001f00000000 o3:
0000000000002000
o4: 0000000000002000 o5: 0000000000000001 sp: fff8001f9734afc1 ret_pc:
0000000010318a64
RPC: <i40e_alloc_rx_buffers+0x1a4/0x260 [i40e]>
l0: fff8001f4e8bffe0 l1: fff8001f4e8cffe0 l2: 00000000000007ff l3:
00000000ff000000
l4: 0000000000ff0000 l5: 000000000000ff00 l6: 0000000000cda6a8 l7:
0000000000e822f0
i0: fff8001f96380000 i1: 0000000000000000 i2: 00203edb00000000 i3:
0006000046706928
i4: 0000000002086320 i5: 0000000000e82370 i6: fff8001f9734b071 i7:
00000000103062d4
I7: <i40e_set_ringparam+0x3b4/0x540 [i40e]>
Call Trace:
 [00000000103062d4] i40e_set_ringparam+0x3b4/0x540 [i40e]
 [000000000094e2f8] dev_ethtool+0x898/0xbe0
 [0000000000965570] dev_ioctl+0x250/0x300
 [0000000000923800] sock_do_ioctl+0x40/0x60
 [000000000092427c] sock_ioctl+0x7c/0x280
 [00000000005ef040] vfs_ioctl+0x20/0x60
 [00000000005ef5d4] do_vfs_ioctl+0x194/0x4c0
 [00000000005ef974] SyS_ioctl+0x74/0xa0
 [0000000000406214] linux_sparc_syscall+0x34/0x44
Disabling lock debugging due to kernel taint
Caller[00000000103062d4]: i40e_set_ringparam+0x3b4/0x540 [i40e]
Caller[000000000094e2f8]: dev_ethtool+0x898/0xbe0
Caller[0000000000965570]: dev_ioctl+0x250/0x300
Caller[0000000000923800]: sock_do_ioctl+0x40/0x60
Caller[000000000092427c]: sock_ioctl+0x7c/0x280
Caller[00000000005ef040]: vfs_ioctl+0x20/0x60
Caller[00000000005ef5d4]: do_vfs_ioctl+0x194/0x4c0
Caller[00000000005ef974]: SyS_ioctl+0x74/0xa0
Caller[0000000000406214]: linux_sparc_syscall+0x34/0x44
Caller[0000000000107154]: 0x107154
Instruction DUMP: e43620c8
 e436204a  c45e2038
<c2a083a0> 82102000
 81cfe008  90086001
 82102000  81cfe008

Kernel panic - not syncing: Fatal exception

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 23:06:40 -08:00
Piotr Raczynski 64f5ead95a i40e: Add protocols over MCTP to i40e_aq_discover_capabilities
Add logical_id to I40E_AQ_CAP_ID_MNG_MODE capability starting from major
version 2.

Change-ID: Idb29214b172ea5c70cbd45a99e6745c0215af7e4
Signed-off-by: Piotr Raczynski <piotr.raczynski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:59:04 -08:00
Jacob Keller 0b7c8b5d54 i40e: fix trivial typo in naming of i40e_sync_filters_subtask
A comment incorrectly referred to i40e_vsi_sync_filters_subtask which
does not actually exist. Reference the correct function instead.

Change-ID: I6bd805c605741ffb6fe34377259bb0d597edfafd
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:56:29 -08:00
Michal Kosiarz 91dc1e5d3d i40e: Add Clause22 implementation
Some external PHYs require Clause22 method for accessing registers.
This patch also adds some defines to support blink led on devices using
10CBaseT PHY.

Change-ID: I868a4326911900f6c89e7e522fda4968b0825f14
Signed-off-by: Michal Kosiarz <michal.kosiarz@intel.com>
Signed-off-by: Matt Jared <matthew.a.jared@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:49:39 -08:00
Jacob Keller d182a5ca1f i40e: avoid duplicate private flags definitions
Separate the global private flags and the regular private flags per
interface into two arrays. Future additions of private flags will no
longer need to be duplicated, since duplication may lead to buggy code.
Also rename "i40e_priv_flags_strings_gl" to "i40e_gl_priv_flags_strings"
for clarity, as it reads more naturally.

Change-ID: I68caef3c9954eb7da342d7f9d20f2873186f2758
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:40:58 -08:00
Jacob Keller 6a112785fd i40e: remove second check of VLAN_N_VID in i40e_vlan_rx_add_vid
Replace a check of magic number 4095 with VLAN_N_VID. This
makes it obvious that a later check against VLAN_N_VID is
always true and can be removed.
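
The reasoning above can be sketched as a small Python model (not the
driver's C code). It assumes VLAN_N_VID is 4096, as defined in the kernel's
if_vlan.h; the function name and return values are hypothetical.

```python
VLAN_N_VID = 4096  # number of possible VLAN IDs (0..4095); assumed kernel value


def vlan_rx_add_vid(vid: int) -> int:
    """Hypothetical model of the validation flow, not the driver's code."""
    # The old form used the magic number: `if vid > 4095`.
    # For integers that is the same test as `vid >= VLAN_N_VID`.
    if vid >= VLAN_N_VID:
        return -22  # -EINVAL

    # Any vid reaching this point already satisfies vid < VLAN_N_VID, so a
    # second `if vid < VLAN_N_VID:` guard here would always be true.
    return 0
```

Writing the first check in terms of VLAN_N_VID is what makes the redundancy
of the second check visible at a glance.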

Change-ID: I28998f127a61a529480ce63d8a07e266f6c63b7b
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:38:47 -08:00
Jacob Keller 7429c0bd01 i40e: remove error_param_int label from i40e_vc_config_promiscuous_mode_msg
This label is unnecessary, as we are jumping to a block that checks aq_ret
and then immediately skipping it and returning. So just jump straight to
error_param and remove this unnecessary label.

Also use goto error_param even in the last check for style consistency.

Change-ID: If487c7d10c4048e37c594e5eca167693aaed45f6
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:30:44 -08:00
Alexander Duyck 06fc016c43 i40evf: Be much more verbose about what we can and cannot offload
This change makes it so that we are much more robust about defining what we
can and cannot offload.  Previously we were performing no checks.  This
should bring us up to parity with the i40e PF driver.

In addition the device only supports GSO as long as the MSS is 64 or
greater.  We were not checking this so an MSS less than that was resulting
in Tx hangs.

Change-ID: If533553ec92fc6ba694eab6ac81fdaf3004f3592
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2016-12-02 22:20:48 -08:00
Alexander Duyck f114dca253 i40e: Be much more verbose about what we can and cannot offload
This change makes it so that we are much more robust about defining what we
can and cannot offload.  Previously we were just checking for the L4 tunnel
header length, however there are other fields we should be verifying as
there are multiple scenarios in which we cannot perform hardware offloads.

In addition the device only supports GSO as long as the MSS is 64 or
greater.  We were not checking this so an MSS less than that was resulting
in Tx hangs.
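
A minimal sketch of the MSS check described above, as a Python model rather
than driver code. Only the threshold of 64 comes from the commit message;
the feature names and the function name are illustrative assumptions.

```python
MIN_GSO_MSS = 64  # smallest MSS the device can segment, per the commit message

# Illustrative feature names, not the kernel's NETIF_F_* flags.
GSO_FEATURES = frozenset({"tso", "tso6", "gso_udp_tunnel"})


def features_check(gso_size: int, features: frozenset) -> frozenset:
    """Return the offloads that may be used for one packet."""
    # gso_size == 0 means the packet does not request segmentation.
    if 0 < gso_size < MIN_GSO_MSS:
        # Fall back to software segmentation instead of hanging the Tx ring.
        return features - GSO_FEATURES
    return features
```

Dropping only the GSO-related features leaves other offloads, such as
checksumming, available for the same packet.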

Change-ID: I5e2fd5f3075c73601b4b36327b771c64fcb6c31b
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
2016-12-02 22:19:03 -08:00
David S. Miller ab17cb1fea wireless-drivers-next patches for 4.10

Merge tag 'wireless-drivers-next-for-davem-2016-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.10

Major changes:

rsi

* filter rx frames
* configure tx power
* make it possible to select antenna
* support 802.11d

brcmfmac

* cleanup of scheduled scan code
* support for bcm43341 chipset with different chip id
* support rev6 of PCIe device interface

ath10k

* add spectral scan support for QCA6174 and QCA9377 families
* show used tx bitrate with 10.4 firmware

wil6210

* add power save mode support
* add abort scan functionality
* add support settings retry limit for short frames

bcma

* add Dell Inspiron 3148
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:58:10 -05:00
David S. Miller 4f4f907a67 Merge branch 'mvneta-64bit'
Gregory CLEMENT says:

====================
Support Armada 37xx SoC (ARMv8 64-bits) in mvneta driver

The Armada 37xx is a new ARMv8 SoC from Marvell using same network
controller as the older Armada 370/38x/XP SoCs. This series adapts the
driver in order to be able to use it on this new SoC. The main changes
are:

- 64-bits support: the first patches allow using the driver on a 64-bit
  architecture.

- MBUS support: the mbus configuration is different on Armada 37xx
  from the older SoCs.

- per-CPU interrupt: the Armada 37xx does not support per-CPU interrupts
  for the NETA IP, so the non-per-CPU behavior was added back.

The first patch is an optimization in the rx path in swbm mode.
The second patch removes unnecessary allocation for HWBM.
The first item is solved by patches 4 and 5.
The 2 last items are solved by patch 6.
In patch 7 the dt support is added.

Besides the Armada 37xx, this series has again been tested on Armada XP
and Armada 38x (with Hardware Buffer Management and with Software
Buffer Management).

This is the 6th version of the series:
- 1st version:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/469588.html

- 2nd version:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470476.html

- 3rd version:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/470901.html

- 4th version:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/471039.html

- 5th version:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-November/471478.html

Changelog:
v5 -> v6:
 - Added Tested-by from  Marcin Wojtas on the series
 - Added Reviewed-by from Jisheng Zhang on patch 3
 - Fix eth1 phy mode for Armada 3720 DB board on patch 7

v4 -> v5:
 - remove unnecessary cast in patch 3

v3 -> v4:
 - Adding new patch: "net: mvneta: do not allocate buffer in rxq init
   with HWBM"

 - Simplify the HWBM case in patch 3 as suggested by Marcin

v2 -> v3:
 - Adding patch 1 "Optimize rx path for small frame"

 - Fix the kbuild error by moving the "phys_addr += pp->rx_offset_correction;"
  line from patch 2 to patch 3 where rx_offset_correction is introduced.

 - Move the memory allocation of the buf_virt_addr of the rxq to be
   called by the probe function in order to avoid a memory leak.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:02 -05:00
Gregory CLEMENT ea7ae8854a ARM64: dts: marvell: Add network support for Armada 3700
Add neta nodes for network support both in device tree for the SoC and
the board.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:01 -05:00
Marcin Wojtas 2636ac3cc2 net: mvneta: Add network support for Armada 3700 SoC
Armada 3700 is a new ARMv8 SoC from Marvell using same network controller
as older Armada 370/38x/XP. There are however some differences that
needed taking into account when adding support for it:

* open default MBUS window to 4GB of DRAM - Armada 3700 SoC's Mbus
  configuration for network controller has to be done on two levels:
  global and per-port. The first one is inherited from the
  bootloader. The latter can be opened in a default way, leaving
  arbitration to the bus controller.  Hence filled mbus_dram_target_info
  structure is not needed

* make per-CPU operation optional - Recent patches adding RSS and XPS
  support for Armada 38x/XP enabled per-CPU operation of the controller
  by default. Contrary to older SoCs, the Armada 3700 SoC's network
  controller is not capable of per-CPU processing due to interrupt lines'
  connectivity.  This patch restores non-per-CPU operation, which is now
  optional and depends on neta_armada3700 flag value in mvneta_port
  structure. In order not to complicate the code, separate interrupt
  subroutine is implemented.

For now, on the Armada 3700, RSS is disabled as the current
implementation depends on the per-CPU interrupts.

[gregory.clement@free-electrons.com: extract from a larger patch, replace
some ifdef and port to net-next for v4.10]

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:01 -05:00
Gregory CLEMENT f34dacccb4 net: mvneta: Only disable mvneta_bm for 64-bits
Actually only the mvneta_bm support is not 64-bits compatible.
The mvneta code itself can run on 64-bits architecture.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:01 -05:00
Marcin Wojtas 8d5047cf9c net: mvneta: Convert to be 64 bits compatible
Prepare the mvneta driver in order to be usable on the 64 bits platform
such as the Armada 3700.

[gregory.clement@free-electrons.com]: this patch was extracted from a larger
one to ease review and maintenance.

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:00 -05:00
Gregory CLEMENT f88bee1c4b net: mvneta: Use cacheable memory to store the rx buffer virtual address
Until now the virtual address of the received buffer was stored in the
cookie field of the rx descriptor. However, this field is only 32 bits
wide, which prevents using the driver on a 64-bit architecture.

With this patch the virtual address is stored in an array not shared with
the hardware (no more need to use the DMA API). Because of this, accesses
to it can be cached, unlike accesses to the rx descriptor member.

The change is done in the swbm path only because the hwbm uses the cookie
field; this also means that currently the hwbm is not usable in 64-bit.
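
The limitation of the 32-bit cookie can be illustrated with a small Python
model. This is a sketch under stated assumptions: the class and method names
are hypothetical, and only the buf_virt_addr name is taken from the series
changelog.

```python
COOKIE_MASK = 0xFFFFFFFF  # the descriptor cookie field holds only 32 bits


def store_in_cookie(virt_addr: int) -> int:
    """Model of writing an address into the 32-bit cookie field."""
    return virt_addr & COOKIE_MASK  # upper 32 bits are lost on 64-bit


class RxQueue:
    """Hypothetical rx queue keeping addresses in a driver-private array."""

    def __init__(self, size: int):
        # One slot per rx descriptor; never seen by the hardware.
        self.buf_virt_addr = [0] * size

    def set_buf(self, index: int, virt_addr: int) -> None:
        self.buf_virt_addr[index] = virt_addr

    def get_buf(self, index: int) -> int:
        return self.buf_virt_addr[index]
```

With a 64-bit address such as 0xFFFF800012345000, the cookie keeps only
0x12345000, while the side array returns the full value.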

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Reviewed-by: Jisheng Zhang <jszhang@marvell.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:00 -05:00
Gregory CLEMENT e9f6499965 net: mvneta: Do not allocate buffer in rxq init with HWBM
For HWBM all buffers are allocated in mvneta_bm_construct() and in runtime
they are put into descriptors by hardware. There is no need to fill them
at this point.

Suggested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:00 -05:00
Gregory CLEMENT ac83b7ddf2 net: mvneta: Optimize rx path for small frame
For small frame reuse the phys_addr variable instead of accessing the
uncacheable value in the rx descriptor.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:52:00 -05:00
David S. Miller b5b5eca9aa Merge branch 'bpf-support-for-sockets'
David Ahern says:

====================
net: Add bpf support for sockets

The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup to provide a capability to set sk_bound_dev_if for all
AF_INET{6} sockets opened by a process in a cgroup when the sockets
are allocated.

For example:
 1. configure vrf (e.g., using ifupdown2)
        auto eth0
        iface eth0 inet dhcp
            vrf mgmt

        auto mgmt
        iface mgmt
            vrf-table auto

 2. configure cgroup
        mount -t cgroup2 none /tmp/cgroupv2
        mkdir /tmp/cgroupv2/mgmt
        test_cgrp2_sock /tmp/cgroupv2/mgmt 15

 3. set shell into cgroup (e.g., can be done at login using pam)
        echo $$ >> /tmp/cgroupv2/mgmt/cgroup.procs

At this point all commands run in the shell (e.g, apt) have sockets
automatically bound to the VRF (see output of ss -ap 'dev == <vrf>'),
including processes not running as root.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

This patchset also exports the socket family, type and protocol as
read-only allowing bpf filters to deny a process in a cgroup the ability
to open specific types of AF_INET or AF_INET6 sockets.
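
The kind of decision such a filter makes can be sketched in Python. This is
only a model of the policy, not a bpf program; the function names are
hypothetical and the deny list is an example.

```python
import socket


def make_sock_filter(denied):
    """Build a predicate over the three read-only socket fields."""
    denied = frozenset(denied)

    def sock_filter(family: int, sock_type: int, protocol: int) -> bool:
        # True means the socket may be created, False means it is refused.
        return (family, sock_type, protocol) not in denied

    return sock_filter


# Example policy: processes in the cgroup may not open raw ICMP sockets.
no_ping = make_sock_filter(
    {(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)}
)
```

Because family, type and protocol are exported read-only, a filter of this
shape can allow or refuse a socket but cannot rewrite those fields.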

v7
- comments from Alexei

v6
- add export of socket family, type and protocol
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:21 -05:00
David Ahern 554ae6e792 samples/bpf: add userspace example for prohibiting sockets
Add examples preventing a process in a cgroup from opening a socket
based on family, protocol and type.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:09 -05:00
David Ahern 4f2e7ae56e samples/bpf: Update bpf loader for cgroup section names
Add support for section names starting with cgroup/skb and cgroup/sock.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:09 -05:00
David Ahern aa4c1037a3 bpf: Add support for reading socket family, type, protocol
Add socket family, type and protocol to bpf_sock allowing bpf programs
read-only access.

Add __sk_flags_offset[0] to struct sock before the bitfield to
programmatically determine the offset of the unsigned int containing
protocol and type.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:09 -05:00
David Ahern ad2805dc79 samples: bpf: add userspace example for modifying sk_bound_dev_if
Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:08 -05:00
David Ahern 6102365876 bpf: Add new cgroup attach type to enable sock modifications
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:46:08 -05:00
David Ahern b2cd12574a bpf: Refactor cgroups code in prep for new type
Code move and rename only; no functional change intended.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:44:56 -05:00
Eric Dumazet 7f7bf1606f mlx4: fix use-after-free in mlx4_en_fold_software_stats()
My recent commit to get more precise rx/tx counters in ndo_get_stats64()
can lead to crashes at device dismantle, as Jesper found out.

We must prevent mlx4_en_fold_software_stats() trying to access
tx/rx rings if they are deleted.

Fix this by adding a test against priv->port_up in
mlx4_en_fold_software_stats()

Calling mlx4_en_fold_software_stats() from mlx4_en_stop_port()
allows us to eventually broadcast the latest/current counters to
rtnetlink monitors.

Fixes: 40931b8511 ("mlx4: give precise rx/tx bytes/packets counters")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-and-bisected-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@dev.mellanox.co.il>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:33:32 -05:00
Sunil Goutham bd3ad7d3a1 net: thunderx: Fix transmit queue timeout issue
The transmit queue timeout issue is seen in two cases:
- Due to a race condition between setting stop_queue at xmit()
  and checking for stopped_queue in the NAPI poll routine, at times
  transmission from a SQ comes to a halt. This is fixed
  by using barriers and by adding a check for SQ free descriptors,
  in case the SQ is stopped and there are only CQE_RX, i.e. no CQE_TX.
- Contrary to an assumption, a HW erratum where HW doesn't stop transmission
  even though there are not enough CQEs available for a CQE_TX is
  not fixed in T88 pass 2.x. This results in a Qset error with
  'CQ_WR_FULL' stalling transmission. This is fixed by adjusting
  RXQ's RED levels for CQ level such that there is always enough
  space left for CQE_TXs.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:32:59 -05:00
David S. Miller 9aac3c1879 Merge branch 'offloading-tc-rules-hw'
Hadar Hen Zion says:

====================
Offloading tc rules using underlying Hardware device

This series adds flower classifier support for offloading tc rules when the
Software ingress device is different from the Hardware ingress device,
such as when dealing with IP tunnels.

The first two patches are small fixes to flower, checking that the skip_hw
flag isn't set before calling the Hardware offloading functions which will
try to offload the rule.

The next two patches are infrastructure patches, a preparation for the fourth
patch which adds support in flower for offloading rules when the ingress
device is not a Hardware device and therefore can't offload.
In this case ndo_setup_tc is called with the mirred (egress) device.

The last three patches add mlx5e support for offloading rules using the new
"egress_device" flag.

Thanks,
Hadar

Changes from v0:
- check if CONFIG_NET_CLS_ACT is defined before calling tc_action_ops get_dev()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:38 -05:00
Hadar Hen Zion ebe06875ff net/mlx5e: Support adding ingress tc rule when egress device flag is set
When ndo_setup_tc is called with an egress_dev flag set, it means that
the ndo call was executed on the mirred action (egress) device and not
on the ingress device.

In order to support this kind of ndo_setup_tc call, and insert the
correct decap rule to the hardware, the uplink device on the same eswitch
should be found.

Currently, we use this resolution between the mirred device and the
uplink on the same eswitch to offload vxlan shared device decap rules.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:38 -05:00
Hadar Hen Zion 726293f1f8 net/mlx5e: Save the represntor netdevice as part of the representor
Replace the representor private data with a net_device pointer holding the
representor netdevice, instead of a void pointer holding mlx5e_priv.

It will be used by a new eswitch service function, returning the uplink
representor netdevice.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:37 -05:00
Hadar Hen Zion 718f13e72b net/mlx5e: Bring back representor's ndos that were accidentally removed
The VF Representor udp tunnel ndo entries were removed by mistake;
restore them.

Fixes: 370bad0f9a ('net/mlx5e: Support HW (offloaded) and SW counters for SRIOV switchdev mode')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:37 -05:00
Hadar Hen Zion 7091d8c705 net/sched: cls_flower: Add offload support using egress Hardware device
In order to support hardware offloading when the device given by the tc
rule is different from the underlying Hardware device, extract the mirred
(egress) device from the tc action when a filter is added, using the new
tc_action_ops, get_dev().

Flower caches the information about the mirred device and uses it for
calling ndo_setup_tc in filter change, update stats and delete.

Calling ndo_setup_tc of the mirred (egress) device instead of the
ingress device will allow a resolution between the software ingress
device and the underlying hardware device.

The resolution will take place inside the offloading driver using
'egress_device' flag added to tc_to_netdev struct which is provided to
the offloading driver.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:37 -05:00
Hadar Hen Zion 255cb30425 net/sched: act_mirred: Add new tc_action_ops get_dev()
Add support for a new tc_action_ops callback.
get_dev is a general option which allows getting the underlying
device when trying to offload a tc rule.

In case of mirred action the returned device is the mirred (egress)
device.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Hadar Hen Zion 3036dab670 net/sched: cls_flower: Provide a filter to replace/destroy hardware filter functions
Instead of providing many arguments to fl_hw_{replace/destroy}_filter
functions, just provide cls_fl_filter struct that includes all the relevant
args.

This patch doesn't add any new functionality.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Hadar Hen Zion 796852197c net/sched: cls_flower: Try to offload only if skip_hw flag isn't set
Check skip_hw flag isn't set before calling
fl_hw_{replace/destroy}_filter and fl_hw_update_stats functions.

Replace the call to tc_should_offload with tc_can_offload.
tc_can_offload only checks if the device supports offloading, the check for
skip_hw flag is done earlier in the flow.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Hadar Hen Zion 55330f0596 net/sched: Add separate check for skip_hw flag
Differentiate between two possible cases:
1. Not offloading tc rule since the user sets 'skip_hw' flag.
2. Not offloading tc rule since the device doesn't support offloading.
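
The two cases can be modeled as two independent predicates. This Python
sketch is not the kernel code: the flag value is assumed to match the
kernel's TCA_CLS_FLAGS_SKIP_HW, and the device parameters and the
offload_decision helper are hypothetical.

```python
TCA_CLS_FLAGS_SKIP_HW = 1 << 0  # assumed to match the kernel uapi value


def tc_skip_hw(flags: int) -> bool:
    """Case 1: the user asked not to offload this rule."""
    return bool(flags & TCA_CLS_FLAGS_SKIP_HW)


def tc_can_offload(dev_has_hw_tc: bool, dev_has_setup_tc: bool) -> bool:
    """Case 2: the device itself is able to offload tc rules."""
    return dev_has_hw_tc and dev_has_setup_tc


def offload_decision(flags: int, dev_has_hw_tc: bool, dev_has_setup_tc: bool) -> str:
    # Check the user's request first, then the device capability.
    if tc_skip_hw(flags):
        return "skipped: skip_hw set by user"
    if not tc_can_offload(dev_has_hw_tc, dev_has_setup_tc):
        return "skipped: device cannot offload"
    return "offload"
```

Keeping the predicates separate lets a caller report which of the two
reasons prevented offloading.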

This patch doesn't add any new functionality.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 13:28:36 -05:00
Florian Westphal 25429d7b7d tcp: allow to turn tcp timestamp randomization off
Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
flows, we can deduct the relative qdisc delays (think of fq pacing).
This should work even if we have one flow per remote peer."

Having random per flow (or host) offsets doesn't allow that anymore so add
a way to turn this off.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:49:59 -05:00
Florian Westphal 95a22caee3 tcp: randomize tcp timestamp offsets for each connection
Jiffies-based timestamps allow for easy inference of the number of devices
behind NAT translators and also make tracking of hosts simpler.

commit ceaa1fef65 ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).

So only two items are left:
 - add a tsoffset for request sockets
 - extend the tcp isn generator to also return another 32bit number
   in addition to the ISN.

Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.
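
The idea can be sketched in Python as follows. This is a simplified model
under stated assumptions: HMAC-SHA256 stands in for the kernel's keyed hash,
the secret is a placeholder, the function names are hypothetical, and the
clock component that a real ISN generator adds is omitted.

```python
import hashlib
import hmac
import struct

SECRET = b"placeholder-secret"  # a real stack would use a random per-boot key


def isn_and_ts_offset(saddr, daddr, sport, dport, secret=SECRET):
    """Derive two 32-bit values from one keyed hash of the connection tuple."""
    msg = f"{saddr}|{daddr}|{sport}|{dport}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    isn, ts_off = struct.unpack_from("!II", digest)  # first 8 bytes
    return isn, ts_off


def wire_tsval(clock_ticks: int, ts_off: int) -> int:
    """Timestamp as sent on the wire: local clock plus the per-flow offset."""
    return (clock_ticks + ts_off) & 0xFFFFFFFF
```

Since the offset is a fixed function of the connection quadruple, timestamps
for one quadruple still advance with the local clock (modulo 2^32), which is
the property PAWS relies on.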

Includes fixes from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:49:59 -05:00
David S. Miller 7df5358d47 Merge branch 'qed-iscsi'
Manish Rangankar says:

====================
Add QLogic FastLinQ iSCSI (qedi) driver.

This series introduces hardware offload iSCSI initiator driver for the
41000 Series Converged Network Adapters (579xx chip) by QLogic. The overall
driver design includes a common module ('qed') and protocol specific
dependent modules ('qedi' for iSCSI).

This is an open iSCSI driver, modifications to open iSCSI user components
'iscsid', 'iscsiuio', etc. are required for the solution to work. The user
space changes are also in the process of being submitted.

    https://groups.google.com/forum/#!forum/open-iscsi

The 'qed' common module, under drivers/net/ethernet/qlogic/qed/, is
enhanced with functionality required for the iSCSI support. This series
is based on:

    net tree base: Merge of net and net-next as of 11/29/2016

Changes from RFC v2:

  1. qedi patches are squashed into single patch to prevent krobot
     warning.
  2. Fixed 'hw_p_cpuq' incompatible pointer type.
  3. Fixed sparse incompatible types in comparison expression.
  4. Misc fixes with latest 'checkpatch --strict' option.
  5. Remove int_mode option from MODULE_PARAM.
  6. Prefix all MODULE_PARAM params with qedi_*.
  7. Use CONFIG_QED_ISCSI instead of CONFIG_QEDI
  8. Added bad task mem access fix.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:44:38 -05:00
Yuval Mintz 1d6cff4fca qed: Add iSCSI out of order packet handling.
This patch adds out of order packet handling for hardware offloaded
iSCSI. Out of order packet handling requires driver buffer allocation
and assistance.

Signed-off-by: Arun Easi <arun.easi@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:44:38 -05:00
Yuval Mintz fc831825f9 qed: Add support for hardware offloaded iSCSI.
This adds the backbone required for the various HW initializations
which are necessary for the iSCSI driver (qedi) for QLogic FastLinQ
4xxxx line of adapters - FW notification, resource initializations, etc.

Signed-off-by: Arun Easi <arun.easi@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:44:37 -05:00
Rasmus Villemoes b14945ac3e net: atarilance: use %8ph for printing hex string
This is already using the %pM printf extension; might as well also use
%ph to make the code smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:03:35 -05:00
Arnd Bergmann d709b2a186 net/mlx5e: skip loopback selftest with !CONFIG_INET
When CONFIG_INET is disabled, the new selftest results in a link
error:

drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function `mlx5e_test_loopback':
en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to `ip_send_check'
en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to `udp4_hwcsum'

This hides the specific test in that configuration.

Fixes: 0952da791c ("net/mlx5e: Add support for loopback selftest")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 11:55:57 -05:00
Daniel Borkmann 366cbf2f46 bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller
After 326fe02d1e ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.

Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.

mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1e ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.

Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 11:06:24 -05:00
Soheil Hassas Yeganeh 83a1a1a70e sock: reset sk_err for ICMP packets read from error queue
Only when ICMP packets are enqueued onto the error queue,
sk_err is also set. Before f5f99309fa (sock: do not set sk_err
in sock_dequeue_err_skb), a subsequent error queue read
would set sk_err to the next error on the queue, or 0 if empty.
As no error types other than ICMP set this field, sk_err should
not be modified upon dequeuing them.

Only for ICMP errors, reset the (racy) sk_err. Some applications,
like traceroute, rely on it and go into a futile busy POLLERR
loop otherwise.

In principle, sk_err has to be set while an ICMP error is queued.
Testing is_icmp_err_skb(skb_next) approximates this without
requiring a full queue walk. Applications that receive both ICMP
and other errors cannot rely on this legacy behavior, as other
errors do not set sk_err in the first place.
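
The rule described above can be modelled in user space roughly as
follows. This is a simplified sketch, not the kernel function: the
err_entry and err_queue types and dequeue_err() are hypothetical, and the
error queue is reduced to a plain array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued error. */
struct err_entry {
	bool is_icmp;	/* true if the error originated from ICMP */
	int err;	/* errno value carried by the entry */
};

/* Hypothetical model of a socket's error queue plus its sk_err field. */
struct err_queue {
	const struct err_entry *items;
	size_t head, len;
	int sk_err;
};

/* Dequeue one entry. Only the entry that follows is inspected:
 *  - next entry is an ICMP error: sk_err takes that entry's errno
 *  - otherwise, if the dequeued entry was ICMP: sk_err is reset to 0
 *  - otherwise: sk_err is left untouched
 */
static const struct err_entry *dequeue_err(struct err_queue *q)
{
	const struct err_entry *cur, *next;

	if (q->head >= q->len)
		return NULL;

	cur = &q->items[q->head++];
	next = q->head < q->len ? &q->items[q->head] : NULL;

	if (next && next->is_icmp)
		q->sk_err = next->err;
	else if (cur->is_icmp)
		q->sk_err = 0;

	return cur;
}
```

Looking only at the next entry is what makes this an approximation: an
ICMP error further back in the queue does not keep sk_err set until it
becomes the next entry.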

Fixes: f5f99309fa (sock: do not set sk_err in sock_dequeue_err_skb)
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:55:39 -05:00
David S. Miller f577e22c73 Merge branch 'lwt-bpf'
Thomas Graf says:

====================
bpf: BPF for lightweight tunnel encapsulation

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
skb as context. Programs attached to input and output are read-only.
Programs attached to lwtunnel_xmit() can modify packets, push headers
and redirect packets.

The facility can be used to:
 - Collect statistics and generate sampling data for a subset of traffic
   based on the dst utilized by the packet thus allowing to extend the
   existing realms.
 - Apply additional per route/dst filters to prohibit certain outgoing
   or incoming packets based on BPF filters. In particular, this allows
   to maintain per dst custom state across multiple packets in BPF maps
   and apply filters based on statistics and behaviour observed over time.
 - Attachment of L2 headers at transmit where resolving the L2 address
   is not required.
 - Possibly many more.

v3 -> v4:
 - Bumped LWT_BPF_MAX_HEADROOM from 128 to 256 (Alexei)
 - Renamed bpf_skb_push() helper to bpf_skb_change_head() to relate to
   existing bpf_skb_change_tail() helper (Alexei/Daniel)
 - Added check in __bpf_redirect_common() to verify that program added a
   link header before redirecting to a l2 device. Adding the check to
   lwt-bpf code was considered but dropped due to massive code required
   due to retrieval of net_device via per-cpu redirect buffer. A test
   case was added to cover the scenario when a program directs to an l2
   device without adding an appropriate l2 header.
   (Alexei)
 - Prohibited access to tc_classid (Daniel)
 - Collapsed bpf_verifier_ops instance for lwt in/out as they are
   identical (Daniel)
 - Some cosmetic changes

v2 -> v3:
 - Added real world sample lwt_len_hist_kern.c which demonstrates how to
   collect a histogram on packet sizes for all packets flowing through
   a number of routes.
 - Restricted output to be read-only. Since the header can no longer
   be modified, the rerouting functionality has been removed again.
 - Added test cases which cover destructive modification of packet data.

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for program to indicate
   that new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when user creates circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure headroom for potential future L2 header is still
   guaranteed
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:52:05 -05:00
Thomas Graf f74599f7c5 bpf: Add tests and samples for LWT-BPF
Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.

Also adds a sample which collects a histogram of packet sizes which
pass through an LWT hook.

$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    39857.69
       1 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 22       |                                      |
      64 -> 127      : 98       |                                      |
     128 -> 255      : 213      |                                      |
     256 -> 511      : 1444251  |********                              |
     512 -> 1023     : 660610   |***                                   |
    1024 -> 2047     : 535241   |**                                    |
    2048 -> 4095     : 19       |                                      |
    4096 -> 8191     : 180      |                                      |
    8192 -> 16383    : 5578023  |************************************* |
   16384 -> 32767    : 632099   |***                                   |
   32768 -> 65535    : 6575     |                                      |
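
The bucket boundaries in the output above are powers of two, so a packet
length can be mapped to its row with an integer log2. The following is a
minimal sketch of that mapping, not the sample's actual code; the helper
names log2_slot(), slot_low() and slot_high() are made up for
illustration.

```c
#include <stdint.h>

/* Index of the power-of-two bucket holding v: floor(log2(v)).
 * A length of 0 is placed in slot 0 together with 1.
 */
static unsigned int log2_slot(uint32_t v)
{
	unsigned int slot = 0;

	while (v >>= 1)
		slot++;
	return slot;
}

/* Lowest value that falls into the given slot (valid for slot <= 31). */
static uint32_t slot_low(unsigned int slot)
{
	return (uint32_t)1 << slot;
}

/* Highest value that falls into the given slot (valid for slot <= 31).
 * Computed in 64 bits so that slot 31 does not overflow the shift.
 */
static uint32_t slot_high(unsigned int slot)
{
	return (uint32_t)(((uint64_t)2 << slot) - 1);
}
```

With this mapping, a 9000 byte packet lands in slot 13, which is the
"8192 -> 16383" row.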

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:52:00 -05:00
Thomas Graf 3a0af8fd61 bpf: BPF for lightweight tunnel infrastructure
Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

 * Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevents modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

 * Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_
relatives to ease compatibility.
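
The handling of the three return codes can be pictured with the
user-space model below. It is only a sketch under stated assumptions: the
numeric values follow the TC_ACT_OK, TC_ACT_SHOT and TC_ACT_REDIRECT
actions, the lwt_hook enum and lwt_handle_ret() are hypothetical, and
mapping an illegal code or an out-of-context redirect to -EINVAL is an
assumption of this model rather than something stated above.

```c
#include <errno.h>
#include <stdbool.h>

/* Numeric values chosen to match TC_ACT_OK, TC_ACT_SHOT, TC_ACT_REDIRECT. */
enum lwt_ret {
	BPF_OK = 0,
	BPF_DROP = 2,
	BPF_REDIRECT = 7,
};

/* Hypothetical identifiers for the three attachment points. */
enum lwt_hook {
	LWT_IN,
	LWT_OUT,
	LWT_XMIT,
};

/* Translate a program's return code into an errno-style result.
 * *redirected is set only when a redirect is accepted.
 */
static int lwt_handle_ret(enum lwt_hook hook, int ret, bool *redirected)
{
	*redirected = false;

	switch (ret) {
	case BPF_OK:
		return 0;		/* continue routing as per nexthop */
	case BPF_DROP:
		return -EPERM;		/* drop the skb */
	case BPF_REDIRECT:
		if (hook != LWT_XMIT)
			return -EINVAL;	/* assumption: rejected outside xmit */
		*redirected = true;
		return 0;
	default:
		return -EINVAL;		/* assumption: unknown codes are errors */
	}
}
```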

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00
Thomas Graf efd8570081 route: Set lwtstate for local traffic and cached input dsts
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 10:51:49 -05:00