Commit graph

888741 commits

Author SHA1 Message Date
Julia Lawall 6485f9ae3b ptp: ptp_clockmatrix: constify copied structure
The idtcm_caps structure is only copied into another structure,
so make it const.

The opportunity for this change was found using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 16:31:03 -08:00
Ben Hutchings edf4579123 sfc: Remove unnecessary dependencies on I2C
Only the SFC4000 code, now moved to sfc-falcon, needed I2C.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 16:23:18 -08:00
David S. Miller 7a8d8a4642 Merge branch 'tcp-Add-support-for-L3-domains-to-MD5-auth'
David Ahern says:

====================
tcp: Add support for L3 domains to MD5 auth

With VRF, the scope of network addresses is limited to the L3 domain
the device is associated. MD5 keys are based on addresses, so proper
VRF support requires an L3 domain to be considered for the lookups.

Leverage the new TCP_MD5SIG_EXT option to add support for a device index
to MD5 keys. The __tcpm_pad entry in tcp_md5sig is renamed to tcpm_ifindex
and a new flag, TCP_MD5SIG_FLAG_IFINDEX, in tcpm_flags determines if the
entry is examined. This follows what was done for MD5 and prefixes with
commits
   8917a777be ("tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix")
   6797318e62 ("tcp: md5: add an address prefix for key lookup")

Handling both a device AND L3 domain is much more complicated for the
response paths. This set focuses only on L3 support - requiring the
device index to be an l3mdev (ie, VRF). Support for slave devices can
be added later if desired, much like the progression of support for
sockets bound to a VRF and then bound to a device in a VRF. Kernel
code is setup to explicitly call out that current lookup is for an L3
index, while the uapi just references a device index allowing its
meaning to include other devices in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern 5cad8bce26 fcnal-test: Add TCP MD5 tests for VRF
Add tests for new TCP MD5 API for L3 domains (VRF).

A new namespace is added to create a duplicate configuration between
the VRF and default VRF to verify overlapping config is handled properly.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern f0bee1ebb5 fcnal-test: Add TCP MD5 tests
Add tests for existing TCP MD5 APIs - both single address
config and the new extended API for prefixes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern eb09cf03b9 nettest: Add support for TCP_MD5 extensions
Update nettest to implement TCP_MD5SIG_EXT for a prefix and a device.

Add a new option, -m, to specify a prefix and length to use with MD5
auth. The device option comes from the existing -d option. If either
are set and MD5 auth is requested, TCP_MD5SIG_EXT is used instead of
TCP_MD5SIG.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern 1bfb45d860 nettest: Return 1 on MD5 failure for server mode
On failure to set MD5 password, do_server should return 1 so that the
program exits with 1 rather than 255. This used for negative testing
when adding MD5 with device option.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern 6b102db50c net: Add device index to tcp_md5sig
Add support for userspace to specify a device index to limit the scope
of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
is renamed to tcpm_ifindex and the new field is only checked if the new
TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
must point to an L3 master device (e.g., VRF). The API and error
handling are setup to allow the constraint to be relaxed in the future
to any device index.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern dea53bb80e tcp: Add l3index to tcp_md5sig_key and md5 functions
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.

The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern 534322ca3d ipv4/tcp: Pass dif and sdif to tcp_v4_inbound_md5_hash
The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v4_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern d14c77e0b2 ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hash
The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern cea9760950 ipv4/tcp: Use local variable for tcp_md5_addr
Extract the typecast to (union tcp_md5_addr *) to a local variable
rather than the current long, inline declaration with function calls.

No functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
Niu Xilei 98c8147648 vxlan: Fix alignment and code style of vxlan.c
Fixed Coding function and style issues

Signed-off-by: Niu Xilei <niu_xilei@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:41:33 -08:00
David S. Miller f5e5d27248 Merge branch 'mlxsw-Allow-setting-default-port-priority'
Ido Schimmel says:

====================
mlxsw: Allow setting default port priority

Petr says:

When LLDP APP TLV selector 1 (EtherType) is used with PID of 0, the
corresponding entry specifies "default application priority [...] when
application priority is not otherwise specified."

mlxsw currently supports this type of APP entry, but uses it only as a
fallback for unspecified DSCP rules. However non-IP traffic is prioritized
according to port-default priority, not according to the DSCP-to-prio
tables, and thus it's currently not possible to prioritize such traffic
correctly.

This patchset extends the use of the abovementioned APP entry to also set
default port priority (in patches #1 and #2) and then (in patch #3) adds a
selftest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:40:14 -08:00
Petr Machata c5341bcc33 selftests: mlxsw: Add a self-test for port-default priority
Send non-IP traffic to a port and observe that it gets prioritized
according to the lldptool app=$prio,1,0 rules.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:40:14 -08:00
Petr Machata 379a00dd21 mlxsw: spectrum_dcb: Allow setting default port priority
When APP TLV selector 1 (EtherType) is used with PID of 0, the
corresponding entry specifies "default application priority [...] when
application priority is not otherwise specified."

mlxsw currently supports this type of APP entry, but uses it only as a
fallback for unspecified DSCP rules. However non-IP traffic is prioritized
according to port-default priority, not according to the DSCP-to-prio
tables, and thus it's currently not possible to prioritize such traffic
correctly.

Extend the use of the abovementioned APP entry to also set default port
priority.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:40:14 -08:00
Petr Machata d8446884f8 mlxsw: reg: Add QoS Port DSCP to Priority Mapping Register
Add QPDP. This register controls the port default Switch Priority and
Color. The default Switch Priority and Color are used for frames where the
trust state uses default values. Currently there are two cases where this
applies: a port is in trust-PCP state, but a packet arrives untagged; and a
port is in trust-DSCP state, but a non-IP packet arrives.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:40:14 -08:00
David S. Miller c9a2069b1d Merge branch 'page_pool-NUMA-node-handling-fixes'
Jesper Dangaard Brouer says:

====================
page_pool: NUMA node handling fixes

The recently added NUMA changes (merged for v5.5) to page_pool, it both
contains a bug in handling NUMA_NO_NODE condition, and added code to
the fast-path.

This patchset fixes the bug and moves code out of fast-path. The first
patch contains a fix that should be considered for 5.5. The second
patch reduce code size and overhead in case CONFIG_NUMA is disabled.

Currently the NUMA_NO_NODE setting bug only affects driver 'ti_cpsw'
(drivers/net/ethernet/ti/), but after this patchset, we plan to move
other drivers (netsec and mvneta) to use NUMA_NO_NODE setting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:37:53 -08:00
Jesper Dangaard Brouer f13fc10785 page_pool: help compiler remove code in case CONFIG_NUMA=n
When kernel is compiled without NUMA support, then page_pool NUMA
config setting (pool->p.nid) doesn't make any practical sense. The
compiler cannot see that it can remove the code paths.

This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA,
in allocation and numa check code, which helps compiler to see the
optimisation potential. It leaves update code intact to keep API the
same.

 $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
                           net/core/page_pool.o-numa-disabled
 add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
 Function                                     old     new   delta
 page_pool_create                             401     398      -3
 __page_pool_alloc_pages_slow                 439     426     -13
 page_pool_refill_alloc_cache                 425     328     -97
 Total: Before=3611, After=3498, chg -3.13%

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:37:52 -08:00
Jesper Dangaard Brouer 44768decb7 page_pool: handle page recycle for NUMA_NO_NODE condition
The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1 ("page_pool: Don't
recycle non-reusable pages"), were to have RX-pages that belongs to the
same NUMA node as the CPU processing RX-packet during softirq/NAPI. As
illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
as the the preferred nid.  It is only in rare situations, where
e.g. NUMA zone runs dry, that page gets doesn't get allocated from
preferred nid.  The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NAPI check to when alloc cache is refilled, via
dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
pages from remote NUMA into the ptr_ring, as the dequeue/consume step
will check the NUMA node. All current drivers using page_pool will
alloc/refill RX-ring from same CPU running softirq/NAPI process.

Drivers that control the nid explicitly, also use page_pool_update_nid
when changing nid runtime.  To speed up transision to new nid the alloc
cache is now flushed on nid changes.  This force pages to come from
ptr_ring, which does the appropate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid change runtime via consulting
numa_mem_id() based on the CPU processing RX-packets.

Notice, to avoid stressing the page buddy allocator and avoid doing too
much work under softirq with preempt disabled, the NUMA check at
ptr_ring dequeue will break the refill cycle, when detecting a NUMA
mismatch. This will cause a slower transition, but its done on purpose.

Fixes: d5394610b1 ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:37:52 -08:00
David S. Miller fe23d63422 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2019-12-31

This series contains updates to e1000e, igb and igc only.

Robert Beckett provide an igb change to assist in keeping packets from
being dropped due to receive descriptor ring being full when receive
flow control is enabled.  Create a separate function to setup SRRCTL to
ease in reuse and ensure that setting of the drop enable bit only if
receive flow control is not enabled.

Sasha adds support for scatter gather support in igc.  Improve the
direct memory address mapping flow by optimizing/simplifying and more
clear.  Update igc to use pci_release_mem_regions() instead of
pci_release_selected_regions().  Clean up function header comments to
align with the actual code.  Adds support for 64 bit DMA access, to help
handle socket buffer fragments in high memory.  Adds legacy power
management support in igc by implementing suspend, resume,
runtime_suspend/resume, and runtime_idle callbacks.  Clean up references
to Serdes interface in igc since that interface is not supported for
i225 devices.

Alex replaces the pr_info calls with netdev_info in all cases related to
netdev link state, as suggested by Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31 21:43:31 -08:00
Sasha Neftin 684ea87cc3 igc: Remove serdes comments from a description of methods
Serdes interface is not applicable for i225 devices.
Remove this from comments and make comments more clearly.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 14:25:47 -08:00
Alexander Duyck c557a4b3f7 e1000e: Use netdev_info instead of pr_info for link messages
Replace the pr_info calls with netdev_info in all cases related to the
netdevice link state.

As a result of this patch the link messages will change as shown below.
Before:
e1000e: ens3 NIC Link is Down
e1000e: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

After:
e1000e 0000:00:03.0 ens3: NIC Link is Down
e1000e 0000:00:03.0 ens3: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 14:25:41 -08:00
Sasha Neftin 9513d2a5dc igc: Add legacy power management support
Add suspend, resume, runtime_suspend, runtime_resume and
runtime_idle callbacks implementation.

Reported-by: kbuild test robot <lpk@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 14:25:17 -08:00
David S. Miller 31d518f35e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31 13:37:13 -08:00
Sasha Neftin 4439dc427d igc: Add 64 bit DMA access support
On relevant platforms ndo_start_xmit can handle socket buffer
fragments in high memory

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin 86efeccd5a igc: Fix parameter descriptions for a several functions
igc_watchdog, igc_set_interrupt_capability, igc_init_interrupt_scheme,
__igc_open and __igc_close parameter descriptions has not reflected
functions meaning. Add meaningful description.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin 085c858950 igc: Fix the parameter description for igc_alloc_rx_buffers
The function description for igc_alloc_rx_buffers has not reflected
the function meaning. Add meaningful description.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin 57cd472c2b igc: Remove excess parameter description from igc_is_non_eop
The function description for igc_is_non_eop includes an extra @skb
parameter description. This parameter doesn't exist on the function, so
remove it.

Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin faf4dd52e9 igc: Prefer to use the pci_release_mem_regions method
Use the pci_release_mem_regions method instead of the
pci_release_selected_regions method

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin 21da01fd3b igc: Improve the DMA mapping flow
Improve the probe flow and set both the DMA mask and the coherent
to the same thing. Make the flow optimized and cleared.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Sasha Neftin b7b462454a igc: Add scatter gather support
Scatter gather is used to do DMA data transfers of data that is written to
noncontiguous areas of memory.
This patch enables scatter gather support.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Robert Beckett 6506f52dcb igb: dont drop packets if rx flow control is enabled
If Rx flow control has been enabled (via autoneg or forced), packets
should not be dropped due to Rx descriptor ring exhaustion. Instead
pause frames should be used to apply back pressure. This only applies
if VFs are not in use.

Move SRRCTL setup to its own function for easy reuse and only set drop
enable bit if Rx flow control is not enabled.

Since v1: always enable dropping of packets if VFs in use.

Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-12-31 11:17:25 -08:00
Linus Torvalds 738d290277 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix big endian overflow in nf_flow_table, from Arnd Bergmann.

 2) Fix port selection on big endian in nft_tproxy, from Phil Sutter.

 3) Fix precision tracking for unbound scalars in bpf verifier, from
    Daniel Borkmann.

 4) Fix integer overflow in socket rcvbuf check in UDP, from Antonio
    Messina.

 5) Do not perform a neigh confirmation during a pmtu update over a
    tunnel, from Hangbin Liu.

 6) Fix DMA mapping leak in dpaa_eth driver, from Madalin Bucur.

 7) Various PTP fixes for sja1105 dsa driver, from Vladimir Oltean.

 8) Add missing to dummy definition of of_mdiobus_child_is_phy(), from
    Geert Uytterhoeven

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
  net/sched: add delete_empty() to filters and use it in cls_flower
  tcp: Fix highest_sack and highest_sack_seq
  ptp: fix the race between the release of ptp_clock and cdev
  net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
  Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
  net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
  net: dsa: sja1105: Really make the PTP command read-write
  net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
  cxgb4/cxgb4vf: fix flow control display for auto negotiation
  mlxsw: spectrum: Use dedicated policer for VRRP packets
  mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
  net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
  net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
  net_sched: sch_fq: properly set sk->sk_pacing_status
  bnx2x: Fix accounting of vlan resources among the PFs
  bnx2x: Use appropriate define for vlan credit
  of: mdio: Add missing inline to of_mdiobus_child_is_phy() dummy
  net: phy: aquantia: add suspend / resume ops for AQR105
  dpaa_eth: fix DMA mapping leak
  ...
2019-12-31 11:14:58 -08:00
Linus Torvalds c5c928c667 Two bugfix patches for 5.5.
tomoyo: Suppress RCU warning at list_for_each_entry_rcu().
   tomoyo: Don't use nifty names on sockets.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJeCqRMAAoJEEJfEo0MZPUqmRoQAJdWfsAObzNRq8VsiZJTQ7NV
 Y4Nz1QYAUMjsSnWCTY1iI/9a4SY7zOmlGoCO9pC4leGeV/H9FCp+flBVBs6nEiQF
 A/bRT9P9ek3urr+kOj1A9udxRRXQdQ292TVd9Ll7fJuamLHsLAQSTht27SdJyStQ
 FY5COJCTC3Qvoz/jdkqQ1qEyYUavH6FvNjN9eIsjLow6BahHzDj/sw6WC/iUQhOC
 55mK10bN7jHvewRsrW5HLQ0aUazz/6FTZIuVckFpk/R67aljEIsMccAeYw7XeBWS
 a4AqI5a+8Go8ryXeM6y76JF3SnWpX9PZYLMYZz2SYkohBbl+ivVKJZOWEOQPhYTO
 wFAFSwZVA4uEsYEQF7qWGsQGMS/QR1tre2na4dWjpSZ0Ly2xX81tcjEhVYq6jsuk
 1MGHTDCc93dMJW8OKx31CRRr9mIkJ4C1pJqQzlApjkqxUMq3Bxdidc4WOshB20y6
 cLf1nfIor4/8VBb8LdBICPcedfHWk3KY6nL5yTqUjtETln7Ba/UeXpPKL9kH15Pg
 N29AzQfNuRyTL/s51ZRrvmh/WJWIrl2xLue3s5u8yJA6DivUb8U46BK+BQn+POHG
 XJDydAxynKYbPlqbNE6yoFl36BwyNwy5vWQp/0ONiwXHPSl3lsi5LYWy0XGODS8a
 9SSbLovjKTvMeLnoN+AI
 =11MC
 -----END PGP SIGNATURE-----

Merge tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull tomoyo fixes from Tetsuo Handa:
 "Two bug fixes:

   - Suppress RCU warning at list_for_each_entry_rcu()

   - Don't use fancy names on sockets"

* tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
  tomoyo: Suppress RCU warning at list_for_each_entry_rcu().
  tomoyo: Don't use nifty names on sockets.
2019-12-31 10:51:27 -08:00
Taehee Yoo 04b69426d8 hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
hsr slave interfaces don't have debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
is changed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link set dummy0 name ap

Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666]  dump_stack+0x96/0xdb
[21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666]  __kasan_report+0x12a/0x16f
[21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666]  kasan_report+0xe/0x20
[21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666]  ? __module_text_address+0x13/0x140
[21071.945897][T22666]  notifier_call_chain+0x90/0x160
[21071.946743][T22666]  dev_change_name+0x419/0x840
[21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
[21071.951991][T22666]  do_setlink+0x811/0x2ef0
[21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
[ ... ]

Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dc ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:36:27 -08:00
Davide Caratti a5b72a083d net/sched: add delete_empty() to filters and use it in cls_flower
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.

This reverts commit 275c44aa19.

Changes since v1:
 - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
   is used, thanks to Vlad Buslov
 - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
 - squash revert and new fix in a single patch, to be nice with bisect
   tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa19 ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:35:19 -08:00
Vijay Khemka 9e860947d8 net/ncsi: Fix gma flag setting after response
gma_flag was set at the time of GMA command request but it should
only be set after getting successful response. Movinng this flag
setting in GMA response handler.

This flag is used mainly for not repeating GMA command once
received MAC address.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:34:17 -08:00
Kevin Kou f398efc14a sctp: add enabled check for path tracepoint loop.
sctp_outq_sack is the main function handles SACK, it is called very
frequently. As the commit "move trace_sctp_probe_path into sctp_outq_sack"
added below code to this function, sctp tracepoint is disabled most of time,
but the loop of transport list will be always called even though the
tracepoint is disabled, this is unnecessary.

+	/* SCTP path tracepoint for congestion control debugging. */
+	list_for_each_entry(transport, transport_list, transports) {
+		trace_sctp_probe_path(transport, asoc);
+	}

This patch is to add tracepoint enabled check at outside of the loop of
transport list, and avoid traversing the loop when trace is disabled,
it is a small optimization.

Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:33:05 -08:00
David S. Miller 9010ef5759 Merge branch 'Improvements-to-SJA1105-DSA-RX-timestamping'
Vladimir Oltean says:

====================
Improvements to SJA1105 DSA RX timestamping

This series makes the sja1105 DSA driver use a dedicated kernel thread
for RX timestamping, a process which is time-sensitive and otherwise a
bit fragile. This allows users to customize their system (probabil an
embedded PTP switch) fully and allocate the CPU bandwidth for the driver
to expedite the RX timestamps as quickly as possible.

While doing this conversion, add a function to the PTP core for
cancelling this kernel thread (function which I found rather strange to
be missing).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:31:41 -08:00
Vladimir Oltean 19d1f0ed74 net: dsa: sja1105: Empty the RX timestamping queue on PTP settings change
When disabling PTP timestamping, don't reset the switch with the new
static config until all existing PTP frames have been timestamped on the
RX path or dropped. There's nothing we can do with these afterwards.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:31:40 -08:00
Vladimir Oltean 1e762bd278 net: dsa: sja1105: Use PTP core's dedicated kernel thread for RX timestamping
And move the queue of skb's waiting for RX timestamps into the ptp_data
structure, since it isn't needed if PTP is not compiled.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:31:40 -08:00
Vladimir Oltean 544fed47af ptp: introduce ptp_cancel_worker_sync
In order to effectively use the PTP kernel thread for tasks such as
timestamping packets, allow the user control over stopping it, which is
needed e.g. when the timestamping queues must be drained.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:31:40 -08:00
Cambda Zhu 853697504d tcp: Fix highest_sack and highest_sack_seq
>From commit 50895b9de1 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when new skb
inserted into the rtx queue, because the NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
the next ACK with sack option will increase tp->reordering unexpectedly.

This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.

Fixes: 50895b9de1 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:28:39 -08:00
Vladis Dronov a33121e548 ptp: fix the race between the release of ptp_clock and cdev
In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
device is removed, closing this file leads to a race. This reproduces
easily in a kvm virtual machine:

ts# cat openptp0.c
int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
ts# uname -r
5.5.0-rc3-46cf053e
ts# cat /proc/cmdline
... slub_debug=FZP
ts# modprobe ptp_kvm
ts# ./openptp0 &
[1] 670
opened /dev/ptp0, sleeping 10s...
ts# rmmod ptp_kvm
ts# ls /dev/ptp*
ls: cannot access '/dev/ptp*': No such file or directory
ts# ...woken up
[   48.010809] general protection fault: 0000 [#1] SMP
[   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
[   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
[   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
[   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
[   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
[   48.019470] ...                                              ^^^ a slub poison
[   48.023854] Call Trace:
[   48.024050]  __fput+0x21f/0x240
[   48.024288]  task_work_run+0x79/0x90
[   48.024555]  do_exit+0x2af/0xab0
[   48.024799]  ? vfs_write+0x16a/0x190
[   48.025082]  do_group_exit+0x35/0x90
[   48.025387]  __x64_sys_exit_group+0xf/0x10
[   48.025737]  do_syscall_64+0x3d/0x130
[   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   48.026479] RIP: 0033:0x7f53b12082f6
[   48.026792] ...
[   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
[   48.045001] Fixing recursive fault but reboot is needed!

This happens in:

static void __fput(struct file *file)
{   ...
    if (file->f_op->release)
        file->f_op->release(inode, file); <<< cdev is kfree'd here
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
             !(mode & FMODE_PATH))) {
        cdev_put(inode->i_cdev); <<< cdev fields are accessed here

Namely:

__fput()
  posix_clock_release()
    kref_put(&clk->kref, delete_clock) <<< the last reference
      delete_clock()
        delete_ptp_clock()
          kfree(ptp) <<< cdev is embedded in ptp
  cdev_put
    module_put(p->owner) <<< *p is kfree'd, bang!

Here cdev is embedded in posix_clock which is embedded in ptp_clock.
The race happens because ptp_clock's lifetime is controlled by two
refcounts: kref and cdev.kobj in posix_clock. This is wrong.

Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
created especially for such cases. This way the parent device with its
ptp_clock is not released until all references to the cdev are released.
This adds a requirement that an initialized but not exposed struct
device should be provided to posix_clock_register() by a caller instead
of a simple dev_t.

This approach was adopted from the commit 72139dfa24 ("watchdog: Fix
the race between the release of watchdog_core_data and cdev"). See
details of the implementation in the commit 233ed09d7f ("chardev: add
helper function to register char devs with a struct device").

Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:19:27 -08:00
Vladimir Oltean 54fa49ee88 net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
For first-generation switches (SJA1105E and SJA1105T):
- TPID means C-Tag (typically 0x8100)
- TPID2 means S-Tag (typically 0x88A8)

While for the second generation switches (SJA1105P, SJA1105Q, SJA1105R,
SJA1105S) it is the other way around:
- TPID means S-Tag (typically 0x88A8)
- TPID2 means C-Tag (typically 0x8100)

In other words, E/T tags untagged traffic with TPID, and P/Q/R/S with
TPID2.

So the patch mentioned below fixed VLAN filtering for P/Q/R/S, but broke
it for E/T.

We strive for a common code path for all switches in the family, so just
lie in the static config packing functions that TPID and TPID2 are at
swapped bit offsets than they actually are, for P/Q/R/S. This will make
both switches understand TPID to be ETH_P_8021Q and TPID2 to be
ETH_P_8021AD. The meaning from the original E/T was chosen over P/Q/R/S
because E/T is actually the one with public documentation available
(UM10944.pdf).

Fixes: f9a1a7646c ("net: dsa: sja1105: Reverse TPID and TPID2")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:15:02 -08:00
Vladimir Oltean 3a323ed7c9 Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
Since commit 86db36a347 ("net: dsa: sja1105: Implement state machine
for TAS with PTP clock source"), this paragraph is no longer true. So
remove it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:14:27 -08:00
Vladimir Oltean d00bdc0a88 net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
The check originates from the initial implementation which was not based
on PTP time but on a standalone clock source. In the meantime we can now
program the PTPSCHTM register at runtime with the dynamic base time
(actually with a value that is 200 ns smaller, to avoid writing DELTA=0
in the Schedule Entry Points Parameters Table). And we also have logic
for moving the actual base time in the future of the PHC's current time
base, so the check for zero serves no purpose, since even if the user
will specify zero, that's not what will end up in the static config
table where the limitation is.

Fixes: 86db36a347 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:13:11 -08:00
Vladimir Oltean 5a47f588ee net: dsa: sja1105: Really make the PTP command read-write
When activating tc-taprio offload on the switch ports, the TAS state
machine will try to check whether it is running or not, but will find
both the STARTED and STOPPED bits as false in the
sja1105_tas_check_running function. So the function will return -EINVAL
(an abnormal situation) and the kernel will keep printing this from the
TAS FSM workqueue:

[   37.691971] sja1105 spi0.1: An operation returned -22

The reason is that the underlying function that gets called,
sja1105_ptp_commit, does not actually do a SPI_READ, but a SPI_WRITE. So
the command buffer remains initialized with zeroes instead of retrieving
the hardware state. Fix that.

Fixes: 41603d78b3 ("net: dsa: sja1105: Make the PTP command read-write")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:11:28 -08:00
Vladimir Oltean 9fcf024dd6 net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
The PTP egress timestamp N must be captured from register PTPEGR_TS[n],
where n = 2 * PORT + TSREG. There are 10 PTPEGR_TS registers, 2 per
port. We are only using TSREG=0.

As opposed to the management slots, which are 4 in number
(SJA1105_NUM_PORTS, minus the CPU port). Any management frame (which
includes PTP frames) can be sent to any non-CPU port through any
management slot. When the CPU port is not the last port (#4), there will
be a mismatch between the slot and the port number.

Luckily, the only mainline occurrence with this switch
(arch/arm/boot/dts/ls1021a-tsn.dts) does have the CPU port as #4, so the
issue did not manifest itself thus far.

Fixes: 47ed985e97 ("net: dsa: sja1105: Add logic for TX timestamping")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:10:20 -08:00