Commit graph

58400 commits

Author SHA1 Message Date
David S. Miller c3afb7eab0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2019-11-13

1) Fix a page memleak on xfrm state destroy.

2) Fix a refcount imbalance if a xfrm_state
   gets invaild during async resumption.
   From Xiaodong Xu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-13 11:26:20 -08:00
Oleksij Rempel 4a15d574e6 can: j1939: warn if resources are still linked on destroy
j1939_session_destroy() and __j1939_priv_release() should be called only
if session, ecu or socket are not linked or used by any one else. If at
least one of these resources is linked, then the reference counting is
broken somewhere.

This warning will be triggered before KASAN will do, and will make it
easier to debug initial issue. This works on platforms without KASAN
support.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel ddeeb7d482 can: j1939: j1939_can_recv(): add priv refcounting
j1939_can_recv() can be called in parallel with socket release. In this
case sk_release and sk_destruct can be done earlier than
j1939_can_recv() is processed.

Reported-by: syzbot+ca172a0ac477ac90f045@syzkaller.appspotmail.com
Reported-by: syzbot+07ca5bce8530070a5650@syzkaller.appspotmail.com
Reported-by: syzbot+a47537d3964ef6c874e1@syzkaller.appspotmail.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel 8d7a5f000e can: j1939: transport: j1939_cancel_active_session(): use hrtimer_try_to_cancel() instead of hrtimer_cancel()
This part of the code protected by lock used in the hrtimer as well.
Using hrtimer_cancel() will trigger dead lock.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel 62ebce1dc1 can: j1939: make sure socket is held as long as session exists
We link the socket to the session to be able provide socket specific
notifications. For example messages over error queue.

We need to keep the socket held, while we have a reference to it.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel d966635b38 can: j1939: transport: make sure the aborted session will be deactivated only once
j1939_session_cancel() was modifying session->state without protecting
it by locks and without checking actual state of the session.

This patch moves j1939_tp_set_rxtimeout() into j1939_session_cancel()
and adds the missing locking.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel fd81ebfe79 can: j1939: socket: rework socket locking for j1939_sk_release() and j1939_sk_sendmsg()
j1939_sk_sendmsg() should be protected by lock_sock() to avoid race with
j1939_sk_bind() and j1939_sk_release().

Reported-by: syzbot+afd421337a736d6c1ee6@syzkaller.appspotmail.com
Reported-by: syzbot+6d04f6a1b31a0ae12ca9@syzkaller.appspotmail.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel c48c8c1e2e can: j1939: main: j1939_ndev_to_priv(): avoid crash if can_ml_priv is NULL
This patch avoids a NULL pointer deref crash if ndev->ml_priv is NULL.

Reported-by: syzbot+95c8e0d9dffde15b6c5c@syzkaller.appspotmail.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:34 +01:00
Oleksij Rempel 25fe97cb76 can: j1939: move j1939_priv_put() into sk_destruct callback
This patch delays the j1939_priv_put() until the socket is destroyed via
the sk_destruct callback, to avoid use-after-free problems.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:33 +01:00
Oleksij Rempel 975987e701 can: af_can: export can_sock_destruct()
In j1939 we need our own struct sock::sk_destruct callback. Export the
generic af_can can_sock_destruct() that allows us to chain-call it.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
2019-11-13 10:42:33 +01:00
Pablo Neira Ayuso 3944a4fd0d Merge branch 'master' of git://blackhole.kfki.hu/nf-next
Jozsef Kadlecsik says:

====================
ipset patches for nf-next

- Add wildcard support to hash:net,iface which makes possible to
  match interface prefixes besides complete interfaces names, from
  Kristian Evensen.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13 10:42:07 +01:00
Pablo Neira Ayuso f6ae9f120d netfilter: nft_payload: add C-VLAN support
If the encapsulated ethertype announces another inner VLAN header and
the offset falls within the boundaries of the inner VLAN header, then
adjust arithmetics to include the extra VLAN header length and fetch the
bytes from the vlan header in the skbuff data area that represents this
inner VLAN header.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13 10:41:42 +01:00
Pablo Neira Ayuso be193f5e21 netfilter: nf_tables_offload: pass extack to nft_flow_cls_offload_setup()
Otherwise this leads to a stack corruption.

Fixes: c5d275276f ("netfilter: nf_tables_offload: add nft_flow_cls_offload_setup()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13 10:41:41 +01:00
Pablo Neira Ayuso 8dfd8b09aa netfilter: nf_tables: add nft_payload_rebuild_vlan_hdr()
Wrap the code to rebuild the ethernet + vlan header into a function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2019-11-13 10:41:35 +01:00
Pablo Neira Ayuso f41f72d09e netfilter: nft_payload: simplify vlan header handling
If the offset is within the ethernet + vlan header size boundary, then
rebuild the ethernet + vlan header and use it to copy the bytes to the
register. Otherwise, subtract the vlan header size from the offset and
fall back to use skb_copy_bits().

There is one corner case though: If the offset plus the length of the
payload instruction goes over the ethernet + vlan header boundary, then,
fetch as many bytes as possible from the rebuilt ethernet + vlan header
and fall back to copy the remaining bytes through skb_copy_bits().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2019-11-13 10:41:34 +01:00
Pablo Neira Ayuso 25da5eb32c netfilter: nft_meta: offload support for interface index
This patch adds support for offloading the NFT_META_IIF selector.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13 10:41:34 +01:00
Matthias Schiffer 542575fe4b bridge: implement get_link_ksettings ethtool method
We return the maximum speed of all active ports. This matches how the link
speed would give an upper limit for traffic to/from any single peer if the
bridge were replaced with a hardware switch.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:52:15 -08:00
Florian Fainelli 129bd7ca8a net: dsa: Prevent usage of NET_DSA_TAG_8021Q as tagging protocol
It is possible for a switch driver to use NET_DSA_TAG_8021Q as a valid
DSA tagging protocol since it registers itself as such, unfortunately
since there are not xmit or rcv functions provided, the lack of a xmit()
function will lead to a NPD in dsa_slave_xmit() to start with.

net/dsa/tag_8021q.c is only comprised of a set of helper functions at
the moment, but is not a fully autonomous or functional tagging "driver"
(though it could become later on). We do not have any users of
NET_DSA_TAG_8021Q so now is a good time to make sure there are not
issues being encountered by making this file strictly a place holder for
helper functions.

Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:46:27 -08:00
Hoang Le 46cb01eeeb tipc: update mon's self addr when node addr generated
In commit 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address
hash values"), the 32-bit node address only generated after one second
trial period expired. However the self's addr in struct tipc_monitor do
not update according to node address generated. This lead to it is
always zero as initial value. As result, sorting algorithm using this
value does not work as expected, neither neighbor monitoring framework.

In this commit, we add a fix to update self's addr when 32-bit node
address generated.

Fixes: 25b0b9c4e8 ("tipc: handle collisions of 32-bit node address hash values")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:45:45 -08:00
Pablo Neira Ayuso c29f74e0df netfilter: nf_flow_table: hardware offload support
This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:

* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
  from hardware. This might be triggered by either packet path (via TCP
  RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
  the hardware, the software garbage collector can remove it from the
  software flowtable.

This patch supports for:

* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
  at this stage.

This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso 8bb69f3b29 netfilter: nf_tables: add flowtable offload control plane
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso f1363e058b netfilter: nf_flow_table: detach routing information from flow description
This patch adds the infrastructure to support for flow entry types.
The initial type is NF_FLOW_OFFLOAD_ROUTE that stores the routing
information into the flow entry to define a fastpath for the classic
forwarding path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso 62248df88a netfilter: nf_flowtable: remove flow_offload_entry structure
Move rcu_head to struct flow_offload, then remove the flow_offload_entry
structure definition.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso b32d2f3416 netfilter: nf_flow_table: move conntrack object to struct flow_offload
Simplify this code by storing the pointer to conntrack object in the
flow_offload structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
David S. Miller ce9db46436 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: linux-bluetooth 2019-11-11

Here's one more bluetooth-next pull request for the 5.5 kernel release.

 - Several fixes for LE advertising
 - Added PM support to hci_qca driver
 - Added support for WCN3991 SoC in hci_qca driver
 - Added DT bindings for BCM43540 module
 - A few other small cleanups/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:29:57 -08:00
Ursula Braun 6d6dd528d5 net/smc: fix refcount non-blocking connect() -part 2
If an SMC socket is immediately terminated after a non-blocking connect()
has been called, a memory leak is possible.
Due to the sock_hold move in
commit 301428ea37 ("net/smc: fix refcounting for non-blocking connect()")
an extra sock_put() is needed in smc_connect_work(), if the internal
TCP socket is aborted and cancels the sk_stream_wait_connect() of the
connect worker.

Reported-by: syzbot+4b73ad6fc767e576e275@syzkaller.appspotmail.com
Fixes: 301428ea37 ("net/smc: fix refcounting for non-blocking connect()")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 12:15:29 -08:00
Davide Caratti e0e2b35b79 net/sched: actions: remove unused 'order'
after commit 4097e9d250 ("net: sched: don't use tc_action->order during
action dump"), 'act->order' is initialized but then it's no more read, so
we can just remove this member of struct tc_action.

CC: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 12:11:22 -08:00
Aya Levin e2cde864a1 devlink: Allow large formatted message of binary output
Devlink supports pair output of name and value. When the value is
binary, it must be presented in an array. If the length of the binary
value exceeds fmsg limitation, break the value into chunks internally.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 11:25:44 -08:00
Tejun Heo 743210386c cgroup: use cgrp->kn->id as the cgroup ID
cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf.  This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.

The preceding changes made kn->id exposed as ino as 64bit ino on
supported archs or ino+gen (low 32bits as ino, high gen).  There's no
reason for cgroup to use different IDs.  The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.

This patch replaces cgroup IDs with kernfs IDs.

* cgroup_id() is added and all cgroup ID users are converted to use it.

* kernfs_node creation is moved to earlier during cgroup init so that
  cgroup_id() is available during init.

* While at it, s/cgroup/cgrp/ in psi helpers for consistency.

* Fallback ID value is changed to 1 to be consistent with root cgroup
  ID.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo 67c0496e87 kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12 08:18:03 -08:00
Tejun Heo db53c73a8b netprio: use css ID instead of cgroup ID
netprio uses cgroup ID to index the priority mapping table.  This is
currently okay as cgroup IDs are allocated using idr and packed.
However, cgroup IDs will be changed to use full 64bit range and won't
be packed making this impractical.  netprio doesn't care what type of
IDs it uses as long as they can identify the controller instances and
are packed.  Let's switch to css IDs instead of cgroup IDs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Xiaodong Xu 4944a4b107 xfrm: release device reference for invalid state
An ESP packet could be decrypted in async mode if the input handler for
this packet returns -EINPROGRESS in xfrm_input(). At this moment the device
reference in skb is held. Later xfrm_input() will be invoked again to
resume the processing.
If the transform state is still valid it would continue to release the
device reference and there won't be a problem; however if the transform
state is not valid when async resumption happens, the packet will be
dropped while the device reference is still being held.
When the device is deleted for some reason and the reference to this
device is not properly released, the kernel will keep logging like:

unregister_netdevice: waiting for ppp2 to become free. Usage count = 1

The issue is observed when running IPsec traffic over a PPPoE device based
on a bridge interface. By terminating the PPPoE connection on the server
end for multiple times, the PPPoE device on the client side will eventually
get stuck on the above warning message.

This patch will check the async mode first and continue to release device
reference in async resumption, before it is dropped due to invalid state.

v2: Do not assign address family from outer_mode in the transform if the
state is invalid

v3: Release device reference in the error path instead of jumping to resume

Fixes: 4ce3dbe397 ("xfrm: Fix xfrm_input() to verify state is valid when (encap_type < 0)")
Signed-off-by: Xiaodong Xu <stid.smth@gmail.com>
Reported-by: Bo Chen <chenborfc@163.com>
Tested-by: Bo Chen <chenborfc@163.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-11-12 08:24:38 +01:00
Colin Ian King c33fdc3453 tipc: fix update of the uninitialized variable err
Variable err is not uninitialized and hence can potentially contain
any garbage value.  This may cause an error when logical or'ing the
return values from the calls to functions crypto_aead_setauthsize or
crypto_aead_setkey.  Fix this by setting err to the return of
crypto_aead_setauthsize rather than or'ing in the return into the
uninitialized variable

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11 22:04:03 -08:00
Aya Levin d279505b72 devlink: Add method for time-stamp on reporter's dump
When setting the dump's time-stamp, use ktime_get_real in addition to
jiffies. This simplifies the user space implementation and bypasses
some inconsistent behavior with translating jiffies to current time.
The time taken is transformed into nsec, to comply with y2038 issue.

Fixes: c8e1da0bf9 ("devlink: Add health report functionality")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11 16:04:21 -08:00
Xin Long 0c06d166ea lwtunnel: ignore any TUNNEL_OPTIONS_PRESENT flags set by users
TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) flags should be set only according to
tb[LWTUNNEL_IP_OPTS], which is done in ip_tun_parse_opts().

When setting info key.tun_flags, the TUNNEL_OPTIONS_PRESENT
bits in tb[LWTUNNEL_IP(6)_FLAGS] passed from users should
be ignored.

While at it, replace all (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) with 'TUNNEL_OPTIONS_PRESENT'.

Fixes: 3093fbe7ff ("route: Per route IP tunnel metadata via lightweight tunnel")
Fixes: 32a2b002ce ("ipv6: route: per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11 14:43:02 -08:00
Xin Long 58e8494eb0 lwtunnel: get nlsize for erspan options properly
erspan v1 has OPT_ERSPAN_INDEX while erspan v2 has OPT_ERSPAN_DIR and
OPT_ERSPAN_HWID attributes, and they require different nlsize when
dumping.

So this patch is to get nlsize for erspan options properly according
to erspan version.

Fixes: b0a21810bd ("lwtunnel: add options setting and dumping for erspan")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11 14:42:38 -08:00
Xin Long ed02551f58 lwtunnel: change to use nla_parse_nested on new options
As the new options added in kernel, all should always use strict
parsing from the beginning with nla_parse_nested(), instead of
nla_parse_nested_deprecated().

Fixes: b0a21810bd ("lwtunnel: add options setting and dumping for erspan")
Fixes: edf31cbb15 ("lwtunnel: add options setting and dumping for vxlan")
Fixes: 4ece477870 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11 14:42:14 -08:00
Michael Guralnik 6c7295e13f devlink: Add new "enable_roce" generic device param
New device parameter to enable/disable handling of RoCE traffic in the
device.

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-11-11 12:15:29 -08:00
Greg Kroah-Hartman 01b59c763f Merge 5.4-rc7 into char-misc-next
We need the char/misc driver fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-11 06:24:30 +01:00
Jiri Pirko 5a508a254b devlink: disallow reload operation during device cleanup
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:

while true; do
        echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/bind
        devlink dev reload pci/0000:00:10.0 &
        echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/unbind
done

Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 19:38:36 -08:00
David S. Miller 14684b9301 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 11:04:37 -08:00
Geert Uytterhoeven 0628cda318 mac80211: Use debugfs_create_xul() helper
Use the new debugfs_create_xul() helper instead of open-coding the same
operation.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20191025094130.26033-3-geert+renesas@glider.be
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-09 11:52:56 +01:00
Xin Long d467ac0a38 sctp: add SCTP_PEER_ADDR_THLDS_V2 sockopt
Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld'
added to allow a user to change ps_retrans per sock/asoc/transport, as
other 2 paddrthlds: pf_retrans, pathmaxrxt.

Note: to not break the user's program, here to support pf_retrans dump
and setting by adding a new sockopt SCTP_PEER_ADDR_THLDS_V2, and a new
structure sctp_paddrthlds_v2 instead of extending sctp_paddrthlds.

Also, when setting ps_retrans, the value is not allowed to be greater
than pf_retrans.

v1->v2:
  - use SCTP_PEER_ADDR_THLDS_V2 to set/get pf_retrans instead,
    as Marcelo and David Laight suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Xin Long 34515e94c9 sctp: add support for Primary Path Switchover
This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:

  Primary.Switchover.Max.Retrans (PSMR)

The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".

This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.

The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.

Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.

v3->v4:
  - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
    sysctl 'ps_retrans'.
  - add a new entry for ps_retrans on ip-sysctl.txt.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Xin Long 8d2a6935d8 sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockopt
This is a sockopt defined in section 7.3 of rfc7829: "Exposing
the Potentially Failed Path State", by which users can change
pf_expose per sock and asoc.

The new sockopt SCTP_EXPOSE_POTENTIALLY_FAILED_STATE is also
known as SCTP_EXPOSE_PF_STATE for short.

v2->v3:
  - return -EINVAL if params.assoc_value > SCTP_PF_EXPOSE_MAX.
  - define SCTP_EXPOSE_PF_STATE SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.
v3->v4:
  - improve changelog.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Xin Long 768e15182d sctp: add SCTP_ADDR_POTENTIALLY_FAILED notification
SCTP Quick failover draft section 5.1, point 5 has been removed
from rfc7829. Instead, "the sender SHOULD (i) notify the Upper
Layer Protocol (ULP) about this state transition", as said in
section 3.2, point 8.

So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined
in section 7.1, "which is reported if the affected address
becomes PF". Also remove transport cwnd's update when moving
from PF back to ACTIVE , which is no longer in rfc7829 either.

Note that ulp_notify will be set to false if asoc->expose is
not 'enabled', according to last patch.

v2->v3:
  - define SCTP_ADDR_PF SCTP_ADDR_POTENTIALLY_FAILED.
v3->v4:
  - initialize spc_state with SCTP_ADDR_AVAILABLE, as Marcelo suggested.
  - check asoc->pf_expose in sctp_assoc_control_transport(), as Marcelo
    suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Xin Long aef587be42 sctp: add pf_expose per netns and sock and asoc
As said in rfc7829, section 3, point 12:

  The SCTP stack SHOULD expose the PF state of its destination
  addresses to the ULP as well as provide the means to notify the
  ULP of state transitions of its destination addresses from
  active to PF, and vice versa.  However, it is recommended that
  an SCTP stack implementing SCTP-PF also allows for the ULP to be
  kept ignorant of the PF state of its destinations and the
  associated state transitions, thus allowing for retention of the
  simpler state transition model of [RFC4960] in the ULP.

Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.

So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.

It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.

Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.

v1->v2:
  - Fix a build warning noticed by Nathan Chancellor.
v2->v3:
  - set pf_expose to UNUSED by default to keep compatible with old
    applications.
v3->v4:
  - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
  - change this patch to 1/5, and move sctp_assoc_control_transport
    change into 2/5, as Marcelo suggested.
  - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
    set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Jiri Pirko a0c76345e3 devlink: disallow reload operation during device cleanup
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:

while true; do
        echo 10 > /sys/bus/netdevsim/new_device
        devlink dev reload netdevsim/netdevsim10 &
        echo 10 > /sys/bus/netdevsim/del_device
done

Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:11:47 -08:00
Eric Dumazet b756ad928d packet: fix data-race in fanout_flow_is_huge()
KCSAN reported the following data-race [1]

Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.

Since the report hinted about multiple cpus using the history
concurrently, I added a test avoiding writing on it if the
victim slot already contains the desired value.

[1]

BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover

read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1:
 fanout_flow_is_huge net/packet/af_packet.c:1303 [inline]
 fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353
 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
 deliver_skb net/core/dev.c:1888 [inline]
 dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
 xmit_one net/core/dev.c:3195 [inline]
 dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0:
 fanout_flow_is_huge net/packet/af_packet.c:1306 [inline]
 fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353
 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453
 deliver_skb net/core/dev.c:1888 [inline]
 dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958
 xmit_one net/core/dev.c:3195 [inline]
 dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215
 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 3b3a5b0aab ("packet: rollover huge flows before small flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:06:19 -08:00
Tuong Lien e1f32190cf tipc: add support for AEAD key setting via netlink
This commit adds two netlink commands to TIPC in order for user to be
able to set or remove AEAD keys:
- TIPC_NL_KEY_SET
- TIPC_NL_KEY_FLUSH

When the 'KEY_SET' is given along with the key data, the key will be
initiated and attached to TIPC crypto. On the other hand, the
'KEY_FLUSH' command will remove all existing keys if any.

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:01:59 -08:00
Tuong Lien fc1b6d6de2 tipc: introduce TIPC encryption & authentication
This commit offers an option to encrypt and authenticate all messaging,
including the neighbor discovery messages. The currently most advanced
algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
encryption/decryption is done at the bearer layer, just before leaving
or after entering TIPC.

Supported features:
- Encryption & authentication of all TIPC messages (header + data);
- Two symmetric-key modes: Cluster and Per-node;
- Automatic key switching;
- Key-expired revoking (sequence number wrapped);
- Lock-free encryption/decryption (RCU);
- Asynchronous crypto, Intel AES-NI supported;
- Multiple cipher transforms;
- Logs & statistics;

Two key modes:
- Cluster key mode: One single key is used for both TX & RX in all
nodes in the cluster.
- Per-node key mode: Each nodes in the cluster has one specific TX key.
For RX, a node requires its peers' TX key to be able to decrypt the
messages from those peers.

Key setting from user-space is performed via netlink by a user program
(e.g. the iproute2 'tipc' tool).

Internal key state machine:

                                 Attach    Align(RX)
                                     +-+   +-+
                                     | V   | V
        +---------+      Attach     +---------+
        |  IDLE   |---------------->| PENDING |(user = 0)
        +---------+                 +---------+
           A   A                   Switch|  A
           |   |                         |  |
           |   | Free(switch/revoked)    |  |
     (Free)|   +----------------------+  |  |Timeout
           |              (TX)        |  |  |(RX)
           |                          |  |  |
           |                          |  v  |
        +---------+      Switch     +---------+
        | PASSIVE |<----------------| ACTIVE  |
        +---------+       (RX)      +---------+
        (user = 1)                  (user >= 1)

The number of TFMs is 10 by default and can be changed via the procfs
'net/tipc/max_tfms'. At this moment, as for simplicity, this file is
also used to print the crypto statistics at runtime:

echo 0xfff1 > /proc/sys/net/tipc/max_tfms

The patch defines a new TIPC version (v7) for the encryption message (-
backward compatibility as well). The message is basically encapsulated
as follows:

   +----------------------------------------------------------+
   | TIPCv7 encryption  | Original TIPCv2    | Authentication |
   | header             | packet (encrypted) | Tag            |
   +----------------------------------------------------------+

The throughput is about ~40% for small messages (compared with non-
encryption) and ~9% for large messages. With the support from hardware
crypto i.e. the Intel AES-NI CPU instructions, the throughput increases
upto ~85% for small messages and ~55% for large messages.

By default, the new feature is inactive (i.e. no encryption) until user
sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO"
in the kernel configuration to enable/disable the new code when needed.

MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:01:59 -08:00
Tuong Lien 4cbf8ac2fe tipc: enable creating a "preliminary" node
When user sets RX key for a peer not existing on the own node, a new
node entry is needed to which the RX key will be attached. However,
since the peer node address (& capabilities) is unknown at that moment,
only the node-ID is provided, this commit allows the creation of a node
with only the data that we call as “preliminary”.

A preliminary node is not the object of the “tipc_node_find()” but the
“tipc_node_find_by_id()”. Once the first message i.e. LINK_CONFIG comes
from that peer, and is successfully decrypted by the own node, the
actual peer node data will be properly updated and the node will
function as usual.

In addition, the node timer always starts when a node object is created
so if a preliminary node is not used, it will be cleaned up.

The later encryption functions will also use the node timer and be able
to create a preliminary node automatically when needed.

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:01:59 -08:00
Tuong Lien 2a7ee696f7 tipc: add reference counter to bearer
As a need to support the crypto asynchronous operations in the later
commits, apart from the current RCU mechanism for bearer pointer, we
add a 'refcnt' to the bearer object as well.

So, a bearer can be hold via 'tipc_bearer_hold()' without being freed
even though the bearer or interface can be disabled in the meanwhile.
If that happens, the bearer will be released then when the crypto
operation is completed and 'tipc_bearer_put()' is called.

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:01:59 -08:00
Eric Dumazet bbab7ef235 net: icmp: fix data-race in cmp_global_allow()
This code reads two global variables without protection
of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to
avoid load/store-tearing and better document the intent.

KCSAN reported :
BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow

read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0:
 icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254
 icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
 icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
 icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
 dst_link_failure include/net/dst.h:419 [inline]
 vti_xmit net/ipv4/ip_vti.c:243 [inline]
 vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
 __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
 netdev_start_xmit include/linux/netdevice.h:4434 [inline]
 xmit_one net/core/dev.c:3280 [inline]
 dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179

write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1:
 icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272
 icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
 icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
 icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
 dst_link_failure include/net/dst.h:419 [inline]
 vti_xmit net/ipv4/ip_vti.c:243 [inline]
 vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
 __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
 netdev_start_xmit include/linux/netdevice.h:4434 [inline]
 xmit_one net/core/dev.c:3280 [inline]
 dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 12:32:43 -08:00
Eric Dumazet 90b2be27bb net/sched: annotate lockless accesses to qdisc->empty
KCSAN reported the following race [1]

BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action

read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1:
 __dev_xmit_skb net/core/dev.c:3389 [inline]
 __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0:
 qdisc_run_begin include/net/sch_generic.h:160 [inline]
 qdisc_run include/net/pkt_sched.h:120 [inline]
 net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
 ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: d518d2ed86 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 12:21:55 -08:00
Stefano Garzarella ad8a722035 vsock/virtio: fix sock refcnt holding during the shutdown
The "42f5cda5eaf4" commit rightly set SOCK_DONE on peer shutdown,
but there is an issue if we receive the SHUTDOWN(RDWR) while the
virtio_transport_close_timeout() is scheduled.
In this case, when the timeout fires, the SOCK_DONE is already
set and the virtio_transport_close_timeout() will not call
virtio_transport_reset() and virtio_transport_do_close().
This causes that both sockets remain open and will never be released,
preventing the unloading of [virtio|vhost]_transport modules.

This patch fixes this issue, calling virtio_transport_reset() and
virtio_transport_do_close() when we receive the SHUTDOWN(RDWR)
and there is nothing left to read.

Fixes: 42f5cda5ea ("vsock/virtio: set SOCK_DONE on peer shutdown")
Cc: Stephen Barber <smbarber@chromium.org>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 12:17:50 -08:00
David S. Miller 5bd2ce6aa5 Some relatively small changes:
* typo fixes in docs
  * APIs for station separation using VLAN tags rather
    than separate wifi netdevs
  * some preparations for upcoming features (802.3 offload
    and airtime queue limits (AQL)
  * stack reduction in ieee80211_assoc_success()
  * use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl3FYNEACgkQB8qZga/f
 l8TQig//aZIXq2qlrci+Tzydi5clhrog8mzGBOECRUrgAFUZB9Q1qSD0IvqQrU8o
 3xY+9lDxYvaOYKAm83oyk2st8FIgGXwPmrfPMEodFlFWla4Dkev/n8+CdbVgJDGI
 9j0+9tPsQFwvIIOwD1OUKzJyRUIwmJ7+f8DRhMZ8LlkHZwSx217yD1RGq8zbyXUf
 tIndHM2yAY0H9ip1GPaTiGMyWN9wbpcllNq6h126OJZ0kGRLbY9X9ySoMIw8zSPF
 ul1qofvhcdN8BEOqCd97JIsDF6d3fG/LZyNm+UM415C/Se9bSG1O9i9AakGkk2SW
 zqSjkhqeBK5DUfMqGlh6uxx+1CH46vTZhBKuYIe2uWrdia14d88qkQE88R4cRcir
 dTJHUP90XOXXh3DrDSgEsfb341E5/IMVgUIIu6U72efNqZ8eMCbsXnp8Bnpervtw
 sBPYC4aTzU5wBWbl8XDgm0KWx+nh06Ah5+PXW7gzkKZQpW8qOJp4ATcDkqHrMPBf
 OFd8zhYKULBVRH8dsshzXdeVOurNwazJnEN/h72U1eZTa7ZxFL+CMShpY20d4t9I
 ldh6EG//b/D7WjBIpbs3VBBQ6fnblwtq41GUS0TgndoDUD1wmBf3mJL9ZvEsYHwC
 OmsTaAcsAkHBPdOZDpSyyGYddxXp7YNCT6ua3lNdRAbVn5Y0A/8=
 =gDUI
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2019-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Some relatively small changes:
 * typo fixes in docs
 * APIs for station separation using VLAN tags rather
   than separate wifi netdevs
 * some preparations for upcoming features (802.3 offload
   and airtime queue limits (AQL)
 * stack reduction in ieee80211_assoc_success()
 * use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 11:36:48 -08:00
Gurumoorthi Gnanasambandhan 14f34e36b3 cfg80211: VLAN offload support for set_key and set_sta_vlan
This provides an alternative mechanism for AP VLAN support where a
single netdev is used with VLAN tagged frames instead of separate
netdevs for each VLAN without tagged frames from the WLAN driver.

By setting NL80211_EXT_FEATURE_VLAN_OFFLOAD flag the driver indicates
support for a single netdev with VLAN tagged frames. Separate
VLAN-specific netdevs can be added using RTM_NEWLINK/IFLA_VLAN_ID
similarly to Ethernet. NL80211_CMD_NEW_KEY (for group keys),
NL80211_CMD_NEW_STATION, and NL80211_CMD_SET_STATION will optionally
specify vlan_id using NL80211_ATTR_VLAN_ID.

Signed-off-by: Gurumoorthi Gnanasambandhan <gguru@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20191031214640.5012-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 11:19:19 +01:00
Toke Høiland-Jørgensen 6912daed05 mac80211: Shrink the size of ack_frame_id to make room for tx_time_est
To implement airtime queue limiting, we need to keep a running account of
the estimated airtime of all skbs queued into the device. Do to this
correctly, we need to store the airtime estimate into the skb so we can
decrease the outstanding balance when the skb is freed. This means that the
time estimate must be stored somewhere that will survive for the lifetime
of the skb.

To get this, decrease the size of the ack_frame_id field to 6 bits, and
lower the size of the ID space accordingly. This leaves 10 bits for use for
tx_time_est, which is enough to store a maximum of 4096 us, if we shift the
values so they become units of 4us.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/157182474063.150713.16132669599100802716.stgit@toke.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 10:19:19 +01:00
Johannes Berg f61d7884ce mac80211: don't re-parse elems in ieee80211_assoc_success()
We've already parsed the same data in the caller, so we can
pass it. The only thing is that we might fill in more details
in ieee80211_assoc_success(), but that doesn't bother the
caller, so it's fine to do even when we share the parsed data.

This reduces the stack space usage of the call stack here,
Arnd reported it had grown above the 1024 byte warning limit.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20191028125240.cb7661671bd2.I757c8752bf4f2f35e54f5e0a2c0a9cd9216c3d8b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 10:18:31 +01:00
John Crispin 5d8983c8c3 mac80211: move store skb ack code to its own function
This patch moves the code handling SKBTX_WIFI_STATUS inside the TX path
into an extra function. This allows us to reuse it inside the 802.11 encap
offloading datapath.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191029091304.7330-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 10:18:13 +01:00
Ahmed Zaki 285531f9e6 mac80211: fix station inactive_time shortly after boot
In the first 5 minutes after boot (time of INITIAL_JIFFIES),
ieee80211_sta_last_active() returns zero if last_ack is zero. This
leads to "inactive time" showing jiffies_to_msecs(jiffies).

 # iw wlan0 station get fc:ec:da:64:a6:dd
 Station fc:ec:da:64:a6:dd (on wlan0)
	inactive time:	4294894049 ms
	.
	.
	connected time:	70 seconds

Fix by returning last_rx if last_ack == 0.

Signed-off-by: Ahmed Zaki <anzaki@gmail.com>
Link: https://lore.kernel.org/r/20191031121243.27694-1-anzaki@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 09:17:28 +01:00
Johannes Berg 6dd47d9754 mac80211: fix ieee80211_txq_setup_flows() failure path
If ieee80211_txq_setup_flows() fails, we don't clean up LED
state properly, leading to crashes later on, fix that.

Fixes: dc8b274f09 ("mac80211: Move up init of TXQs")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20191105154110.1ccf7112ba5d.I0ba865792446d051867b33153be65ce6b063d98c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 09:11:33 +01:00
Hoang Le d408bef4bf tipc: eliminate checking netns if node established
Currently, we scan over all network namespaces at each received
discovery message in order to check if the sending peer might be
present in a host local namespaces.

This is unnecessary since we can assume that a peer will not change its
location during an established session.

We now improve the condition for this testing so that we don't perform
any redundant scans.

Fixes: f73b12812a ("tipc: improve throughput between nodes in netns")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 20:08:53 -08:00
Eric Dumazet c305c6ae79 net: add annotations on hh->hh_len lockless accesses
KCSAN reported a data-race [1]

While we can use READ_ONCE() on the read sides,
we need to make sure hh->hh_len is written last.

[1]

BUG: KCSAN: data-race in eth_header_cache / neigh_resolve_output

write to 0xffff8880b9dedcb8 of 4 bytes by task 29760 on cpu 0:
 eth_header_cache+0xa9/0xd0 net/ethernet/eth.c:247
 neigh_hh_init net/core/neighbour.c:1463 [inline]
 neigh_resolve_output net/core/neighbour.c:1480 [inline]
 neigh_resolve_output+0x415/0x470 net/core/neighbour.c:1470
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffff8880b9dedcb8 of 4 bytes by task 29572 on cpu 1:
 neigh_resolve_output net/core/neighbour.c:1479 [inline]
 neigh_resolve_output+0x113/0x470 net/core/neighbour.c:1470
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 29572 Comm: kworker/1:4 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rt6_probe_deferred

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 20:07:30 -08:00
Amit Cohen 3b063ae57b devlink: Add layer 3 generic packet exception traps
Add layer 3 generic packet exception traps that can report trapped
packets and documentation of the traps.

Unlike drop traps, these exception traps also need to inject the packet
to the kernel's receive path. For example, a packet that was trapped due
to unreachable neighbour need to be injected into the kernel so that it
will trigger an ARP request or a neighbour solicitation message.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 19:51:40 -08:00
Amit Cohen 6896cc4d8f devlink: Add layer 3 generic packet traps
Add packet traps that can report packets that were dropped during layer
3 forwarding.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 19:51:40 -08:00
Eric Dumazet 200ecef67b tcp: Remove one extra ktime_get_ns() from cookie_init_timestamp
tcp_make_synack() already uses tcp_clock_ns(), and can pass
the value to cookie_init_timestamp() to avoid another call
to ktime_get_ns() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 16:17:52 -08:00
Eric Dumazet 71685eb4ce inetpeer: fix data-race in inet_putpeer / inet_putpeer
We need to explicitely forbid read/store tearing in inet_peer_gc()
and inet_putpeer().

The following syzbot report reminds us about inet_putpeer()
running without a lock held.

BUG: KCSAN: data-race in inet_putpeer / inet_putpeer

write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 0:
 inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0xbb/0xe0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x1af/0x280 kernel/sched/idle.c:263

write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 1:
 inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
 ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
 inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
 rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
 rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 4b9d9be839 ("inetpeer: remove unused list")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 16:15:56 -08:00
David Ahern e0a312629f ipv4: Fix table id reference in fib_sync_down_addr
Hendrik reported routes in the main table using source address are not
removed when the address is removed. The problem is that fib_sync_down_addr
does not account for devices in the default VRF which are associated
with the main table. Fix by updating the table id reference.

Fixes: 5a56a0b3a4 ("net: Don't delete routes in different VRFs")
Reported-by: Hendrik Donner <hd@os-cillation.de>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 16:14:36 -08:00
Eric Dumazet 1bef4c223b ipv6: fixes rt6_probe() and fib6_nh->last_probe init
While looking at a syzbot KCSAN report [1], I found multiple
issues in this code :

1) fib6_nh->last_probe has an initial value of 0.

   While probably okay on 64bit kernels, this causes an issue
   on 32bit kernels since the time_after(jiffies, 0 + interval)
   might be false ~24 days after boot (for HZ=1000)

2) The data-race found by KCSAN
   I could use READ_ONCE() and WRITE_ONCE(), but we also can
   take the opportunity of not piling-up too many rt6_probe_deferred()
   works by using instead cmpxchg() so that only one cpu wins the race.

[1]
BUG: KCSAN: data-race in find_match / find_match

write to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 1:
 rt6_probe net/ipv6/route.c:663 [inline]
 find_match net/ipv6/route.c:757 [inline]
 find_match+0x5bd/0x790 net/ipv6/route.c:733
 __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
 find_rr_leaf net/ipv6/route.c:852 [inline]
 rt6_select net/ipv6/route.c:896 [inline]
 fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
 ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
 ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
 fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
 ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
 ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
 ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
 ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
 inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
 inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
 __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
 tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
 tcp_xmit_probe_skb+0x19b/0x1d0 net/ipv4/tcp_output.c:3735

read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
 rt6_probe net/ipv6/route.c:657 [inline]
 find_match net/ipv6/route.c:757 [inline]
 find_match+0x521/0x790 net/ipv6/route.c:733
 __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
 find_rr_leaf net/ipv6/route.c:852 [inline]
 rt6_select net/ipv6/route.c:896 [inline]
 fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
 ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
 ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
 fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
 ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
 ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
 ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
 ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
 inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
 inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
 __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: cc3a86c802 ("ipv6: Change rt6_probe to take a fib6_nh")
Fixes: f547fac624 ("ipv6: rate-limit probes for neighbourless routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 16:13:13 -08:00
Pan Bian 025ec40b81 nfc: netlink: fix double device reference drop
The function nfc_put_device(dev) is called twice to drop the reference
to dev when there is no associated local llcp. Remove one of them to fix
the bug.

Fixes: 52feb444a9 ("NFC: Extend netlink interface for LTO, RW, and MIUX parameters support")
Fixes: d9b8d8e19b ("NFC: llcp: Service Name Lookup netlink interface")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 15:23:19 -08:00
Steffen Klassert 86c6739eda xfrm: Fix memleak on xfrm state destroy
We leak the page that we use to create skb page fragments
when destroying the xfrm_state. Fix this by dropping a
page reference if a page was assigned to the xfrm_state.

Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: JD <jdtxs00@gmail.com>
Reported-by: Paul Wouters <paul@nohats.ca>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-11-07 10:38:07 +01:00
Eric Dumazet 1c8dd9cb46 net_sched: gen_estimator: extend packet counter to 64bit
I forgot to change last_packets field in struct net_rate_estimator.

Without this fix, rate estimators would misbehave after more
than 2^32 packets have been sent.

Another solution would be to be careful and only use the
32 least significant bits of packets counters, but we have
a hole in net_rate_estimator structure and this looks
easier to read/maintain.

Fixes: d0083d98f6 ("net_sched: extend packet counter to 64bit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:51:36 -08:00
Eric Dumazet a5a7daa52e tcp: fix data-race in tcp_recvmsg()
Reading tp->recvmsg_inq after socket lock is released
raises a KCSAN warning [1]

Replace has_tss & has_cmsg by cmsg_flags and make
sure to not read tp->recvmsg_inq a second time.

[1]
BUG: KCSAN: data-race in tcp_chrono_stop / tcp_recvmsg

write to 0xffff888126adef24 of 2 bytes by interrupt on cpu 0:
 tcp_chrono_set net/ipv4/tcp_output.c:2309 [inline]
 tcp_chrono_stop+0x14c/0x280 net/ipv4/tcp_output.c:2338
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3165 [inline]
 tcp_ack+0x274f/0x3170 net/ipv4/tcp_input.c:3688
 tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
 tcp_v4_rcv+0x19dc/0x1bb0 net/ipv4/tcp_ipv4.c:1942
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5214
 napi_skb_finish net/core/dev.c:5677 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5710

read to 0xffff888126adef25 of 1 bytes by task 7275 on cpu 1:
 tcp_recvmsg+0x77b/0x1a30 net/ipv4/tcp.c:2187
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1889 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414
 __vfs_read+0xb1/0xc0 fs/read_write.c:427
 vfs_read fs/read_write.c:461 [inline]
 vfs_read+0x143/0x2c0 fs/read_write.c:446
 ksys_read+0xd5/0x1b0 fs/read_write.c:587
 __do_sys_read fs/read_write.c:597 [inline]
 __se_sys_read fs/read_write.c:595 [inline]
 __x64_sys_read+0x4c/0x60 fs/read_write.c:595
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7275 Comm: sshd Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: b75eba76d3 ("tcp: send in-queue bytes in cmsg upon read")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:38:34 -08:00
Eric Dumazet 9ed498c628 net: silence data-races on sk_backlog.tail
sk->sk_backlog.tail might be read without holding the socket spinlock,
we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings.

KCSAN reported :

BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg

write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1:
 __sk_add_backlog include/net/sock.h:907 [inline]
 sk_add_backlog include/net/sock.h:938 [inline]
 tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759
 tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133
 napi_skb_finish net/core/dev.c:5596 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5629
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
 virtnet_receive drivers/net/virtio_net.c:1323 [inline]
 virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
 napi_poll net/core/dev.c:6311 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6379
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0xbb/0xe0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263
 ret_from_intr+0x0/0x19
 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x1af/0x280 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
 start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241

read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0:
 tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1889 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414
 __vfs_read+0xb1/0xc0 fs/read_write.c:427
 vfs_read fs/read_write.c:461 [inline]
 vfs_read+0x143/0x2c0 fs/read_write.c:446
 ksys_read+0xd5/0x1b0 fs/read_write.c:587
 __do_sys_read fs/read_write.c:597 [inline]
 __se_sys_read fs/read_write.c:595 [inline]
 __x64_sys_read+0x4c/0x60 fs/read_write.c:595
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:35:34 -08:00
Tonghao Zhang 90ce9f23a8 net: openvswitch: select vport upcall portid directly
The commit 69c51582ff786 ("dpif-netlink: don't allocate per
thread netlink sockets"), in Open vSwitch ovs-vswitchd, has
changed the number of allocated sockets to just one per port
by moving the socket array from a per handler structure to
a per datapath one. In the kernel datapath, a vport will have
only one socket in most case, if so select it directly in
fast-path.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:25:19 -08:00
David S. Miller 53ba60afb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing register size validation in bitwise and cmp offloads.

2) Fix error code in ip_set_sockfn_get() when copy_to_user() fails,
   from Dan Carpenter.

3) Oneliner to copy MAC address in IPv6 hash:ip,mac sets, from
   Stefano Brivio.

4) Missing policy validation in ipset with NL_VALIDATE_STRICT,
   from Jozsef Kadlecsik.

5) Fix unaligned access to private data area of nf_tables instructions,
   from Lukas Wunner.

6) Relax check for object updates, reported as a regression by
   Eric Garver, patch from Fernando Fernandez Mancera.

7) Crash on ebtables dnat extension when used from the output path.
   From Florian Westphal.

8) Fix bogus EOPNOTSUPP when updating basechain flags.

9) Fix bogus EBUSY when updating a basechain that is already offloaded.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:16:55 -08:00
Tuong Lien d0d605c5e1 tipc: eliminate the dummy packet in link synching
When preparing tunnel packets for the link failover or synchronization,
as for the safe algorithm, we added a dummy packet on the pair link but
never sent it out. In the case of failover, the pair link will be reset
anyway. But for link synching, it will always result in retransmission
of the dummy packet after that.
We have also observed that such the retransmission at the early stage
when a new node comes in a large cluster will take some time and hard
to be done, leading to the repeated retransmit failures and the link is
reset.

Since in commit 4929a932be ("tipc: optimize link synching mechanism")
we have already built a dummy 'TUNNEL_PROTOCOL' message on the new link
for the synchronization, there's no need for the dummy on the pair one,
this commit will skip it when the new mechanism takes in place. In case
nothing exists in the pair link's transmq, the link synching will just
start and stop shortly on the peer side.

The patch is backward compatible.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:16:02 -08:00
Xin Long b0a21810bd lwtunnel: add options setting and dumping for erspan
Based on the code framework built on the last patch, to
support setting and dumping for vxlan, we only need to
add ip_tun_parse_opts_erspan() for .build_state and
ip_tun_fill_encap_opts_erspan() for .fill_encap and
if (tun_flags & TUNNEL_ERSPAN_OPT) for .get_encap_size.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:14:22 -08:00
Xin Long edf31cbb15 lwtunnel: add options setting and dumping for vxlan
Based on the code framework built on the last patch, to
support setting and dumping for vxlan, we only need to
add ip_tun_parse_opts_vxlan() for .build_state and
ip_tun_fill_encap_opts_vxlan() for .fill_encap and
if (tun_flags & TUNNEL_VXLAN_OPT) for .get_encap_size.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:14:21 -08:00
Xin Long 4ece477870 lwtunnel: add options setting and dumping for geneve
To add options setting and dumping, .build_state(), .fill_encap() and
.get_encap_size() in ip_tun_lwt_ops needs to be extended:

ip_tun_build_state():
  ip_tun_parse_opts():
    ip_tun_parse_opts_geneve()

ip_tun_fill_encap_info():
  ip_tun_fill_encap_opts():
    ip_tun_fill_encap_opts_geneve()

ip_tun_encap_nlsize()
   ip_tun_opts_nlsize():
     if (tun_flags & TUNNEL_GENEVE_OPT)

ip_tun_parse_opts(), ip_tun_fill_encap_opts() and ip_tun_opts_nlsize()
processes LWTUNNEL_IP_OPTS.

ip_tun_parse_opts_geneve(), ip_tun_fill_encap_opts_geneve() and
if (tun_flags & TUNNEL_GENEVE_OPT) processes LWTUNNEL_IP_OPTS_GENEVE.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:14:21 -08:00
Xin Long 0eb8eb2f96 lwtunnel: add options process for cmp_encap
When comparing two tun_info, dst_cache member should have been skipped,
as dst_cache is a per cpu pointer and they are always different values
even in two tun_info with the same keys.

So this patch is to skip dst_cache member and compare the key, mode and
options_len only. For the future opts setting support, also to compare
options.

Fixes: 2d79849903 ("lwtunnel: ip tunnel: fix multiple routes with different encap")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:14:21 -08:00
Xin Long f52f11ec8a lwtunnel: add options process for arp request
Without options copied to the dst tun_info in iptunnel_metadata_reply()
called by arp_process for handling arp_request, the generated arp_reply
packet may be dropped or sent out with wrong options for some tunnels
like erspan and vxlan, and the traffic will break.

Fixes: 63d008a4e9 ("ipv4: send arp replies to the correct tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:14:21 -08:00
Ursula Braun 98f3375505 net/smc: fix ethernet interface refcounting
If a pnet table entry is to be added mentioning a valid ethernet
interface, but an invalid infiniband or ISM device, the dev_put()
operation for the ethernet interface is called twice, resulting
in a negative refcount for the ethernet interface, which disables
removal of such a network interface.

This patch removes one of the dev_put() calls.

Fixes: 890a2cb4a9 ("net/smc: rework pnet table")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:44:32 -08:00
Hoang Le 426071f1f3 tipc: reduce sensitive to retransmit failures
With huge cluster (e.g >200nodes), the amount of that flow:
gap -> retransmit packet -> acked will take time in case of STATE_MSG
dropped/delayed because a lot of traffic. This lead to 1.5 sec tolerance
value criteria made link easy failure around 2nd, 3rd of failed
retransmission attempts.

Instead of re-introduced criteria of 99 faled retransmissions to fix the
issue, we increase failure detection timer to ten times tolerance value.

Fixes: 77cf8edbc0 ("tipc: simplify stale link failure criteria")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:37:43 -08:00
Hoang Le 6708ef7792 tipc: update cluster capabilities if node deleted
There are two improvements when re-calculate cluster capabilities:

- When deleting a specific down node, need to re-calculate.
- In tipc_node_cleanup(), do not need to re-calculate if node
is still existing in cluster.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:37:43 -08:00
Jakub Kicinski 79ffe6087e net/tls: add a TX lock
TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.

TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.

This has two problems:
 - writers don't wake other threads up when they leave the kernel;
   meaning that this scheme works for single extra thread (second
   application thread or delayed work) because memory becoming
   available will send a wake up request, but as Mallesham and
   Pooja report with larger number of threads it leads to threads
   being put to sleep indefinitely;
 - the delayed work does not get _scheduled_ but it may _run_ when
   other writers are present leading to crashes as writers don't
   expect state to change under their feet (same records get pushed
   and freed multiple times); it's hard to reliably bail from the
   work, however, because the mere presence of a writer does not
   guarantee that the writer will push pending records before exiting.

Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.

The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Mallesham  Jatharakonda <mallesh537@gmail.com>
Reported-by: Pooja Trivedi <poojatrivedi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:33:32 -08:00
Jakub Kicinski 02b1fa07bb net/tls: don't pay attention to sk_write_pending when pushing partial records
sk_write_pending being not zero does not guarantee that partial
record will be pushed. If the thread waiting for memory times out
the pending record may get stuck.

In case of tls_device there is no path where parial record is
set and writer present in the first place. Partial record is
set only in tls_push_sg() and tls_push_sg() will return an
error immediately. All tls_device callers of tls_push_sg()
will return (and not wait for memory) if it failed.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:33:32 -08:00
Eric Dumazet 099ecf59f0 net: annotate lockless accesses to sk->sk_max_ack_backlog
sk->sk_max_ack_backlog can be read without any lock being held
at least in TCP/DCCP cases.

We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet 288efe8606 net: annotate lockless accesses to sk->sk_ack_backlog
sk->sk_ack_backlog can be read without any lock being held.
We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet 7976a11b30 net: use helpers to change sk_ack_backlog
Writers are holding a lock, but many readers do not.

Following patch will add appropriate barriers in
sk_acceptq_removed() and sk_acceptq_added().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet 3828a93f5c inet_diag: use jiffies_delta_to_msecs()
Use jiffies_delta_to_msecs() to avoid reporting 'infinite'
timeouts and to cleanup code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet 9d027e3a83 net: neigh: use long type to store jiffies delta
A difference of two unsigned long needs long storage.

Fixes: c7fb64db00 ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
David S. Miller d673f56358 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Simplify batadv_v_ogm_aggr_list_free using skb_queue_purge,
    by Christophe Jaillet
 
  - Replace aggr_list_lock with lock free skb handlers,
    by Christophe Jaillet
 
  - explicitly mark fallthrough cases, by Sven Eckelmann
 
  - Drop lockdep.h include from soft-interface.c, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl3BQgwWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoYaMD/93QapwlEvS+C/BQXk7B8nmrOUM
 a/Q2E/Pgp4VX8EjJ031urkWNsqtOOdgi3fx0iUassTDU8ei0bK6vkPN+XYrNoAPO
 G3Y4kBcyh2gViKC8ClNM2eUkrOpMJD31n7QiVmCPgyxwcK3pNoxvHSLVPgXx65BQ
 kE+ZEKwfQTe2wATzzYZQbFl8tSj245euBjGKqrV8uVhXIu1EvNsyHKDWZWKZs1v/
 igMDKssIn9JBO/AHqxcVqykps6uR8SD2ZPa3n72nDcJcUz5SMgFXuDU2b/QouvUj
 bjlUxzGfD2V+J6ZcJyQmY3XNq4Ugs1xrqLWkvHlCeisyG3E98pWt1vF2fDl73jyd
 jjamv6tjdvePYH4xALcxRD5PW+INaIrOjIse3j7gezZPdnVTeUiwpgrixV/eQoeH
 Ku4RjFiE2eGTFpBHeWvDbtz6KU9f/p0RDlgxBr5ZU+cWvMR2Mqd7ZW6ICvx4UVB8
 I/1GwKMsjJcO2FlsVbDXrFvOGHBirzk2BWG8/hLnrQe+E8QAvUaA7pJoqs1YFnSj
 KnU2pVS1TV8PQTrxIe3CIwBL9uLPD9DjJ1wS7zdvswVQoZxXPof3Ebwe31v+xjsX
 GmZMeC2nfVxmPojPcXCKf40Ffa8D1PP9YoAQfbhcuEgkeo98uybcfGqBWB7pWQhu
 8t7HjEAS8VMx7btLfA==
 =h5/d
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20191105' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Simplify batadv_v_ogm_aggr_list_free using skb_queue_purge,
   by Christophe Jaillet

 - Replace aggr_list_lock with lock free skb handlers,
   by Christophe Jaillet

 - explicitly mark fallthrough cases, by Sven Eckelmann

 - Drop lockdep.h include from soft-interface.c, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:33:05 -08:00
Eric Dumazet b33e699fe4 net_sched: add TCA_STATS_PKT64 attribute
Now the kernel uses 64bit packet counters in scheduler layer,
we want to export these counters to user space.

Instead risking breaking user space by adding fields
to struct gnet_stats_basic, add a new TCA_STATS_PKT64.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:20:55 -08:00
Eric Dumazet d0083d98f6 net_sched: extend packet counter to 64bit
After this change, qdisc packet counter is no longer
a 32bit quantity. We still export 32bit values to user.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:20:55 -08:00
Andrew Lunn 5cd73fbd78 net: dsa: Add support for devlink resources
Add wrappers around the devlink resource API, so that DSA drivers can
register and unregister devlink resources.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:09:45 -08:00
Jakub Kicinski 683916f6a8 net/tls: fix sk_msg trim on fallback to copy mode
sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg->sg is a ring buffer.

This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.

Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.

This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.

Big thanks to Alexander Potapenko for the non-KMSAN reproducer.

v2:
 - take into account that msg->sg is a ring buffer (John).

Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1)

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com
Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:07:47 -08:00
Martin Varghese fbdcdd78da Change in Openvswitch to support MPLS label depth of 3 in ingress direction
The openvswitch was supporting a MPLS label depth of 1 in the ingress
direction though the userspace OVS supports a max depth of 3 labels.
This change enables openvswitch module to support a max depth of
3 labels in the ingress.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:02:29 -08:00
Christophe JAILLET 3b7ad08b51 vsock: Simplify '__vsock_release()'
Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:55:47 -08:00
Florian Fainelli c058f6dfeb net: dsa: Fix use after free in dsa_switch_remove()
The order in which the ports are deleted from the list and freed and the
call to dsa_switch_remove() is done is reversed, which leads to an
use after free condition. Reverse the two: first tear down the ports and
switch from the fabric, then free the ports associated with that switch
fabric.

Fixes: 05f294a852 ("net: dsa: allocate ports on touch")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:53:58 -08:00
John Hurley 59eb87cb52 net: sched: prevent duplicate flower rules from tcf_proto destroy race
When a new filter is added to cls_api, the function
tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
determine if the tcf_proto is duplicated in the chain's hashtable. It then
creates a new entry or continues with an existing one. In cls_flower, this
allows the function fl_ht_insert_unque to determine if a filter is a
duplicate and reject appropriately, meaning that the duplicate will not be
passed to drivers via the offload hooks. However, when a tcf_proto is
destroyed it is removed from its chain before a hardware remove hook is
hit. This can lead to a race whereby the driver has not received the
remove message but duplicate flows can be accepted. This, in turn, can
lead to the offload driver receiving incorrect duplicate flows and out of
order add/delete messages.

Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
hash table per block stores each unique chain/protocol/prio being
destroyed. This entry is only removed when the full destroy (and hardware
offload) has completed. If a new flow is being added with the same
identiers as a tc_proto being detroyed, then the add request is replayed
until the destroy is complete.

Fixes: 8b64678e0a ("net: sched: refactor tp insert/delete for concurrent execution")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reported-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:47:26 -08:00
Matteo Croce 54074f1dbd icmp: remove duplicate code
The same code which recognizes ICMP error packets is duplicated several
times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which
do the same thing.

ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types,
assume that they should instead.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 14:03:11 -08:00
Ivan Khoronzhuk 0763b3e81a taprio: fix panic while hw offload sched list swap
Don't swap oper and admin schedules too early, it's not correct and
causes crash.

Steps to reproduce:

1)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 1@2 \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 80000 \
    sched-entry S 02 15000 \
    sched-entry S 04 40000 \
    flags 2

2)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 90000 \
    sched-entry S 02 20000 \
    sched-entry S 04 40000 \
    flags 2

3)
tc qdisc replace dev eth0 parent root handle 100 taprio \
    base-time $SOME_BASE_TIME \
    sched-entry S 01 150000 \
    sched-entry S 02 200000 \
    sched-entry S 04 40000 \
    flags 2

Do 2 3 2 .. steps  more times if not happens and observe:

[  305.832319] Unable to handle kernel write to read-only memory at
virtual address ffff0000087ce7f0
[  305.910887] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
[  305.919306] Hardware name: Texas Instruments AM654 Base Board (DT)

[...]

[  306.017119] x1 : ffff800848031d88 x0 : ffff800848031d80
[  306.022422] Call trace:
[  306.024866]  taprio_free_sched_cb+0x4c/0x98
[  306.029040]  rcu_process_callbacks+0x25c/0x410
[  306.033476]  __do_softirq+0x10c/0x208
[  306.037132]  irq_exit+0xb8/0xc8
[  306.040267]  __handle_domain_irq+0x64/0xb8
[  306.044352]  gic_handle_irq+0x7c/0x178
[  306.048092]  el1_irq+0xb0/0x128
[  306.051227]  arch_cpu_idle+0x10/0x18
[  306.054795]  do_idle+0x120/0x138
[  306.058015]  cpu_startup_entry+0x20/0x28
[  306.061931]  rest_init+0xcc/0xd8
[  306.065154]  start_kernel+0x3bc/0x3e4
[  306.068810] Code: f2fbd5b7 f2fbd5b6 d503201f f9400422 (f9000662)
[  306.074900] ---[ end trace 96c8e2284a9d9d6e ]---
[  306.079507] Kernel panic - not syncing: Fatal exception in interrupt
[  306.085847] SMP: stopping secondary CPUs
[  306.089765] Kernel Offset: disabled

Try to explain one of the possible crash cases:

The "real" admin list is assigned when admin_sched is set to
new_admin, it happens after "swap", that assigns to oper_sched NULL.
Thus if call qdisc show it can crash.

Farther, next second time, when sched list is updated, the admin_sched
is not NULL and becomes the oper_sched, previous oper_sched was NULL so
just skipped. But then admin_sched is assigned new_admin, but schedules
to free previous assigned admin_sched (that already became oper_sched).

Farther, next third time, when sched list is updated,
while one more swap, oper_sched is not null, but it was happy to be
freed already (while prev. admin update), so while try to free
oper_sched the kernel panic happens at taprio_free_sched_cb().

So, move the "swap emulation" where it should be according to function
comment from code.

Fixes: 9c66d15646 ("taprio: Add support for hardware offloading")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 13:58:14 -08:00
Marcel Holtmann 8670b2b8b0 rfkill: allocate static minor
udev has a feature of creating /dev/<node> device-nodes if it finds
a devnode:<node> modalias. This allows for auto-loading of modules that
provide the node. This requires to use a statically allocated minor
number for misc character devices.

However, rfkill uses dynamic minor numbers and prevents auto-loading
of the module. So allocate the next static misc minor number and use
it for rfkill.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Link: https://lore.kernel.org/r/20191024174042.19851-1-marcel@holtmann.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-05 18:25:54 +01:00
Oleksij Rempel 688d11c384 can: j1939: transport: j1939_xtp_rx_eoma_one(): Add sanity check for correct total message size
We were sending malformed EOMA with total message size set to 0. This
issue has been fixed in the previous patch.

In this patch a sanity check is added to the RX path and a error message
is displayed.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04 21:47:23 +01:00
Oleksij Rempel eaa654f164 can: j1939: transport: j1939_session_fresh_new(): make sure EOMA is send with the total message size set
We were sending malformed EOMA messageswith total message size set to 0.

This patch fixes the bug.

Reported-by: https://github.com/linux-can/can-utils/issues/159
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04 21:47:23 +01:00
Oleksij Rempel 896daf723c can: j1939: fix memory leak if filters was set
Filters array is coped from user space and linked to the j1939 socket.
On socket release this memory was not freed.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04 21:47:23 +01:00
Colin Ian King db1a804cca can: j1939: fix resource leak of skb on error return paths
Currently the error return paths do not free skb and this results in a
memory leak. Fix this by freeing them before the return.

Addresses-Coverity: ("Resource leak")
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-11-04 21:47:23 +01:00
Pablo Neira Ayuso 774e4d34db Merge branch 'master' of git://blackhole.kfki.hu/nf
Jozsef Kadlecsik says:

====================
ipset patches for nf

- Fix the error code in ip_set_sockfn_get() when copy_to_user() is used,
  from Dan Carpenter.
- The IPv6 part was missed when fixing copying the right MAC address
  in the patch "netfilter: ipset: Copy the right MAC address in bitmap:ip,mac
  and hash:ip,mac sets", it is completed now by Stefano Brivio.
- ipset nla_policies are fixed to fully support NL_VALIDATE_STRICT and
  the code is converted from deprecated parsings to verified ones.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:59:00 +01:00
Pablo Neira Ayuso 88c749840d netfilter: nf_tables_offload: skip EBUSY on chain update
Do not try to bind a chain again if it exists, otherwise the driver
returns EBUSY.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:58:36 +01:00
Pablo Neira Ayuso 1ed012f6fd netfilter: nf_tables: bogus EOPNOTSUPP on basechain update
Userspace never includes the NFT_BASE_CHAIN flag, this flag is inferred
from the NFTA_CHAIN_HOOK atribute. The chain update path does not allow
to update flags at this stage, the existing sanity check bogusly hits
EOPNOTSUPP in the basechain case if the offload flag is set on.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:58:35 +01:00
Florian Westphal b23c0742c2 bridge: ebtables: don't crash when using dnat target in output chains
xt_in() returns NULL in the output hook, skip the pkt_type change for
that case, redirection only makes sense in broute/prerouting hooks.

Reported-by: Tom Yan <tom.ty89@gmail.com>
Cc: Linus Lüssing <linus.luessing@c0d3.blue>
Fixes: cf3cb246e2 ("bridge: ebtables: fix reception of frames DNAT-ed to bridge device/port")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:58:34 +01:00
Fernando Fernandez Mancera 9fedd894b4 netfilter: nf_tables: fix unexpected EOPNOTSUPP error
If the object type doesn't implement an update operation and the user tries to
update it will silently ignore the update operation.

Fixes: aa4095a156 ("netfilter: nf_tables: fix possible null-pointer dereference in object update")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:58:33 +01:00
Jozsef Kadlecsik 1289975643 netfilter: ipset: Fix nla_policies to fully support NL_VALIDATE_STRICT
Since v5.2 (commit "netlink: re-add parse/validate functions in strict
mode") NL_VALIDATE_STRICT is enabled. Fix the ipset nla_policies which did
not support strict mode and convert from deprecated parsings to verified ones.

Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04 20:46:13 +01:00
Stefano Brivio 97664bc2c7 netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 sets
Same as commit 1b4a75108d ("netfilter: ipset: Copy the right MAC
address in bitmap:ip,mac and hash:ip,mac sets"), another copy and paste
went wrong in commit 8cc4ccf583 ("netfilter: ipset: Allow matching on
destination MAC address for mac and ipmac sets").

When I fixed this for IPv4 in 1b4a75108d, I didn't realise that
hash:ip,mac sets also support IPv6 as family, and this is covered by a
separate function, hash_ipmac6_kadt().

In hash:ip,mac sets, the first dimension is the IP address, and the
second dimension is the MAC address: check the IPSET_DIM_TWO_SRC flag
in flags while deciding which MAC address to copy, destination or
source.

This way, mixing source and destination matches for the two dimensions
of ip,mac hash type works as expected, also for IPv6. With this setup:

  ip netns add A
  ip link add veth1 type veth peer name veth2 netns A
  ip addr add 2001:db8::1/64 dev veth1
  ip -net A addr add 2001:db8::2/64 dev veth2
  ip link set veth1 up
  ip -net A link set veth2 up

  dst=$(ip netns exec A cat /sys/class/net/veth2/address)

  ip netns exec A ipset create test_hash hash:ip,mac family inet6
  ip netns exec A ipset add test_hash 2001:db8::1,${dst}
  ip netns exec A ip6tables -A INPUT -p icmpv6 --icmpv6-type 135 -j ACCEPT
  ip netns exec A ip6tables -A INPUT -m set ! --match-set test_hash src,dst -j DROP

ipset now correctly matches a test packet:

  # ping -c1 2001:db8::2 >/dev/null
  # echo $?
  0

Reported-by: Chen, Yi <yiche@redhat.com>
Fixes: 8cc4ccf583 ("netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04 20:45:53 +01:00
Dan Carpenter 30b7244d79 netfilter: ipset: Fix an error code in ip_set_sockfn_get()
The copy_to_user() function returns the number of bytes remaining to be
copied.  In this code, that positive return is checked at the end of the
function and we return zero/success.  What we should do instead is
return -EFAULT.

Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04 20:45:29 +01:00
Kristian Evensen b6520fce07 netfilter: ipset: Add wildcard support to net,iface
The net,iface equal functions currently compares the full interface
names. In several cases, wildcard (or prefix) matching is useful. For
example, when converting a large iptables rule-set to make use of ipset,
I was able to significantly reduce the number of set elements by making
use of wildcard matching.

Wildcard matching is enabled by adding "wildcard" when adding an element
to a set. Internally, this causes the IPSET_FLAG_IFACE_WILDCARD-flag to
be set.  When this flag is set, only the initial part of the interface
name is used for comparison.

Wildcard matching is done per element and not per set, as there are many
cases where mixing wildcard and non-wildcard elements are useful. This
means that is up to the user to handle (avoid) overlapping interface
names.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2019-11-04 20:44:17 +01:00
Eric Dumazet 3d1e5039f5 dccp: do not leak jiffies on the wire
For some reason I missed the case of DCCP passive
flows in my previous patch.

Fixes: a904a0693c ("inet: stop leaking jiffies on the wire")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thiemo Nagel <tnagel@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04 11:36:31 -08:00
Eric Dumazet b6b556afd2 ipv6: use jhash2() in rt6_exception_hash()
Faster jhash2() can be used instead of jhash(), since
IPv6 addresses have the needed alignment requirement.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04 11:28:49 -08:00
Andrew Lunn 0c65b2b90d net: of_get_phy_mode: Change API to solve int/unit warnings
Before this change of_get_phy_mode() returned an enum,
phy_interface_t. On error, -ENODEV etc, is returned. If the result of
the function is stored in a variable of type phy_interface_t, and the
compiler has decided to represent this as an unsigned int, comparision
with -ENODEV etc, is a signed vs unsigned comparision.

Fix this problem by changing the API. Make the function return an
error, or 0 on success, and pass a pointer, of type phy_interface_t,
where the phy mode should be stored.

v2:
Return with *interface set to PHY_INTERFACE_MODE_NA on error.
Add error checks to all users of of_get_phy_mode()
Fixup a few reverse christmas tree errors
Fixup a few slightly malformed reverse christmas trees

v3:
Fix 0-day reported errors.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04 11:21:25 -08:00
Nikolay Aleksandrov 5d1fcaf35d net: bridge: fdb: eliminate extra port state tests from fast-path
When commit df1c0b8468 ("[BRIDGE]: Packets leaking out of
disabled/blocked ports.") introduced the port state tests in
br_fdb_update() it was to avoid learning/refreshing from STP BPDUs, it was
also used to avoid learning/refreshing from user-space with NTF_USE. Those
two tests are done for every packet entering the bridge if it's learning,
but for the fast-path we already have them checked in br_handle_frame() and
is unnecessary to do it again. Thus push the checks to the unlikely cases
and drop them from br_fdb_update(), the new nbp_state_should_learn() helper
is used to determine if the port state allows br_fdb_update() to be called.
The two places which need to do it manually are:
 - user-space add call with NTF_USE set
 - link-local packet learning done in __br_handle_local_finish()

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-04 11:15:27 -08:00
Pablo Neira Ayuso de2a605223 netfilter: nf_tables_offload: check for register data length mismatches
Make sure register data length does not mismatch immediate data length,
otherwise hit EOPNOTSUPP.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 18:31:17 +01:00
Luiz Augusto von Dentz 6012b9346d Bluetooth: Fix advertising duplicated flags
Instances may have flags set as part of its data in which case the code
should not attempt to add it again otherwise it can cause duplication:

< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 35
        Handle: 0x00
        Operation: Complete extended advertising data (0x03)
        Fragment preference: Minimize fragmentation (0x01)
        Data length: 0x06
        Flags: 0x04
          BR/EDR Not Supported
        Flags: 0x06
          LE General Discoverable Mode
          BR/EDR Not Supported

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2019-11-04 10:12:05 +02:00
Francesco Ruggeri fac6fce9bd net: icmp6: provide input address for traceroute6
traceroute6 output can be confusing, in that it shows the address
that a router would use to reach the sender, rather than the address
the packet used to reach the router.
Consider this case:

        ------------------------ N2
         |                    |
       ------              ------  N3  ----
       | R1 |              | R2 |------|H2|
       ------              ------      ----
         |                    |
        ------------------------ N1
                  |
                 ----
                 |H1|
                 ----

where H1's default route is through R1, and R1's default route is
through R2 over N2.
traceroute6 from H1 to H2 shows R2's address on N1 rather than on N2.

The script below can be used to reproduce this scenario.

traceroute6 output without this patch:

traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
 1  2000:101::1 (2000:101::1)  0.036 ms  0.008 ms  0.006 ms
 2  2000:101::2 (2000:101::2)  0.011 ms  0.008 ms  0.007 ms
 3  2000:103::4 (2000:103::4)  0.013 ms  0.010 ms  0.009 ms

traceroute6 output with this patch:

traceroute to 2000:103::4 (2000:103::4), 30 hops max, 80 byte packets
 1  2000:101::1 (2000:101::1)  0.056 ms  0.019 ms  0.006 ms
 2  2000:102::2 (2000:102::2)  0.013 ms  0.008 ms  0.008 ms
 3  2000:103::4 (2000:103::4)  0.013 ms  0.009 ms  0.009 ms

#!/bin/bash
#
#        ------------------------ N2
#         |                    |
#       ------              ------  N3  ----
#       | R1 |              | R2 |------|H2|
#       ------              ------      ----
#         |                    |
#        ------------------------ N1
#                  |
#                 ----
#                 |H1|
#                 ----
#
# N1: 2000:101::/64
# N2: 2000:102::/64
# N3: 2000:103::/64
#
# R1's host part of address: 1
# R2's host part of address: 2
# H1's host part of address: 3
# H2's host part of address: 4
#
# For example:
# the IPv6 address of R1's interface on N2 is 2000:102::1/64
#
# Nets are implemented by macvlan interfaces (bridge mode) over
# dummy interfaces.
#

# Create net namespaces
ip netns add host1
ip netns add host2
ip netns add rtr1
ip netns add rtr2

# Create nets
ip link add net1 type dummy; ip link set net1 up
ip link add net2 type dummy; ip link set net2 up
ip link add net3 type dummy; ip link set net3 up

# Add interfaces to net1, move them to their nemaspaces
ip link add link net1 dev host1net1 type macvlan mode bridge
ip link set host1net1 netns host1
ip link add link net1 dev rtr1net1 type macvlan mode bridge
ip link set rtr1net1 netns rtr1
ip link add link net1 dev rtr2net1 type macvlan mode bridge
ip link set rtr2net1 netns rtr2

# Add interfaces to net2, move them to their nemaspaces
ip link add link net2 dev rtr1net2 type macvlan mode bridge
ip link set rtr1net2 netns rtr1
ip link add link net2 dev rtr2net2 type macvlan mode bridge
ip link set rtr2net2 netns rtr2

# Add interfaces to net3, move them to their nemaspaces
ip link add link net3 dev rtr2net3 type macvlan mode bridge
ip link set rtr2net3 netns rtr2
ip link add link net3 dev host2net3 type macvlan mode bridge
ip link set host2net3 netns host2

# Configure interfaces and routes in host1
ip netns exec host1 ip link set lo up
ip netns exec host1 ip link set host1net1 up
ip netns exec host1 ip -6 addr add 2000:101::3/64 dev host1net1
ip netns exec host1 ip -6 route add default via 2000:101::1

# Configure interfaces and routes in rtr1
ip netns exec rtr1 ip link set lo up
ip netns exec rtr1 ip link set rtr1net1 up
ip netns exec rtr1 ip -6 addr add 2000:101::1/64 dev rtr1net1
ip netns exec rtr1 ip link set rtr1net2 up
ip netns exec rtr1 ip -6 addr add 2000:102::1/64 dev rtr1net2
ip netns exec rtr1 ip -6 route add default via 2000:102::2
ip netns exec rtr1 sysctl net.ipv6.conf.all.forwarding=1

# Configure interfaces and routes in rtr2
ip netns exec rtr2 ip link set lo up
ip netns exec rtr2 ip link set rtr2net1 up
ip netns exec rtr2 ip -6 addr add 2000:101::2/64 dev rtr2net1
ip netns exec rtr2 ip link set rtr2net2 up
ip netns exec rtr2 ip -6 addr add 2000:102::2/64 dev rtr2net2
ip netns exec rtr2 ip link set rtr2net3 up
ip netns exec rtr2 ip -6 addr add 2000:103::2/64 dev rtr2net3
ip netns exec rtr2 sysctl net.ipv6.conf.all.forwarding=1

# Configure interfaces and routes in host2
ip netns exec host2 ip link set lo up
ip netns exec host2 ip link set host2net3 up
ip netns exec host2 ip -6 addr add 2000:103::4/64 dev host2net3
ip netns exec host2 ip -6 route add default via 2000:103::2

# Ping host2 from host1
ip netns exec host1 ping6 -c5 2000:103::4

# Traceroute host2 from host1
ip netns exec host1 traceroute6 2000:103::4

# Delete nets
ip link del net3
ip link del net2
ip link del net1

# Delete namespaces
ip netns del rtr2
ip netns del rtr1
ip netns del host2
ip netns del host1

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Original-patch-by: Honggang Xu <hxu@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:26:53 -08:00
Tuong Lien 06e7c70c6e tipc: improve message bundling algorithm
As mentioned in commit e95584a889 ("tipc: fix unlimited bundling of
small messages"), the current message bundling algorithm is inefficient
that can generate bundles of only one payload message, that causes
unnecessary overheads for both the sender and receiver.

This commit re-designs the 'tipc_msg_make_bundle()' function (now named
as 'tipc_msg_try_bundle()'), so that when a message comes at the first
place, we will just check & keep a reference to it if the message is
suitable for bundling. The message buffer will be put into the link
backlog queue and processed as normal. Later on, when another one comes
we will make a bundle with the first message if possible and so on...
This way, a bundle if really needed will always consist of at least two
payload messages. Otherwise, we let the first buffer go its way without
any need of bundling, so reduce the overheads to zero.

Moreover, since now we have both the messages in hand, we can even
optimize the 'tipc_msg_bundle()' function, make bundle of a very large
(size ~ MSS) and small messages which is not with the current algorithm
e.g. [1400-byte message] + [10-byte message] (MTU = 1500).

Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:26:15 -08:00
Francesco Ruggeri 2adf81c0f7 net: icmp: use input address in traceroute
Even with icmp_errors_use_inbound_ifaddr set, traceroute returns the
primary address of the interface the packet was received on, even if
the path goes through a secondary address. In the example:

                    1.0.3.1/24
 ---- 1.0.1.3/24    1.0.1.1/24 ---- 1.0.2.1/24    1.0.2.4/24 ----
 |H1|--------------------------|R1|--------------------------|H2|
 ----            N1            ----            N2            ----

where 1.0.3.1/24 is R1's primary address on N1, traceroute from
H1 to H2 returns:

traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
 1  1.0.3.1 (1.0.3.1)  0.018 ms  0.006 ms  0.006 ms
 2  1.0.2.4 (1.0.2.4)  0.021 ms  0.007 ms  0.007 ms

After applying this patch, it returns:

traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
 1  1.0.1.1 (1.0.1.1)  0.033 ms  0.007 ms  0.006 ms
 2  1.0.2.4 (1.0.2.4)  0.011 ms  0.007 ms  0.007 ms

Original-patch-by: Bill Fenner <fenner@arista.com>
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:25:18 -08:00
Tonghao Zhang eec62eadd1 net: openvswitch: simplify the ovs_dp_cmd_new
use the specified functions to init resource.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:04 -08:00
Tonghao Zhang 4c76bf696a net: openvswitch: don't unlock mutex when changing the user_features fails
Unlocking of a not locked mutex is not allowed.
Other kernel thread may be in critical section while
we unlock it because of setting user_feature fail.

Fixes: 95a7233c4 ("net: openvswitch: Set OvS recirc_id from tc chain index")
Cc: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:04 -08:00
Tonghao Zhang 50b0e61b32 net: openvswitch: fix possible memleak on destroy flow-table
When we destroy the flow tables which may contain the flow_mask,
so release the flow mask struct.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 0a3e01371d net: openvswitch: add likely in flow_lookup
The most case *index < ma->max, and flow-mask is not NULL.
We add un/likely for performance.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 515b65a4b9 net: openvswitch: simplify the flow_hash
Simplify the code and remove the unnecessary BUILD_BUG_ON.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 57f7d7b916 net: openvswitch: optimize flow-mask looking up
The full looking up on flow table traverses all mask array.
If mask-array is too large, the number of invalid flow-mask
increase, performance will be drop.

One bad case, for example: M means flow-mask is valid and NULL
of flow-mask means deleted.

+-------------------------------------------+
| M | NULL | ...                  | NULL | M|
+-------------------------------------------+

In that case, without this patch, openvswitch will traverses all
mask array, because there will be one flow-mask in the tail. This
patch changes the way of flow-mask inserting and deleting, and the
mask array will be keep as below: there is not a NULL hole. In the
fast path, we can "break" "for" (not "continue") in flow_lookup
when we get a NULL flow-mask.

         "break"
            v
+-------------------------------------------+
| M | M |  NULL |...           | NULL | NULL|
+-------------------------------------------+

This patch don't optimize slow or control path, still using ma->max
to traverse. Slow path:
* tbl_mask_array_realloc
* ovs_flow_tbl_lookup_exact
* flow_mask_find

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang a7f35e78e7 net: openvswitch: optimize flow mask cache hash collision
Port the codes to linux upstream and with little changes.

Pravin B Shelar, says:
| In case hash collision on mask cache, OVS does extra flow
| lookup. Following patch avoid it.

Link: 0e6efbe271
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 1689754de6 net: openvswitch: shrink the mask array if necessary
When creating and inserting flow-mask, if there is no available
flow-mask, we realloc the mask array. When removing flow-mask,
if necessary, we shrink mask array.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 4bc63b1b53 net: openvswitch: convert mask list in mask array
Port the codes to linux upstream and with little changes.

Pravin B Shelar, says:
| mask caches index of mask in mask_list. On packet recv OVS
| need to traverse mask-list to get cached mask. Therefore array
| is better for retrieving cached mask. This also allows better
| cache replacement algorithm by directly checking mask's existence.

Link: d49fc3ff53
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Tonghao Zhang 04b7d136d0 net: openvswitch: add flow-mask cache for performance
The idea of this optimization comes from a patch which
is committed in 2014, openvswitch community. The author
is Pravin B Shelar. In order to get high performance, I
implement it again. Later patches will use it.

Pravin B Shelar, says:
| On every packet OVS needs to lookup flow-table with every
| mask until it finds a match. The packet flow-key is first
| masked with mask in the list and then the masked key is
| looked up in flow-table. Therefore number of masks can
| affect packet processing performance.

Link: 5604935e4e
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03 17:18:03 -08:00
Sven Eckelmann 5759af0682 batman-adv: Drop lockdep.h include for soft-interface.c
The commit ab92d68fc2 ("net: core: add generic lockdep keys") removed
all lockdep functionality from soft-interface.c but didn't remove the
include for this functionality.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-11-03 08:30:58 +01:00
Sven Eckelmann a7757d318a batman-adv: Use 'fallthrough' pseudo keyword
The usage of the '/* fall through */' comments in switches are no longer
marked as non-deprecated variant of implicit fall throughs for switch
statements. The commit 294f69e662 ("compiler_attributes.h: Add
'fallthrough' pseudo keyword for switch/case use") introduced a replacement
keyword which should be used instead.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-11-03 08:30:58 +01:00
Christophe JAILLET baa1e8a0da batman-adv: Axe 'aggr_list_lock'
'aggr_list.lock' can safely be used in place of another explicit spinlock
when access to 'aggr_list' has to be guarded.

This avoids to take 2 locks, knowing that the 2nd one is always successful.

Now that the 'aggr_list.lock' is handled explicitly, the lock-free
__sbk_something() variants should be used when dealing with 'aggr_list'.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-11-03 08:30:58 +01:00
Christophe JAILLET 9044854e4b batman-adv: Simplify 'batadv_v_ogm_aggr_list_free()'
Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-11-03 08:30:58 +01:00
Simon Wunderlich 79f0a4858f batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-11-03 08:30:58 +01:00
David S. Miller ae8a76fb8b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-11-02

The following pull-request contains BPF updates for your *net-next* tree.

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

The main changes are:

1) Fix long standing user vs kernel access issue by introducing
   bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.

2) Accelerated xskmap lookup, from Björn and Maciej.

3) Support for automatic map pinning in libbpf, from Toke.

4) Cleanup of BTF-enabled raw tracepoints, from Alexei.

5) Various fixes to libbpf and selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 15:29:58 -07:00
David S. Miller d31e95585c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 13:54:56 -07:00
Linus Torvalds 1204c70d9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix free/alloc races in batmanadv, from Sven Eckelmann.

 2) Several leaks and other fixes in kTLS support of mlx5 driver, from
    Tariq Toukan.

 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
    Høiland-Jørgensen.

 4) Add an r8152 device ID, from Kazutoshi Noguchi.

 5) Missing include in ipv6's addrconf.c, from Ben Dooks.

 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can
    easily infer the 32-bit secret otherwise etc.

 7) Several netdevice nesting depth fixes from Taehee Yoo.

 8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
    when doing lockless skb_queue_empty() checks, and accessing
    sk_napi_id/sk_incoming_cpu lockless as well.

 9) Fix jumbo packet handling in RXRPC, from David Howells.

10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.

11) Fix DMA synchronization in gve driver, from Yangchun Fu.

12) Several bpf offload fixes, from Jakub Kicinski.

13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.

14) Fix ping latency during high traffic rates in hisilicon driver, from
    Jiangfent Xiao.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
  net: fix installing orphaned programs
  net: cls_bpf: fix NULL deref on offload filter removal
  selftests: bpf: Skip write only files in debugfs
  selftests: net: reuseport_dualstack: fix uninitalized parameter
  r8169: fix wrong PHY ID issue with RTL8168dp
  net: dsa: bcm_sf2: Fix IMP setup for port different than 8
  net: phylink: Fix phylink_dbg() macro
  gve: Fixes DMA synchronization.
  inet: stop leaking jiffies on the wire
  ixgbe: Remove duplicate clear_bit() call
  Documentation: networking: device drivers: Remove stray asterisks
  e1000: fix memory leaks
  i40e: Fix receive buffer starvation for AF_XDP
  igb: Fix constant media auto sense switching when no cable is connected
  net: ethernet: arc: add the missed clk_disable_unprepare
  igb: Enable media autosense for the i350.
  igb/igc: Don't warn on fatal read failures when the device is removed
  tcp: increase tcp_max_syn_backlog max value
  net: increase SOMAXCONN to 4096
  netdevsim: Fix use-after-free during device dismantle
  ...
2019-11-01 17:48:11 -07:00
Björn Töpel d817991cc7 xsk: Restructure/inline XSKMAP lookup/redirect/flush
In this commit the XSKMAP entry lookup function used by the XDP
redirect code is moved from the xskmap.c file to the xdp_sock.h
header, so the lookup can be inlined from, e.g., the
bpf_xdp_redirect_map() function.

Further the __xsk_map_redirect() and __xsk_map_flush() is moved to the
xsk.c, which lets the compiler inline the xsk_rcv() and xsk_flush()
functions.

Finally, all the XDP socket functions were moved from linux/bpf.h to
net/xdp_sock.h, where most of the XDP sockets functions are anyway.

This yields a ~2% performance boost for the xdpsock "rx_drop"
scenario.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191101110346.15004-4-bjorn.topel@gmail.com
2019-11-02 00:38:49 +01:00
Jakub Kicinski aefc3e723a net: fix installing orphaned programs
When netdevice with offloaded BPF programs is destroyed
the programs are orphaned and removed from the program
IDA - their IDs get released (the programs may remain
accessible via existing open file descriptors and pinned
files). After IDs are released they are set to 0.

This confuses dev_change_xdp_fd() because it compares
the __dev_xdp_query() result where 0 means no program
with prog->aux->id where 0 means orphaned.

dev_change_xdp_fd() would have incorrectly returned success
even though it had not installed the program.

Since drivers already catch this case via bpf_offload_dev_match()
let them handle this case. The error message drivers produce in
this case ("program loaded for a different device") is in fact
correct as the orphaned program must had to be loaded for a
different device.

Fixes: c14a9f633d ("net: Don't call XDP_SETUP_PROG when nothing is changed")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 15:16:01 -07:00
Jakub Kicinski 41aa29a58b net: cls_bpf: fix NULL deref on offload filter removal
Commit 4011921137 ("net: sched: refactor block offloads counter
usage") missed the fact that either new prog or old prog may be
NULL.

Fixes: 4011921137 ("net: sched: refactor block offloads counter usage")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 15:16:01 -07:00
Eric Dumazet a904a0693c inet: stop leaking jiffies on the wire
Historically linux tried to stick to RFC 791, 1122, 2003
for IPv4 ID field generation.

RFC 6864 made clear that no matter how hard we try,
we can not ensure unicity of IP ID within maximum
lifetime for all datagrams with a given source
address/destination address/protocol tuple.

Linux uses a per socket inet generator (inet_id), initialized
at connection startup with a XOR of 'jiffies' and other
fields that appear clear on the wire.

Thiemo Nagel pointed that this strategy is a privacy
concern as this provides 16 bits of entropy to fingerprint
devices.

Let's switch to a random starting point, this is just as
good as far as RFC 6864 is concerned and does not leak
anything critical.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thiemo Nagel <tnagel@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 14:57:52 -07:00
Nikolay Aleksandrov 58ec1ea637 net: bridge: fdb: restore unlikely() when taking over externally added entries
Taking over hw-learned entries is not a likely scenario so restore the
unlikely() use for the case of SW taking over externally learned
entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 10:32:43 -07:00
Nikolay Aleksandrov 31f1155bdc net: bridge: fdb: avoid two atomic bitops in br_fdb_external_learn_add()
If we setup the fdb flags prior to calling fdb_create() we can avoid
two atomic bitops when learning a new entry.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 10:32:43 -07:00
Nikolay Aleksandrov be0c567797 net: bridge: fdb: br_fdb_update can take flags directly
If we modify br_fdb_update() to take flags directly we can get rid of
one test and one atomic bitop in the learning path.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-01 10:32:43 -07:00
Eric Biggers b95bba5d01 crypto: skcipher - rename the crypto_blkcipher module and kconfig option
Now that the blkcipher algorithm type has been removed in favor of
skcipher, rename the crypto_blkcipher kernel module to crypto_skcipher,
and rename the config options accordingly:

	CONFIG_CRYPTO_BLKCIPHER => CONFIG_CRYPTO_SKCIPHER
	CONFIG_CRYPTO_BLKCIPHER2 => CONFIG_CRYPTO_SKCIPHER2

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-01 13:42:47 +08:00
Eric Biggers c65058b758 crypto: skcipher - remove the "blkcipher" algorithm type
Now that all "blkcipher" algorithms have been converted to "skcipher",
remove the blkcipher algorithm type.

The skcipher (symmetric key cipher) algorithm type was introduced a few
years ago to replace both blkcipher and ablkcipher (synchronous and
asynchronous block cipher).  The advantages of skcipher include:

  - A much less confusing name, since none of these algorithm types have
    ever actually been for raw block ciphers, but rather for all
    length-preserving encryption modes including block cipher modes of
    operation, stream ciphers, and other length-preserving modes.

  - It unified blkcipher and ablkcipher into a single algorithm type
    which supports both synchronous and asynchronous implementations.
    Note, blkcipher already operated only on scatterlists, so the fact
    that skcipher does too isn't a regression in functionality.

  - Better type safety by using struct skcipher_alg, struct
    crypto_skcipher, etc. instead of crypto_alg, crypto_tfm, etc.

  - It sometimes simplifies the implementations of algorithms.

Also, the blkcipher API was no longer being tested.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-11-01 13:38:32 +08:00
Vivien Didelot fcee85f19f net: dsa: tag_8021q: clarify index limitation
Now that there's no restriction from the DSA core side regarding
the switch IDs and port numbers, only tag_8021q which is currently
reserving 3 bits for the switch ID and 4 bits for the port number, has
limitation for these values. Update their descriptions to reflect that.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot 27d4d19d7c net: dsa: remove limitation of switch index value
Because there is no static array describing the links between switches
anymore, we have no reason to force a limitation of the index value
set by the device tree.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot 8e5cb84c67 net: dsa: remove tree functions related to switches
The DSA fabric setup code has been simplified a lot so get rid of
the dsa_tree_remove_switch, dsa_tree_add_switch and dsa_switch_add
helpers, and keep the code simple with only the dsa_switch_probe and
dsa_switch_remove functions.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot 9c8ad1ab66 net: dsa: remove the dst->ds array
Now that the DSA ports are listed in the switch fabric, there is
no need to store the dsa_switch structures from the drivers in the
fabric anymore. So get rid of the dst->ds static array.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot 3774ecdb8c net: dsa: remove switch routing table setup code
The dsa_switch structure has no routing table specific data to setup,
so the switch fabric can directly walk its ports and initialize its
routing table from them.

This allows us to remove the dsa_switch_setup_routing_table function.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot 96252b8e05 net: dsa: remove ds->rtable
Drivers do not use the ds->rtable static arrays anymore, get rid of it.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot c5f51765a1 net: dsa: list DSA links in the fabric
Implement a new list of DSA links in the switch fabric itself, to
provide an alterative to the ds->rtable static arrays.

At the same time, provide a new dsa_routing_port() helper to abstract
the usage of ds->rtable in drivers. If there's no port to reach a
given device, return the first invalid port, ds->num_ports. This avoids
potential signedness errors or the need to define special values.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Eric Dumazet 623d0c2db0 tcp: increase tcp_max_syn_backlog max value
tcp_max_syn_backlog default value depends on memory size
and TCP ehash size. Before this patch, the max value
was 2048 [1], which is considered too small nowadays.

Increase it to 4096 to match the recent SOMAXCONN change.

[1] This is with TCP ehash size being capped to 524288 buckets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Yue Cao <ycao009@ucr.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:02:01 -07:00
David Howells f9c32435ab rxrpc: Fix handling of last subpacket of jumbo packet
When rxrpc_recvmsg_data() sets the return value to 1 because it's drained
all the data for the last packet, it checks the last-packet flag on the
whole packet - but this is wrong, since the last-packet flag is only set on
the final subpacket of the last jumbo packet.  This means that a call that
receives its last packet in a jumbo packet won't complete properly.

Fix this by having rxrpc_locate_data() determine the last-packet state of
the subpacket it's looking at and passing that back to the caller rather
than having the caller look in the packet header.  The caller then needs to
cache this in the rxrpc_call struct as rxrpc_locate_data() isn't then
called again for this packet.

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Fixes: e2de6c4048 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 12:23:09 -07:00
David S. Miller 5a7ec66782 Just two fixes:
* HT operation is not allowed on channel 14 (Japan only)
  * netlink policy for nexthop attribute was wrong
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl26uKkACgkQB8qZga/f
 l8S8lg//UB9ocCdFI2w5EMjpCpdOlrHsbOfyyeKdYtGZuy30YXIWg6AqU1sy3Cwr
 SMYoB1RWrrnnEAV1LirMXmYuK48gIAGvW3t/WHhTy0W6vhcEeOSGC5LKAsZAexR9
 uG9BpcqkYhsx8IeXvw2tjhBG+poqZymmmCbujfaUA1+ZTcOk5C8vMWONp079Os6L
 RfkKRDwjxUG68j68aHEHeQX26MQGPXAhm0IQm7Or6KXn2ZKs18h0h5+Yp5RCUbTe
 Gqov2O1GGdcirz2rckLQv0kXblVzgRYOFnrvu+ePvcMeVVLH81Lq19RyccgDBC58
 iUdDpnpmJQ4gFLCFOdFVYewsSJ+VNmHqE9Wo0dqj+kNLuN1C6KTaS81TNMY0eeVw
 M5GDVQLTxO0hiiIweQD/1PhB3CJqyxLbahZ2ikHTMcjQLy1CSHaGYWGV5dxkkRbk
 O0vH5LAb/VuTI0kZMIoCJAph76snWFpPFNTbrdO4DEY11wetVb0uUcq2LY7DoYXp
 cEysfyloWXNpUcpB21b6o5eKtqTvZZ1Ot1Q5Z5UpEEU1GVE05iAJeVLVHGy3fQCF
 zSymwtdc6OjPjYCh/lMHMq9tSfywHX0aseD1dBpe9KkLIyloa76xNnLhUrRTets9
 DdVuu2Fi1HDIWkOUd2Gok2aAQrNTYI+zb2xPsnB1mN1swc4V5o4=
 =B0D8
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2019-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two fixes:
 * HT operation is not allowed on channel 14 (Japan only)
 * netlink policy for nexthop attribute was wrong
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 11:43:36 -07:00
Ingo Molnar 43e0ae7ae0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Dynamic tick (nohz) updates, perhaps most notably changes to
    force the tick on when needed due to lengthy in-kernel execution
    on CPUs on which RCU is waiting.

  - Replace rcu_swap_protected() with rcu_prepace_pointer().

  - Torture-test updates.

  - Linux-kernel memory consistency model updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31 09:33:19 +01:00
Vlad Buslov e382267860 net: sched: update action implementations to support flags
Extend struct tc_action with new "tcfa_flags" field. Set the field in
tcf_idr_create() function and provide new helper
tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
value. Update individual hardware-offloaded actions init() to pass their
"flags" argument to new helper in order to skip percpu stats allocation
when user requested it through flags.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:51 -07:00
Vlad Buslov abbb0d3363 net: sched: extend TCA_ACT space with TCA_ACT_FLAGS
Extend TCA_ACT space with nla_bitfield32 flags. Add
TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
tcf_action_init_1() and pass resulting value as additional argument to
a_o->init().

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov 5e174d5e73 net: sched: modify stats helper functions to support regular stats
Modify stats update helper functions introduced in previous patches in this
series to fallback to regular tc_action->tcfa_{b|q}stats if cpu stats are
not allocated for the action argument. If regular non-percpu allocated
counters are in use, then obtain action tcfa_lock while modifying them.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov ef816f3c49 net: sched: don't expose action qstats to skb_tc_reinsert()
Previous commit introduced helper function for updating qstats and
refactored set of actions to use the helpers, instead of modifying qstats
directly. However, one of the affected action exposes its qstats to
skb_tc_reinsert(), which then modifies it.

Refactor skb_tc_reinsert() to return integer error code and don't increment
overlimit qstats in case of error, and use the returned error code in
tcf_mirred_act() to manually increment the overlimit counter with new
helper function.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov 26b537a88c net: sched: extract qstats update code into functions
Extract common code that increments cpu_qstats counters into standalone act
API functions. Change hardware offloaded actions that use percpu counter
allocation to use the new functions instead of accessing cpu_qstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov 5e1ad95b63 net: sched: extract bstats update code into function
Extract common code that increments cpu_bstats counter into standalone act
API function. Change hardware offloaded actions that use percpu counter
allocation to use the new function instead of incrementing cpu_bstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov c8ecebd04c net: sched: extract common action counters update code into function
Currently, all implementations of tc_action_ops->stats_update() callback
have almost exactly the same implementation of counters update
code (besides gact which also updates drop counter). In order to simplify
support for using both percpu-allocated and regular action counters
depending on run-time flag in following patches, extract action counters
update code into standalone function in act API.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Christophe JAILLET 21d8bd123a net: qrtr: Simplify 'qrtr_tun_release()'
Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:58:23 -07:00
Heiner Kallweit a2a1a13b81 net: dsa: add ethtool pause configuration support
This patch adds glue logic to make pause settings per port
configurable vie ethtool.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:42:51 -07:00
Matteo Croce 5dec597e5c flow_dissector: extract more ICMP information
The ICMP flow dissector currently parses only the Type and Code fields.
Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
is used to correlate packets.
Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
with a more complex function which populate this field.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Matteo Croce 3b336d6f4e flow_dissector: skip the ICMP dissector for non ICMP packets
FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones.
Even if the test overhead is probably negligible, move the
ICMP dissector code under the big 'switch(ip_proto)' so it gets called
only for ICMP packets.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Matteo Croce 98298e6ca6 flow_dissector: add meaningful comments
Documents two piece of code which can't be understood at a glance.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Eric Dumazet 7170a97774 net: annotate accesses to sk->sk_incoming_cpu
This socket field can be read and written by concurrent cpus.

Use READ_ONCE() and WRITE_ONCE() annotations to document this,
and avoid some compiler 'optimizations'.

KCSAN reported :

BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv

write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
 sk_incoming_cpu_update include/net/sock.h:953 [inline]
 tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189

read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
 sk_incoming_cpu_update include/net/sock.h:952 [inline]
 tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 13:24:25 -07:00
Jon Maloy c0bceb97db tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.

- The nagle property as such is determined by manipulating a new
  'maxnagle' field in struct tipc_sock. If certain conditions are met,
  'maxnagle' will define max size of the messages which can be bundled.
  If it is set to zero no messages are ever bundled, implying that the
  nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
  than 4 messages have been sent out without receiving any data message
  from the peer.
- A socket leaves nagle mode whenever it receives a data message from
  the peer.

In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.

The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.

- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
  no outstanding message where the 'ack_required' bit was set. As a
  consequence, the first message added after we enter nagle mode
  is always sent directly with this bit set.

This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.

Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 12:16:22 -07:00
Trond Myklebust 669996add4 SUNRPC: Destroy the back channel when we destroy the host transport
When we're destroying the host transport mechanism, we should ensure
that we do not leak memory by failing to release any back channel
slots that might still exist.

Reported-by: Neil Brown <neilb@suse.de>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30 12:04:35 -04:00
Trond Myklebust 9edb455e67 SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
If there are RDMA back channel requests being processed by the
server threads, then we should hold a reference to the transport
to ensure it doesn't get freed from underneath us.

Reported-by: Neil Brown <neilb@suse.de>
Fixes: 63cae47005 ("xprtrdma: Handle incoming backward direction RPC calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30 12:04:35 -04:00
Trond Myklebust 875f0706ac SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
If there are TCP back channel requests being processed by the
server threads, then we should hold a reference to the transport
to ensure it doesn't get freed from underneath us.

Reported-by: Neil Brown <neilb@suse.de>
Fixes: 2ea24497a1 ("SUNRPC: RPC callbacks may be split across several..")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-30 12:04:35 -04:00
Paul E. McKenney 445d374931 net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
This commit replaces the use of rcu_swap_protected() with the more
intuitively appealing rcu_replace_pointer() as a step towards removing
rcu_swap_protected().

Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <netdev@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>
2019-10-30 08:45:48 -07:00
Paul E. McKenney b685b534bf net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
This commit replaces the use of rcu_swap_protected() with the more
intuitively appealing rcu_replace_pointer() as a step towards removing
rcu_swap_protected().

Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <netfilter-devel@vger.kernel.org>
Cc: <coreteam@netfilter.org>
Cc: <netdev@vger.kernel.org>
2019-10-30 08:45:38 -07:00
Paul E. McKenney e3f0d761fc net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
This commit replaces the use of rcu_swap_protected() with the more
intuitively appealing rcu_replace_pointer() as a step towards removing
rcu_swap_protected().

Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ido Schimmel <idosch@mellanox.com>
Cc: Petr Machata <petrm@mellanox.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: <netdev@vger.kernel.org>
2019-10-30 08:45:26 -07:00
Markus Theil 1fab1b89e2 nl80211: fix validation of mesh path nexthop
Mesh path nexthop should be a ethernet address, but current validation
checks against 4 byte integers.

Cc: stable@vger.kernel.org
Fixes: 2ec600d672 ("nl80211/cfg80211: support for mesh, sta dumping")
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20191029093003.10355-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-30 10:11:18 +01:00
Masashi Honma ec649fed66 nl80211: Disallow setting of HT for channel 14
This patch disables setting of HT20 and more for channel 14 because
the channel is only for IEEE 802.11b.

The patch for net/wireless/util.c was unit-tested.

The patch for net/wireless/chan.c was tested with iw command.

Before this patch.
$ sudo iw dev <ifname> set channel 14 HT20
$

After this patch.
$ sudo iw dev <ifname> set channel 14 HT20
kernel reports: invalid channel definition
command failed: Invalid argument (-22)
$

Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Link: https://lore.kernel.org/r/20191021075045.2719-1-masashi.honma@gmail.com
[clean up the code, use != instead of equivalent >]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-30 10:07:22 +01:00
Nikolay Aleksandrov 8b73018fe4 net: rtnetlink: fix a typo fbd -> fdb
A simple typo fix in the nl error message (fbd -> fdb).

CC: David Ahern <dsahern@gmail.com>
Fixes: 8c6e137fbc ("rtnetlink: Update rtnl_fdb_dump for strict data checking")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:13:38 -07:00
Nikolay Aleksandrov 3fb01a31af net: bridge: fdb: set flags directly in fdb_create
No need to have separate arguments for each flag, just set the flags to
whatever was passed to fdb_create() before the fdb is published.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov d38c6e3db0 net: bridge: fdb: convert offloaded to use bitops
Convert the offloaded field to a flag and use bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov b5cd9f7c42 net: bridge: fdb: convert added_by_external_learn to use bitops
Convert the added_by_external_learn field to a flag and use bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov ac3ca6af44 net: bridge: fdb: convert added_by_user to bitops
Straight-forward convert of the added_by_user field to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov e0458d9a73 net: bridge: fdb: convert is_sticky to bitops
Straight-forward convert of the is_sticky field to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov 29e63fffd6 net: bridge: fdb: convert is_static to bitops
Convert the is_static to bitops, make use of the combined
test_and_set/clear_bit to simplify expressions in fdb_add_entry.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Nikolay Aleksandrov 6869c3b02b net: bridge: fdb: convert is_local to bitops
The patch adds a new fdb flags field in the hole between the two cache
lines and uses it to convert is_local to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:12:49 -07:00
Ursula Braun 301428ea37 net/smc: fix refcounting for non-blocking connect()
If a nonblocking socket is immediately closed after connect(),
the connect worker may not have started. This results in a refcount
problem, since sock_hold() is called from the connect worker.
This patch moves the sock_hold in front of the connect worker
scheduling.

Reported-by: syzbot+4c063e6dea39e4b79f29@syzkaller.appspotmail.com
Fixes: 50717a37db ("net/smc: nonblocking connect rework")
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:09:50 -07:00
Hoang Le f73b12812a tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.

We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.

To meet any security concerns, the following should be noted.

- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.

- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.

- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.

- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.

- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.

Regarding traceability, we should notice that since commit 6c9081a391
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.

v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback

Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 17:55:38 -07:00
Florian Westphal 51210ad5a5 inet: do not call sublist_rcv on empty list
syzbot triggered struct net NULL deref in NF_HOOK_LIST:
RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline]
RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292
 ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328
 __netif_receive_skb_list_ptype net/core/dev.c:5274 [inline]

Reason:
void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
                   struct net_device *orig_dev)
[..]
        list_for_each_entry_safe(skb, next, head, list) {
		/* iterates list */
                skb = ip6_rcv_core(skb, dev, net);
		/* ip6_rcv_core drops skb -> NULL is returned */
                if (skb == NULL)
                        continue;
	[..]
	}
	/* sublist is empty -> curr_net is NULL */
        ip6_sublist_rcv(&sublist, curr_dev, curr_net);

Before the recent change NF_HOOK_LIST did a list iteration before
struct net deref, i.e. it was a no-op in the empty list case.

List iteration now happens after *net deref, causing crash.

Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty
test for the final sublist dispatch too.

Cc: Edward Cree <ecree@solarflare.com>
Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com
Fixes: ca58fbe06c ("netfilter: add and use nf_hook_slow_list()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 17:54:29 -07:00
Xin Long 2eb8d6d291 erspan: fix the tun_info options_len check for erspan
The check for !md doens't really work for ip_tunnel_info_opts(info) which
only does info + 1. Also to avoid out-of-bounds access on info, it should
ensure options_len is not less than erspan_metadata in both erspan_xmit()
and ip6erspan_tunnel_xmit().

Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 17:36:42 -07:00