1
0
Fork 0
Commit Graph

58686 Commits (alistair/sunxi64-5.5-dsi)

Author SHA1 Message Date
John Fastabend d468e4775c bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
It is possible to build a plaintext buffer using push helper that is larger
than the allocated encrypt buffer. When this record is pushed to crypto
layers this can result in a NULL pointer dereference because the crypto
API expects the encrypt buffer is large enough to fit the plaintext
buffer. Kernel splat below.

To resolve catch the cases this can happen and split the buffer into two
records to send individually. Unfortunately, there is still one case to
handle where the split creates a zero sized buffer. In this case we merge
the buffers and unmark the split. This happens when apply is zero and user
pushed data beyond encrypt buffer. This fixes the original case as well
because the split allocated an encrypt buffer larger than the plaintext
buffer and the merge simply moves the pointers around so we now have
a reference to the new (larger) encrypt buffer.

Perhaps its not ideal but it seems the best solution for a fixes branch
and avoids handling these two cases, (a) apply that needs split and (b)
non apply case. The are edge cases anyways so optimizing them seems not
necessary unless someone wants later in next branches.

[  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
[  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
[...]
[  306.770350] Call Trace:
[  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
[  306.772026]  gcm_enc_copy_hash+0x4b/0x50
[  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
[  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
[  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
[  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
[  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
[  306.778239]  gcm_hash_init_continue+0x8f/0xa0
[  306.779121]  gcm_hash+0x73/0x80
[  306.779762]  gcm_encrypt_continue+0x6d/0x80
[  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
[  306.781474]  crypto_aead_encrypt+0x1f/0x30
[  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
[  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
[  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
[  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]

test_sockmap test signature to trigger bug,

[TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend cf21e9ba1e bpf: Sockmap/tls, msg_push_data may leave end mark in place
Leaving an incorrect end mark in place when passing to crypto
layer will cause crypto layer to stop processing data before
all data is encrypted. To fix clear the end mark on push
data instead of expecting users of the helper to clear the
mark value after the fact.

This happens when we push data into the middle of a skmsg and
have room for it so we don't do a set of copies that already
clear the end flag.

Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend 6562e29cf6 bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds
In the push, pull, and pop helpers operating on skmsg objects to make
data writable or insert/remove data we use this bounds check to ensure
specified data is valid,

 /* Bounds checks: start and pop must be inside message */
 if (start >= offset + l || last >= msg->sg.size)
     return -EINVAL;

The problem here is offset has already included the length of the
current element the 'l' above. So start could be past the end of
the scatterlist element in the case where start also points into an
offset on the last skmsg element.

To fix do the accounting slightly different by adding the length of
the previous entry to offset at the start of the iteration. And
ensure its initialized to zero so that the first iteration does
nothing.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend 33bfe20dd7 bpf: Sockmap/tls, push write_space updates through ulp updates
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.

This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.

Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend 7e81a35302 bpf: Sockmap, ensure sock lock held during tear down
The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.

Currently, though we don't take the sock_lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
while they are also being called from the networking side. This is not
safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
believed it was safe. Further its not clear to me its even a good idea
to try and do this on "live" sockets while networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides lets realize that every use case I'm aware of rarely
deletes maps, in fact kubernetes/Cilium case builds map at init and
never tears it down except on errors. So lets do the simple fix and
grab sock lock.

This patch wraps sock deletes from maps in sock lock and adds some
annotations so we catch any other cases easier.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
David S. Miller 5a40420e04 Here is a batman-adv bugfix:
- Fix DAT candidate selection on little endian systems,
    by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl4dzQUWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQFuD/44PUv3jHW9S8aIu5FobWHlswOg
 e1p9g98QdO7dezkkQa64tuFMR6mgVefOeK2+LHCZQ3Svd6xeGaxEGNo6vdzrkyrj
 iRl/5nY8hRZ3jIdD7kOG11yX/ra+3BSWMWS4yu0OAawYFMYxakW6qUR7OUaqL9xv
 Qr/XHu7S/tpy4MGUDabu11KlJGyZBfyXQxEKmSjjfWVITqVKNeoZONFqm2lmLOHs
 RppvTjzdTf3HUPp6C9rC1b6phN7G8XLRXqXrgY4LzTDmwHbOmPvJQ1t+kh9dfZuX
 DYsv89dl4iVIcSUY9fyUA9YxSHRyqiXRDDG0DPIy0AqRrHy0HgdGTlwCkS1idl1H
 okiuO5rUzpDFGG5F8BRKY2t29JN8BO+/YVZpa5bl9ZgQQpnuBu+kkbceYCHY0lat
 h2y3AMsJUcWEbFsa/ulEmHw8aj38rE4lgd7lhLir04BFVZHVB+faH78MQ7xB7+Xl
 lPl4ptf3l3mwWW/tXbuWa6bXjCg5UBAOr4fEyhrg8fM0rst8LG6trWZWlWgZykLe
 3HSnaR00tH8DiMPf/Zl7tAX/dKCmzCo6h7mXUAfvw1PktOvWC1JhtjchJASOqnoP
 peD45G5W3DGgxtmfcl6J33BmMllbxeT1N6sQRApliYvcrrMZ9gVzr2sHFxHftFXl
 kyCpbAlfAZ6ceGYCuQ==
 =6zKP
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200114' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - Fix DAT candidate selection on little endian systems,
   by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 22:44:23 +01:00
Pengcheng Yang e176b1ba47 tcp: fix marked lost packets not being retransmitted
When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
If packet loss is detected at this time, retransmit_skb_hint will be set
to point to the current packet loss in tcp_verify_retransmit_hint(),
then the packets that were previously marked lost but not retransmitted
due to the restriction of cwnd will be skipped and cannot be
retransmitted.

To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
be reset only after all marked lost packets are retransmitted
(retrans_out >= lost_out), otherwise we need to traverse from
tcp_rtx_queue_head in tcp_xmit_retransmit_queue().

Packetdrill to demonstrate:

// Disable RACK and set max_reordering to keep things simple
    0 `sysctl -q net.ipv4.tcp_recovery=0`
   +0 `sysctl -q net.ipv4.tcp_max_reordering=3`

// Establish a connection
   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

  +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
 +.01 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 8 data segments
   +0 write(4, ..., 8000) = 8000
   +0 > P. 1:8001(8000) ack 1

// Enter recovery and 1:3001 is marked lost
 +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>

// Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
   +0 > . 1:1001(1000) ack 1

// 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
 +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
// Now retransmit_skb_hint points to 4001:5001 which is now marked lost

// BUG: 2001:3001 was not retransmitted
   +0 > . 2001:3001(1000) ack 1

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 22:34:31 +01:00
David S. Miller eb507906fe A few fixes:
* -O3 enablement fallout, thanks to Arnd who ran this
  * fixes for a few leaks, thanks to Felix
  * channel 12 regulatory fix for custom regdomains
  * check for a crash reported by syzbot
    (NULL function is called on drivers that don't have it)
  * fix TKIP replay protection after setup with some APs
    (from Jouni)
  * restrict obtaining some mesh data to avoid WARN_ONs
  * fix deadlocks with auto-disconnect (socket owner)
  * fix radar detection events with multiple devices
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl4e4voACgkQB8qZga/f
 l8RVFg/+Ngjzje8xU6Ymh7VSwvBBz6CmSVwH/nRSsIAszDrQLoDCo7+eWeUHnAKG
 2ViFQ4Lf5CP2J9kSgi7pbD0yGQRDGxH1CVSxMB64uZZRErlOCZYLRTIeBn7HGb8w
 JhqAVXBjskzP9iB5jUZIkOgJ6L2ESM2KYKQVGtgn8p9fFdHKitec8UEKiPU3tcX1
 zx94EbKldynuvDNb7VkTm4MdoY8LK5KI8ovVdVKwJ05qCtLvc6xvEBYhm33Fl6Pe
 66cWklCrU093jQXUDTk0yBkwlcXGxJc143y1nXuurI8ZLH6q2wdDUAMHUEKdL6mz
 ktHxABK4JA4Z/ylphR2Tp1TNh8xnz/ZZ5IdAe4P9K0/PerrjH8Yk7zf/48lrHTSv
 LKb86Oaoq4MY4k5UNyjDUdgPB/q8CgxN2xDQaeQX69u740wrlxYtrwgHmqHrKI69
 ppCzL7a3wmJzAz/97TMFhSseCH1WEQfWY+tlzktrvRofSgOGsIUtcPP10NmvGZHj
 zdF79E9TiUPX4KVWssRb9A2MjBbvKA31hCQbH/OVKSWt/fItoBUbMQKtPxfnHcam
 /W/gCWrOpw5Qqhc/BdFbYYsgY9J06KzifsV5e2Z6he+6Ex+rtdZNoLXc6/4QH3+9
 HzsVYBP+ZUaqrMJi64KDN59C3ni4jDVs9zfYnVkC9LhzHASI5EE=
 =i20C
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few fixes:
 * -O3 enablement fallout, thanks to Arnd who ran this
 * fixes for a few leaks, thanks to Felix
 * channel 12 regulatory fix for custom regdomains
 * check for a crash reported by syzbot
   (NULL function is called on drivers that don't have it)
 * fix TKIP replay protection after setup with some APs
   (from Jouni)
 * restrict obtaining some mesh data to avoid WARN_ONs
 * fix deadlocks with auto-disconnect (socket owner)
 * fix radar detection events with multiple devices
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 04:12:00 -08:00
Ulrich Weber 4e4362d2bf xfrm: support output_mark for offload ESP packets
Commit 9b42c1f179 ("xfrm: Extend the output_mark") added output_mark
support but missed ESP offload support.

xfrm_smark_get() is not called within xfrm_input() for packets coming
from esp4_gro_receive() or esp6_gro_receive(). Therefore call
xfrm_smark_get() directly within these functions.

Fixes: 9b42c1f179 ("xfrm: Extend the output_mark to support input direction and masking.")
Signed-off-by: Ulrich Weber <ulrich.weber@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-15 12:18:35 +01:00
Felix Fietkau 81c044fc3b cfg80211: fix page refcount issue in A-MSDU decap
The fragments attached to a skb can be part of a compound page. In that case,
page_ref_inc will increment the refcount for the wrong page. Fix this by
using get_page instead, which calls page_ref_inc on the compound head and
also checks for overflow.

Fixes: 2b67f944f8 ("cfg80211: reuse existing page fragments in A-MSDU rx")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200113182107.20461-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:35 +01:00
Johannes Berg 24953de0a5 cfg80211: check for set_wiphy_params
Check if set_wiphy_params is assigned and return an error if not,
some drivers (e.g. virt_wifi where syzbot reported it) don't have
it.

Reported-by: syzbot+e8a797964a4180eb57d5@syzkaller.appspotmail.com
Reported-by: syzbot+34b582cf32c1db008f8e@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200113125358.ac07f276efff.Ibd85ee1b12e47b9efb00a2adc5cd3fac50da791a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:24 +01:00
Felix Fietkau df16737d43 cfg80211: fix memory leak in cfg80211_cqm_rssi_update
The per-tid statistics need to be released after the call to rdev_get_station

Cc: stable@vger.kernel.org
Fixes: 8689c051a2 ("cfg80211: dynamically allocate per-tid stats for station info")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200108170630.33680-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:12 +01:00
Felix Fietkau 2a279b3416 cfg80211: fix memory leak in nl80211_probe_mesh_link
The per-tid statistics need to be released after the call to rdev_get_station

Cc: stable@vger.kernel.org
Fixes: 5ab92e7fe4 ("cfg80211: add support to probe unexercised mesh link")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200108170630.33680-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:02 +01:00
Markus Theil 5a128a088a cfg80211: fix deadlocks in autodisconnect work
Use methods which do not try to acquire the wdev lock themselves.

Cc: stable@vger.kernel.org
Fixes: 37b1c00468 ("cfg80211: Support all iftypes in autodisconnect_wk")
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200108115536.2262-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:49 +01:00
Arnd Bergmann e16119655c wireless: wext: avoid gcc -O3 warning
After the introduction of CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3,
the wext code produces a bogus warning:

In function 'iw_handler_get_iwstats',
    inlined from 'ioctl_standard_call' at net/wireless/wext-core.c:1015:9,
    inlined from 'wireless_process_ioctl' at net/wireless/wext-core.c:935:10,
    inlined from 'wext_ioctl_dispatch.part.8' at net/wireless/wext-core.c:986:8,
    inlined from 'wext_handle_ioctl':
net/wireless/wext-core.c:671:3: error: argument 1 null where non-null expected [-Werror=nonnull]
   memcpy(extra, stats, sizeof(struct iw_statistics));
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/x86/include/asm/string.h:5,
net/wireless/wext-core.c: In function 'wext_handle_ioctl':
arch/x86/include/asm/string_64.h:14:14: note: in a call to function 'memcpy' declared here

The problem is that ioctl_standard_call() sometimes calls the handler
with a NULL argument that would cause a problem for iw_handler_get_iwstats.
However, iw_handler_get_iwstats never actually gets called that way.

Marking that function as noinline avoids the warning and leads
to slightly smaller object code as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20200107200741.3588770-1-arnd@arndb.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:20 +01:00
Jouni Malinen 6f60126521 mac80211: Fix TKIP replay protection immediately after key setup
TKIP replay protection was skipped for the very first frame received
after a new key is configured. While this is potentially needed to avoid
dropping a frame in some cases, this does leave a window for replay
attacks with group-addressed frames at the station side. Any earlier
frame sent by the AP using the same key would be accepted as a valid
frame and the internal RSC would then be updated to the TSC from that
frame. This would allow multiple previously transmitted group-addressed
frames to be replayed until the next valid new group-addressed frame
from the AP is received by the station.

Fix this by limiting the no-replay-protection exception to apply only
for the case where TSC=0, i.e., when this is for the very first frame
protected using the new key, and the local RSC had not been set to a
higher value when configuring the key (which may happen with GTK).

Signed-off-by: Jouni Malinen <j@w1.fi>
Link: https://lore.kernel.org/r/20200107153545.10934-1-j@w1.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:12 +01:00
Orr Mazor 26ec17a1dc cfg80211: Fix radar event during another phy CAC
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.

If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.

If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.

To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant

Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:50:48 +01:00
Ganapathi Bhat c4b9d655e4 wireless: fix enabling channel 12 for custom regulatory domain
Commit e33e2241e2 ("Revert "cfg80211: Use 5MHz bandwidth by
default when checking usable channels"") fixed a broken
regulatory (leaving channel 12 open for AP where not permitted).
Apply a similar fix to custom regulatory domain processing.

Signed-off-by: Cathy Luo <xiaohua.luo@nxp.com>
Signed-off-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Link: https://lore.kernel.org/r/1576836859-8945-1-git-send-email-ganapathi.bhat@nxp.com
[reword commit message, fix coding style, add a comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:47:52 +01:00
Sunil Muthuswamy c742c59e1f hv_sock: Remove the accept port restriction
Currently, hv_sock restricts the port the guest socket can accept
connections on. hv_sock divides the socket port namespace into two parts
for server side (listening socket), 0-0x7FFFFFFF & 0x80000000-0xFFFFFFFF
(there are no restrictions on client port namespace). The first part
(0-0x7FFFFFFF) is reserved for sockets where connections can be accepted.
The second part (0x80000000-0xFFFFFFFF) is reserved for allocating ports
for the peer (host) socket, once a connection is accepted.
This reservation of the port namespace is specific to hv_sock and not
known by the generic vsock library (ex: af_vsock). This is problematic
because auto-binds/ephemeral ports are handled by the generic vsock
library and it has no knowledge of this port reservation and could
allocate a port that is not compatible with hv_sock (and legitimately so).
The issue hasn't surfaced so far because the auto-bind code of vsock
(__vsock_bind_stream) prior to the change 'VSOCK: bind to random port for
VMADDR_PORT_ANY' would start walking up from LAST_RESERVED_PORT (1023) and
start assigning ports. That will take a large number of iterations to hit
0x7FFFFFFF. But, after the above change to randomize port selection, the
issue has started coming up more frequently.
There has really been no good reason to have this port reservation logic
in hv_sock from the get go. Reserving a local port for peer ports is not
how things are handled generally. Peer ports should reflect the peer port.
This fixes the issue by lifting the port reservation, and also returns the
right peer port. Since the code converts the GUID to the peer port (by
using the first 4 bytes), there is a possibility of conflicts, but that
seems like a reasonable risk to take, given this is limited to vsock and
that only applies to all local sockets.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:50:53 -08:00
Chuck Lever 671c450b6f xprtrdma: Fix oops in Receive handler after device removal
Since v5.4, a device removal occasionally triggered this oops:

Dec  2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
Dec  2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
Dec  2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
Dec  2 17:13:53 manet kernel: PGD 0 P4D 0
Dec  2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
Dec  2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G        W         5.4.0-00050-g53717e43af61 #883
Dec  2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Dec  2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Dec  2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
Dec  2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
Dec  2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
Dec  2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
Dec  2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
Dec  2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
Dec  2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
Dec  2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
Dec  2 17:13:53 manet kernel: FS:  0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
Dec  2 17:13:53 manet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
Dec  2 17:13:53 manet kernel: Call Trace:
Dec  2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
Dec  2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
Dec  2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: kthread+0xf4/0xf9
Dec  2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
Dec  2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
is still pointing to the old ib_device, which has been freed. The
only way that is possible is if this rpcrdma_rep was not destroyed
by rpcrdma_ia_remove.

Debugging showed that was indeed the case: this rpcrdma_rep was
still in use by a completing RPC at the time of the device removal,
and thus wasn't on the rep free list. So, it was not found by
rpcrdma_reps_destroy().

The fix is to introduce a list of all rpcrdma_reps so that they all
can be found when a device is removed. That list is used to perform
only regbuf DMA unmapping, replacing that call to
rpcrdma_reps_destroy().

Meanwhile, to prevent corruption of this list, I've moved the
destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
protecting the rb_all_reps list.

Fixes: b0b227f071 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Chuck Lever 13cb886c59 xprtrdma: Fix completion wait during device removal
I've found that on occasion, "rmmod <dev>" will hang while if an NFS
is under load.

Ensure that ri_remove_done is initialized only just before the
transport is woken up to force a close. This avoids the completion
possibly getting initialized again while the CM event handler is
waiting for a wake-up.

Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from under an NFS mount")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Chuck Lever b32b9ed493 xprtrdma: Fix create_qp crash on device unload
On device re-insertion, the RDMA device driver crashes trying to set
up a new QP:

Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G        W         5.4.0 #852
Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 <f0> 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
Nov 27 16:32:06 manet kernel: FS:  0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
Nov 27 16:32:06 manet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
Nov 27 16:32:06 manet kernel: Call Trace:
Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

The fix is to copy the qp_init_attr struct that was just created by
rpcrdma_ep_create() instead of using the one from the previous
connection instance.

Fixes: 98ef77d1aa ("xprtrdma: Send Queue size grows after a reconnect")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Xu Wang 8aaea2b042 xfrm: interface: do not confirm neighbor when do pmtu update
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Nicolas Dichtel f042365dbf xfrm interface: fix packet tx through bpf_redirect()
With an ebpf program that redirects packets through a xfrm interface,
packets are dropped because no dst is attached to skb.

This could also be reproduced with an AF_PACKET socket, with the following
python script (xfrm1 is a xfrm interface):

 import socket
 send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
 # scapy
 # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
 # raw(p)
 req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
 send_s.sendto(req, ('xfrm1', 0x800, 0, 0))

It was also not possible to send an ip packet through an AF_PACKET socket
because a LL header was expected. Let's remove those LL header constraints.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Nicolas Dichtel 95224166a9 vti[6]: fix packet tx through bpf_redirect()
With an ebpf program that redirects packets through a vti[6] interface,
the packets are dropped because no dst is attached.

This could also be reproduced with an AF_PACKET socket, with the following
python script (vti1 is an ip_vti interface):

 import socket
 send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
 # scapy
 # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
 # raw(p)
 req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
 send_s.sendto(req, ('vti1', 0x800, 0, 0))

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Florian Westphal 212e7f5660 netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct
An earlier commit (1b789577f6,
"netfilter: arp_tables: init netns pointer in xt_tgchk_param struct")
fixed missing net initialization for arptables, but turns out it was
incomplete.  We can get a very similar struct net NULL deref during
error unwinding:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77
 xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175
 cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline]
 translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587
 do_replace net/ipv4/netfilter/arp_tables.c:981 [inline]
 do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461

Also init the netns pointer in xt_tgdtor_param struct.

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-13 19:22:10 +01:00
Cong Wang c120959387 netfilter: fix a use-after-free in mtype_destroy()
map->members is freed by ip_set_free() right before using it in
mtype_ext_cleanup() again. So we just have to move it down.

Reported-by: syzbot+4c3cc6dbe7259dbf9054@syzkaller.appspotmail.com
Fixes: 40cd63bf33 ("netfilter: ipset: Support extensions which need a per data destroy function")
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-13 18:53:59 +01:00
Jacob Keller b0efcae5e1 devlink: correct misspelling of snapshot
The function to obtain a unique snapshot id was mistakenly typo'd as
devlink_region_shapshot_id_get. Fix this typo by renaming the function
and all of its users.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-11 14:30:24 -08:00
Ido Schimmel 4c582234ab devlink: Wait longer before warning about unset port type
The commit cited below causes devlink to emit a warning if a type was
not set on a devlink port for longer than 30 seconds to "prevent
misbehavior of drivers". This proved to be problematic when
unregistering the backing netdev. The flow is always:

devlink_port_type_clear()	// schedules the warning
unregister_netdev()		// blocking
devlink_port_unregister()	// cancels the warning

The call to unregister_netdev() can block for long periods of time for
various reasons: RTNL lock is contended, large amounts of configuration
to unroll following dismantle of the netdev, etc. This results in
devlink emitting a warning despite the driver behaving correctly.

In emulated environments (of future hardware) which are usually very
slow, the warning can also be emitted during port creation as more than
30 seconds can pass between the time the devlink port is registered and
when its type is set.

In addition, syzbot has hit this warning [1] 1974 times since 07/11/19
without being able to produce a reproducer. Probably because
reproduction depends on the load or other bugs (e.g., RTNL not being
released).

To prevent bogus warnings, increase the timeout to 1 hour.

[1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb

Fixes: 136bf27fc0 ("devlink: add warning in case driver does not set port type")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com
Reported-by: Alex Veber <alexve@mellanox.com>
Tested-by: Alex Veber <alexve@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 16:45:40 -08:00
David Ahern 9827c0634e ipv4: Detect rollover in specific fib table dump
Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has
been added to a table. The looping is caused by the key rolling over from
FFFFFFFF to 0. When dumping a specific table only, we need a means to detect
when the table dump is done. The key and count saved to cb args are both 0
only at the start of the table dump. If key is 0 and count > 0, then we are
in the rollover case. Detect and return to avoid looping.

This only affects dumps of a specific table; for dumps of all tables
(the case prior to the change in the Fixes tag) inet_dump_fib moved
the entry counter to the next table and reset the cb args used by
fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back
to 0 did not cause looping with the dumps.

Fixes: effe679266 ("net: Enable kernel side filtering of route dumps")
Reported-by: Sven-Haegar Koch <haegar@sdinet.de>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:36:36 -08:00
Jakub Kicinski db885e66d2 net/tls: fix async operation
Mallesham reports the TLS with async accelerator was broken by
commit d10523d0b3 ("net/tls: free the record on encryption error")
because encryption can return -EINPROGRESS in such setups, which
should not be treated as an error.

The error is also present in the BPF path (likely copied from there).

Reported-by: Mallesham Jatharakonda <mallesham.jatharakonda@oneconvergence.com>
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Fixes: d10523d0b3 ("net/tls: free the record on encryption error")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:20:46 -08:00
Jakub Kicinski 5c5d22a750 net/tls: avoid spurious decryption error with HW resync
When device loses sync mid way through a record - kernel
has to re-encrypt the part of the record which the device
already decrypted to be able to decrypt and authenticate
the record in its entirety.

The re-encryption piggy backs on the decryption routine,
but obviously because the partially decrypted record can't
be authenticated crypto API returns an error which is then
ignored by tls_device_reencrypt().

Commit 5c5ec66858 ("net/tls: add TlsDecryptError stat")
added a statistic to count decryption errors, this statistic
can't be incremented when we see the expected re-encryption
error. Move the inc to the caller.

Reported-and-tested-by: David Beckett <david.beckett@netronome.com>
Fixes: 5c5ec66858 ("net/tls: add TlsDecryptError stat")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:18:15 -08:00
Lorenz Bauer 2e012c7482 net: bpf: Don't leak time wait and request sockets
It's possible to leak time wait and request sockets via the following
BPF pseudo code:
 
  sk = bpf_skc_lookup_tcp(...)
  if (sk)
    bpf_sk_release(sk)

If sk->sk_state is TCP_NEW_SYN_RECV or TCP_TIME_WAIT the refcount taken
by bpf_skc_lookup_tcp is not undone by bpf_sk_release. This is because
sk_flags is re-used for other data in both kinds of sockets. The check

  !sock_flag(sk, SOCK_RCU_FREE)

therefore returns a bogus result. Check that sk_flags is valid by calling
sk_fullsock. Skip checking SOCK_RCU_FREE if we already know that sk is
not a full socket.

Fixes: edbf8c01de ("bpf: add skc_lookup_tcp helper")
Fixes: f7355a6c04 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200110132336.26099-1-lmb@cloudflare.com
2020-01-10 10:42:23 -08:00
Martin Schiller e21dba7a4d net/x25: fix nonblocking connect
This patch fixes 2 issues in x25_connect():

1. It makes absolutely no sense to reset the neighbour and the
connection state after a (successful) nonblocking call of x25_connect.
This prevents any connection from being established, since the response
(call accept) cannot be processed.

2. Any further calls to x25_connect() while a call is pending should
simply return, instead of creating new Call Request (on different
logical channels).

This patch should also fix the "KASAN: null-ptr-deref Write in
x25_connect" and "BUG: unable to handle kernel NULL pointer dereference
in x25_connect" bugs reported by syzbot.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Reported-by: syzbot+429c200ffc8772bfe070@syzkaller.appspotmail.com
Reported-by: syzbot+eec0c87f31a7c3b66f7b@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:39:33 -08:00
Lingpeng Chen e7a5f1f1cd bpf/sockmap: Read psock ingress_msg before sk_receive_queue
Right now in tcp_bpf_recvmsg, sock read data first from sk_receive_queue
if not empty than psock->ingress_msg otherwise. If a FIN packet arrives
and there's also some data in psock->ingress_msg, the data in
psock->ingress_msg will be purged. It is always happen when request to a
HTTP1.0 server like python SimpleHTTPServer since the server send FIN
packet after data is sent out.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: Lingpeng Chen <forrest0579@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
2020-01-09 23:13:48 +01:00
Tuong Lien 9546a0b7ce tipc: fix wrong connect() return code
The current 'tipc_wait_for_connect()' function does a wait-loop for the
condition 'sk->sk_state != TIPC_CONNECTING' to conclude if the socket
connecting has done. However, when the condition is met, it returns '0'
even in the case the connecting is actually failed, the socket state is
set to 'TIPC_DISCONNECTING' (e.g. when the server socket has closed..).
This results in a wrong return code for the 'connect()' call from user,
making it believe that the connection is established and go ahead with
building, sending a message, etc. but finally failed e.g. '-EPIPE'.

This commit fixes the issue by changing the wait condition to the
'tipc_sk_connected(sk)', so the function will return '0' only when the
connection is really established. Otherwise, either the socket 'sk_err'
if any or '-ETIMEDOUT'/'-EINTR' will be returned correspondingly.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:57:35 -08:00
Tuong Lien 49afb806cb tipc: fix link overflow issue at socket shutdown
When a socket is suddenly shutdown or released, it will reject all the
unreceived messages in its receive queue. This applies to a connected
socket too, whereas there is only one 'FIN' message required to be sent
back to its peer in this case.

In case there are many messages in the queue and/or some connections
with such messages are shutdown at the same time, the link layer will
easily get overflowed at the 'TIPC_SYSTEM_IMPORTANCE' backlog level
because of the message rejections. As a result, the link will be taken
down. Moreover, immediately when the link is re-established, the socket
layer can continue to reject the messages and the same issue happens...

The commit refactors the '__tipc_shutdown()' function to only send one
'FIN' in the situation mentioned above. For the connectionless case, it
is unavoidable but usually there is no rejections for such socket
messages because they are 'dest-droppable' by default.

In addition, the new code makes the other socket states clear
(e.g.'TIPC_LISTEN') and treats as a separate case to avoid misbehaving.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:57:07 -08:00
David S. Miller b73a65610b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing netns context in arp_tables, from Florian Westphal.

2) Underflow in flowtable reference counter, from wenxu.

3) Fix incorrect ethernet destination address in flowtable offload,
   from wenxu.

4) Check for status of neighbour entry, from wenxu.

5) Fix NAT port mangling, from wenxu.

6) Unbind callbacks from destroy path to cleanup hardware properly
   on flowtable removal.

7) Fix missing casting statistics timestamp, add nf_flowtable_time_stamp
   and use it.

8) NULL pointer exception when timeout argument is null in conntrack
   dccp and sctp protocol helpers, from Florian Westphal.

9) Possible nul-dereference in ipset with IPSET_ATTR_LINENO, also from
   Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:22:41 -08:00
Florian Westphal 22dad713b8 netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
The set uadt functions assume lineno is never NULL, but it is in
case of ip_set_utest().

syzkaller managed to generate a netlink message that calls this with
LINENO attr present:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:hash_mac4_uadt+0x1bc/0x470 net/netfilter/ipset/ip_set_hash_mac.c:104
Call Trace:
 ip_set_utest+0x55b/0x890 net/netfilter/ipset/ip_set_core.c:1867
 nfnetlink_rcv_msg+0xcf2/0xfb0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 nfnetlink_rcv+0x1ba/0x460 net/netfilter/nfnetlink.c:563

pass a dummy lineno storage, its easier than patching all set
implementations.

This seems to be a day-0 bug.

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Reported-by: syzbot+34bd2369d38707f3f4a7@syzkaller.appspotmail.com
Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-08 23:31:46 +01:00
Florian Westphal 1d9a7acd3d netfilter: conntrack: dccp, sctp: handle null timeout argument
The timeout pointer can be NULL which means we should modify the
per-nets timeout instead.

All do this, except sctp and dccp which instead give:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
net/netfilter/nf_conntrack_proto_dccp.c:682
 ctnl_timeout_parse_policy+0x150/0x1d0 net/netfilter/nfnetlink_cttimeout.c:67
 cttimeout_default_set+0x150/0x1c0 net/netfilter/nfnetlink_cttimeout.c:368
 nfnetlink_rcv_msg+0xcf2/0xfb0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477

Reported-by: syzbot+46a4ad33f345d1dd346e@syzkaller.appspotmail.com
Fixes: c779e84960 ("netfilter: conntrack: remove get_timeout() indirection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-08 23:31:22 +01:00
Petr Machata 240ce7f642 net: sch_prio: When ungrafting, replace with FIFO
When a child Qdisc is removed from one of the PRIO Qdisc's bands, it is
replaced unconditionally by a NOOP qdisc. As a result, any traffic hitting
that band gets dropped. That is incorrect--no Qdisc was explicitly added
when PRIO was created, and after removal, none should have to be added
either.

Fix PRIO by first attempting to create a default Qdisc and only falling
back to noop when that fails. This pattern of attempting to create an
invisible FIFO, using NOOP only as a fallback, is also seen in other
Qdiscs.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:45:53 -08:00
Eric Dumazet d9e15a2733 pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM
As diagnosed by Florian :

If TCA_FQ_QUANTUM is set to 0x80000000, fq_deueue()
can loop forever in :

if (f->credit <= 0) {
  f->credit += q->quantum;
  goto begin;
}

... because f->credit is either 0 or -2147483648.

Let's limit TCA_FQ_QUANTUM to no more than 1 << 20 :
This max value should limit risks of breaking user setups
while fixing this bug.

Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Florian Westphal <fw@strlen.de>
Reported-by: syzbot+dc9071cc5a85950bdfce@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:40:47 -08:00
Masahiro Yamada b969fee12b tipc: remove meaningless assignment in Makefile
There is no module named tipc_diag.

The assignment to tipc_diag-y has no effect.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:38:54 -08:00
Masahiro Yamada ea04b445a2 tipc: do not add socket.o to tipc-y twice
net/tipc/Makefile adds socket.o twice.

tipc-y	+= addr.o bcast.o bearer.o \
           core.o link.o discover.o msg.o  \
           name_distr.o  subscr.o monitor.o name_table.o net.o  \
           netlink.o netlink_compat.o node.o socket.o eth_media.o \
                                             ^^^^^^^^
           topsrv.o socket.o group.o trace.o
                    ^^^^^^^^

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:38:54 -08:00
Eric Dumazet eb8ef2a3c5 vlan: vlan_changelink() should propagate errors
Both vlan_dev_change_flags() and vlan_dev_set_egress_priority()
can return an error. vlan_changelink() should not ignore them.

Fixes: 07b5b17e15 ("[VLAN]: Use rtnl_link API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-07 13:35:14 -08:00
Eric Dumazet 9bbd917e0b vlan: fix memory leak in vlan_dev_set_egress_priority
There are few cases where the ndo_uninit() handler might be not
called if an error happens while device is initialized.

Since vlan_newlink() calls vlan_changelink() before
trying to register the netdevice, we need to make sure
vlan_dev_uninit() has been called at least once,
or we might leak allocated memory.

BUG: memory leak
unreferenced object 0xffff888122a206c0 (size 32):
  comm "syz-executor511", pid 7124, jiffies 4294950399 (age 32.240s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 61 73 00 00 00 00 00 00 00 00  ......as........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000000eb3bb85>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<000000000eb3bb85>] slab_post_alloc_hook mm/slab.h:586 [inline]
    [<000000000eb3bb85>] slab_alloc mm/slab.c:3320 [inline]
    [<000000000eb3bb85>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
    [<000000007b99f620>] kmalloc include/linux/slab.h:556 [inline]
    [<000000007b99f620>] vlan_dev_set_egress_priority+0xcc/0x150 net/8021q/vlan_dev.c:194
    [<000000007b0cb745>] vlan_changelink+0xd6/0x140 net/8021q/vlan_netlink.c:126
    [<0000000065aba83a>] vlan_newlink+0x135/0x200 net/8021q/vlan_netlink.c:181
    [<00000000fb5dd7a2>] __rtnl_newlink+0x89a/0xb80 net/core/rtnetlink.c:3305
    [<00000000ae4273a1>] rtnl_newlink+0x4e/0x80 net/core/rtnetlink.c:3363
    [<00000000decab39f>] rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
    [<00000000accba4ee>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
    [<00000000319fe20f>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
    [<00000000d51938dc>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    [<00000000d51938dc>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
    [<00000000e539ac79>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
    [<000000006250c27e>] sock_sendmsg_nosec net/socket.c:639 [inline]
    [<000000006250c27e>] sock_sendmsg+0x54/0x70 net/socket.c:659
    [<00000000e2a156d1>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
    [<000000008c87466e>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
    [<00000000110e3054>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
    [<00000000d71077c8>] __do_sys_sendmsg net/socket.c:2426 [inline]
    [<00000000d71077c8>] __se_sys_sendmsg net/socket.c:2424 [inline]
    [<00000000d71077c8>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424

Fixe: 07b5b17e15 ("[VLAN]: Use rtnl_link API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-07 13:35:14 -08:00
Xin Long be7a772920 sctp: free cmd->obj.chunk for the unprocessed SCTP_CMD_REPLY
This patch is to fix a memleak caused by no place to free cmd->obj.chunk
for the unprocessed SCTP_CMD_REPLY. This issue occurs when failing to
process a cmd while there're still SCTP_CMD_REPLY cmds on the cmd seq
with an allocated chunk in cmd->obj.chunk.

So fix it by freeing cmd->obj.chunk for each SCTP_CMD_REPLY cmd left on
the cmd seq when any cmd returns error. While at it, also remove 'nomem'
label.

Reported-by: syzbot+107c4aff5f392bf1517f@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06 13:28:37 -08:00
Ying Xue a7869e5f91 tipc: eliminate KMSAN: uninit-value in __tipc_nl_compat_dumpit error
syzbot found the following crash on:
=====================================================
BUG: KMSAN: uninit-value in __nlmsg_parse include/net/netlink.h:661 [inline]
BUG: KMSAN: uninit-value in nlmsg_parse_deprecated
include/net/netlink.h:706 [inline]
BUG: KMSAN: uninit-value in __tipc_nl_compat_dumpit+0x553/0x11e0
net/tipc/netlink_compat.c:215
CPU: 0 PID: 12425 Comm: syz-executor062 Not tainted 5.5.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x220 lib/dump_stack.c:118
  kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
  __msan_warning+0x57/0xa0 mm/kmsan/kmsan_instr.c:245
  __nlmsg_parse include/net/netlink.h:661 [inline]
  nlmsg_parse_deprecated include/net/netlink.h:706 [inline]
  __tipc_nl_compat_dumpit+0x553/0x11e0 net/tipc/netlink_compat.c:215
  tipc_nl_compat_dumpit+0x761/0x910 net/tipc/netlink_compat.c:308
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1252 [inline]
  tipc_nl_compat_recv+0x12e9/0x2870 net/tipc/netlink_compat.c:1311
  genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
  genl_rcv_msg+0x1dd0/0x23a0 net/netlink/genetlink.c:734
  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:745
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0xfa0/0x1100 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x11f0/0x1480 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:639 [inline]
  sock_sendmsg net/socket.c:659 [inline]
  ____sys_sendmsg+0x1362/0x13f0 net/socket.c:2330
  ___sys_sendmsg net/socket.c:2384 [inline]
  __sys_sendmsg+0x4f0/0x5e0 net/socket.c:2417
  __do_sys_sendmsg net/socket.c:2426 [inline]
  __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
  __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
  do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x444179
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 1b d8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd2d6409c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000444179
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 00000000006ce018 R08: 0000000000000000 R09: 00000000004002e0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401e20
R13: 0000000000401eb0 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline]
  kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132
  kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:86
  slab_alloc_node mm/slub.c:2774 [inline]
  __kmalloc_node_track_caller+0xe47/0x11f0 mm/slub.c:4382
  __kmalloc_reserve net/core/skbuff.c:141 [inline]
  __alloc_skb+0x309/0xa50 net/core/skbuff.c:209
  alloc_skb include/linux/skbuff.h:1049 [inline]
  nlmsg_new include/net/netlink.h:888 [inline]
  tipc_nl_compat_dumpit+0x6e4/0x910 net/tipc/netlink_compat.c:301
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1252 [inline]
  tipc_nl_compat_recv+0x12e9/0x2870 net/tipc/netlink_compat.c:1311
  genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
  genl_rcv_msg+0x1dd0/0x23a0 net/netlink/genetlink.c:734
  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:745
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0xfa0/0x1100 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x11f0/0x1480 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:639 [inline]
  sock_sendmsg net/socket.c:659 [inline]
  ____sys_sendmsg+0x1362/0x13f0 net/socket.c:2330
  ___sys_sendmsg net/socket.c:2384 [inline]
  __sys_sendmsg+0x4f0/0x5e0 net/socket.c:2417
  __do_sys_sendmsg net/socket.c:2426 [inline]
  __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
  __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
  do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
=====================================================

The complaint above occurred because the memory region pointed by attrbuf
variable was not initialized. To eliminate this warning, we use kcalloc()
rather than kmalloc_array() to allocate memory for attrbuf.

Reported-by: syzbot+b1fd2bf2c89d8407e15f@syzkaller.appspotmail.com
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06 13:24:31 -08:00
Pablo Neira Ayuso fb46f1b780 netfilter: flowtable: add nf_flowtable_time_stamp
This patch adds nf_flowtable_time_stamp and updates the existing code to
use it.

This patch is also implicitly fixing up hardware statistic fetching via
nf_flow_offload_stats() where casting to u32 is missing. Use
nf_flow_timeout_delta() to fix this.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-06 10:30:46 +01:00
Carl Huang ce57785bf9 net: qrtr: fix len of skb_put_padto in qrtr_node_enqueue
The len used for skb_put_padto is wrong, it need to add len of hdr.

In qrtr_node_enqueue, local variable size_t len is assign with
skb->len, then skb_push(skb, sizeof(*hdr)) will add skb->len with
sizeof(*hdr), so local variable size_t len is not same with skb->len
after skb_push(skb, sizeof(*hdr)).

Then the purpose of skb_put_padto(skb, ALIGN(len, 4)) is to add add
pad to the end of the skb's data if skb->len is not aligned to 4, but
unfortunately it use len instead of skb->len, at this line, skb->len
is 32 bytes(sizeof(*hdr)) more than len, for example, len is 3 bytes,
then skb->len is 35 bytes(3 + 32), and ALIGN(len, 4) is 4 bytes, so
__skb_put_padto will do nothing after check size(35) < len(4), the
correct value should be 36(sizeof(*hdr) + ALIGN(len, 4) = 32 + 4),
then __skb_put_padto will pass check size(35) < len(36) and add 1 byte
to the end of skb's data, then logic is correct.

function of skb_push:
void *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	if (unlikely(skb->data < skb->head))
		skb_under_panic(skb, len, __builtin_return_address(0));
	return skb->data;
}

function of skb_put_padto
static inline int skb_put_padto(struct sk_buff *skb, unsigned int len)
{
	return __skb_put_padto(skb, len, true);
}

function of __skb_put_padto
static inline int __skb_put_padto(struct sk_buff *skb, unsigned int len,
				  bool free_on_error)
{
	unsigned int size = skb->len;

	if (unlikely(size < len)) {
		len -= size;
		if (__skb_pad(skb, len, free_on_error))
			return -ENOMEM;
		__skb_put(skb, len);
	}
	return 0;
}

Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 14:46:05 -08:00
Pablo Neira Ayuso 5acab91458 netfilter: nf_tables: unbind callbacks from flowtable destroy path
Callback unbinding needs to be done after nf_flow_table_free(),
otherwise entries are not removed from the hardware.

Update nft_unregister_flowtable_net_hooks() to call
nf_unregister_net_hook() instead since the commit/abort paths do not
deal with the callback unbinding anymore.

Add a comment to nft_flowtable_event() to clarify that
flow_offload_netdev_event() already removes the entries before the
callback unbinding.

Fixes: 8bb69f3b29 ("netfilter: nf_tables: add flowtable offload control plane")
Fixes ff4bf2f42a ("netfilter: nf_tables: add nft_unregister_flowtable_hook()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-05 10:06:49 +01:00
wenxu 73327d47d2 netfilter: nf_flow_table_offload: fix the nat port mangle.
Shift on 32-bit word to define the port number depends on the flow
direction.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Fixes: 7acd9378dc ("netfilter: nf_flow_table_offload: Correct memcpy size for flow_overload_mangle()")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:34 +01:00
wenxu f31ad71c44 netfilter: nf_flow_table_offload: check the status of dst_neigh
It is better to get the dst_neigh with neigh->lock and check the
nud_state is VALID. If there is not neigh previous, the lookup will
Create a non NUD_VALID with 00:00:00:00:00:00 mac.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:31 +01:00
wenxu 1b67e50601 netfilter: nf_flow_table_offload: fix incorrect ethernet dst address
Ethernet destination for original traffic takes the source ethernet address
in the reply direction. For reply traffic, this takes the source
ethernet address of the original direction.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:27 +01:00
wenxu 8ca79606cd netfilter: nft_flow_offload: fix underflow in flowtable reference counter
The .deactivate and .activate interfaces already deal with the reference
counter. Otherwise, this results in spurious "Device is busy" errors.

Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:22 +01:00
Wen Yang 68aab823c2 sch_cake: avoid possible divide by zero in cake_enqueue()
The variables 'window_interval' is u64 and do_div()
truncates it to 32 bits, which means it can test
non-zero and be truncated to zero for division.
The unit of window_interval is nanoseconds,
so its lower 32-bit is relatively easy to exceed.
Fix this issue by using div64_u64() instead.

Fixes: 7298de9cd7 ("sch_cake: Add ingress mode")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: cake@lists.bufferbloat.net
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 16:34:28 -08:00
Pengcheng Yang c9655008e7 tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK
When we receive a D-SACK, where the sequence number satisfies:
	undo_marker <= start_seq < end_seq <= prior_snd_una
we consider this is a valid D-SACK and tcp_is_sackblock_valid()
returns true, then this D-SACK is discarded as "old stuff",
but the variable first_sack_index is not marked as negative
in tcp_sacktag_write_queue().

If this D-SACK also carries a SACK that needs to be processed
(for example, the previous SACK segment was lost), this SACK
will be treated as a D-SACK in the following processing of
tcp_sacktag_write_queue(), which will eventually lead to
incorrect updates of undo_retrans and reordering.

Fixes: fd6dad616d ("[TCP]: Earlier SACK block verification & simplify access to them")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:42:28 -08:00
Markus Theil 02a6144996 mac80211: mesh: restrict airtime metric to peered established plinks
The following warning is triggered every time an unestablished mesh peer
gets dumped. Checks if a peer link is established before retrieving the
airtime link metric.

[ 9563.022567] WARNING: CPU: 0 PID: 6287 at net/mac80211/mesh_hwmp.c:345
               airtime_link_metric_get+0xa2/0xb0 [mac80211]
[ 9563.022697] Hardware name: PC Engines apu2/apu2, BIOS v4.10.0.3
[ 9563.022756] RIP: 0010:airtime_link_metric_get+0xa2/0xb0 [mac80211]
[ 9563.022838] Call Trace:
[ 9563.022897]  sta_set_sinfo+0x936/0xa10 [mac80211]
[ 9563.022964]  ieee80211_dump_station+0x6d/0x90 [mac80211]
[ 9563.023062]  nl80211_dump_station+0x154/0x2a0 [cfg80211]
[ 9563.023120]  netlink_dump+0x17b/0x370
[ 9563.023130]  netlink_recvmsg+0x2a4/0x480
[ 9563.023140]  ____sys_recvmsg+0xa6/0x160
[ 9563.023154]  ___sys_recvmsg+0x93/0xe0
[ 9563.023169]  __sys_recvmsg+0x7e/0xd0
[ 9563.023210]  do_syscall_64+0x4e/0x140
[ 9563.023217]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20191203180644.70653-1-markus.theil@tu-ilmenau.de
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-02 14:48:32 +01:00
Taehee Yoo 04b69426d8 hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
hsr slave interfaces don't have debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
is changed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link set dummy0 name ap

Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666]  dump_stack+0x96/0xdb
[21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666]  __kasan_report+0x12a/0x16f
[21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666]  kasan_report+0xe/0x20
[21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666]  ? __module_text_address+0x13/0x140
[21071.945897][T22666]  notifier_call_chain+0x90/0x160
[21071.946743][T22666]  dev_change_name+0x419/0x840
[21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
[21071.951991][T22666]  do_setlink+0x811/0x2ef0
[21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
[ ... ]

Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dc ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:36:27 -08:00
Davide Caratti a5b72a083d net/sched: add delete_empty() to filters and use it in cls_flower
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.

This reverts commit 275c44aa19.

Changes since v1:
 - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
   is used, thanks to Vlad Buslov
 - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
 - squash revert and new fix in a single patch, to be nice with bisect
   tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa19 ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:35:19 -08:00
Cambda Zhu 853697504d tcp: Fix highest_sack and highest_sack_seq
>From commit 50895b9de1 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when new skb
inserted into the rtx queue, because the NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
the next ACK with sack option will increase tp->reordering unexpectedly.

This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.

Fixes: 50895b9de1 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:28:39 -08:00
Florian Westphal 1b789577f6 netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
We get crash when the targets checkentry function tries to make
use of the network namespace pointer for arptables.

When the net pointer got added back in 2010, only ip/ip6/ebtables were
changed to initialize it, so arptables has this set to NULL.

This isn't a problem for normal arptables because no existing
arptables target has a checkentry function that makes use of par->net.

However, direct users of the setsockopt interface can provide any
target they want as long as its registered for ARP or UNPSEC protocols.

syzkaller managed to send a semi-valid arptables rule for RATEEST target
which is enough to trigger NULL deref:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: xt_rateest_tg_checkentry+0x11d/0xb40 net/netfilter/xt_RATEEST.c:109
[..]
 xt_check_target+0x283/0x690 net/netfilter/x_tables.c:1019
 check_target net/ipv4/netfilter/arp_tables.c:399 [inline]
 find_check_entry net/ipv4/netfilter/arp_tables.c:422 [inline]
 translate_table+0x1005/0x1d70 net/ipv4/netfilter/arp_tables.c:572
 do_replace net/ipv4/netfilter/arp_tables.c:977 [inline]
 do_arpt_set_ctl+0x310/0x640 net/ipv4/netfilter/arp_tables.c:1456

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+d7358a458d8a81aee898@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-30 13:09:04 +01:00
Shmulik Ladkani 70cf3dc731 net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
There's no skb_pull performed when a mirred action is set at egress of a
mac device, with a target device/action that expects skb->data to point
at the network header.

As a result, either the target device is errornously given an skb with
data pointing to the mac (egress case), or the net stack receives the
skb with data pointing to the mac (ingress case).

E.g:
 # tc qdisc add dev eth9 root handle 1: prio
 # tc filter add dev eth9 parent 1: prio 9 protocol ip handle 9 basic \
   action mirred egress redirect dev tun0

 (tun0 is a tun device. result: tun0 errornously gets the eth header
  instead of the iph)

Revise the push/pull logic of tcf_mirred_act() to not rely on the
skb_at_tc_ingress() vs tcf_mirred_act_wants_ingress() comparison, as it
does not cover all "pull" cases.

Instead, calculate whether the required action on the target device
requires the data to point at the network header, and compare this to
whether skb->data points to network header - and make the push/pull
adjustments as necessary.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Shmulik Ladkani <sladkani@proofpoint.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:35:32 -08:00
Eric Dumazet bb3d0b8bf5 net_sched: sch_fq: properly set sk->sk_pacing_status
If fq_classify() recycles a struct fq_flow because
a socket structure has been reallocated, we do not
set sk->sk_pacing_status immediately, but later if the
flow becomes detached.

This means that any flow requiring pacing (BBR, or SO_MAX_PACING_RATE)
might fallback to TCP internal pacing, which requires a per-socket
high resolution timer, and therefore more cpu cycles.

Fixes: 218af599fa ("tcp: internal implementation for pacing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26 15:35:09 -08:00
David S. Miller ec34c01575 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix endianness issue in flowtable TCP flags dissector,
   from Arnd Bergmann.

2) Extend flowtable test script with dnat rules, from Florian Westphal.

3) Reject padding in ebtables user entries and validate computed user
   offset, reported by syzbot, from Florian Westphal.

4) Fix endianness in nft_tproxy, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-26 13:11:40 -08:00
Taehee Yoo 3ed0a1d563 hsr: reset network header when supervision frame is created
The supervision frame is L2 frame.
When supervision frame is created, hsr module doesn't set network header.
If tap routine is enabled, dev_queue_xmit_nit() is called and it checks
network_header. If network_header pointer wasn't set(or invalid),
it resets network_header and warns.
In order to avoid unnecessary warning message, resetting network_header
is needed.

Test commands:
    ip netns add nst
    ip link add veth0 type veth peer name veth1
    ip link add veth2 type veth peer name veth3
    ip link set veth1 netns nst
    ip link set veth3 netns nst
    ip link set veth0 up
    ip link set veth2 up
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip netns exec nst ip link set veth1 up
    ip netns exec nst ip link set veth3 up
    ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3
    ip netns exec nst ip a a 192.168.100.2/24 dev hsr1
    ip netns exec nst ip link set hsr1 up
    tcpdump -nei veth0

Splat looks like:
[  175.852292][    C3] protocol 88fb is buggy, dev veth0

Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Taehee Yoo 92a35678ec hsr: fix a race condition in node list insertion and deletion
hsr nodes are protected by RCU and there is no write side lock.
But node insertions and deletions could be being operated concurrently.
So write side locking is needed.

Test commands:
    ip netns add nst
    ip link add veth0 type veth peer name veth1
    ip link add veth2 type veth peer name veth3
    ip link set veth1 netns nst
    ip link set veth3 netns nst
    ip link set veth0 up
    ip link set veth2 up
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip netns exec nst ip link set veth1 up
    ip netns exec nst ip link set veth3 up
    ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3
    ip netns exec nst ip a a 192.168.100.2/24 dev hsr1
    ip netns exec nst ip link set hsr1 up

    for i in {0..9}
    do
        for j in {0..9}
	do
	    for k in {0..9}
	    do
	        for l in {0..9}
		do
	        arping 192.168.100.2 -I hsr0 -s 00:01:3$i:4$j:5$k:6$l -c1 &
		done
	    done
	done
    done

Splat looks like:
[  236.066091][ T3286] list_add corruption. next->prev should be prev (ffff8880a5940300), but was ffff8880a5940d0.
[  236.069617][ T3286] ------------[ cut here ]------------
[  236.070545][ T3286] kernel BUG at lib/list_debug.c:25!
[  236.071391][ T3286] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  236.072343][ T3286] CPU: 0 PID: 3286 Comm: arping Tainted: G        W         5.5.0-rc1+ #209
[  236.073463][ T3286] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  236.074695][ T3286] RIP: 0010:__list_add_valid+0x74/0xd0
[  236.075499][ T3286] Code: 48 39 da 75 27 48 39 f5 74 36 48 39 dd 74 31 48 83 c4 08 b8 01 00 00 00 5b 5d c3 48 b
[  236.078277][ T3286] RSP: 0018:ffff8880aaa97648 EFLAGS: 00010286
[  236.086991][ T3286] RAX: 0000000000000075 RBX: ffff8880d4624c20 RCX: 0000000000000000
[  236.088000][ T3286] RDX: 0000000000000075 RSI: 0000000000000008 RDI: ffffed1015552ebf
[  236.098897][ T3286] RBP: ffff88809b53d200 R08: ffffed101b3c04f9 R09: ffffed101b3c04f9
[  236.099960][ T3286] R10: 00000000308769a1 R11: ffffed101b3c04f8 R12: ffff8880d4624c28
[  236.100974][ T3286] R13: ffff8880d4624c20 R14: 0000000040310100 R15: ffff8880ce17ee02
[  236.138967][ T3286] FS:  00007f23479fa680(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000
[  236.144852][ T3286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  236.145720][ T3286] CR2: 00007f4a14bab210 CR3: 00000000a61c6001 CR4: 00000000000606f0
[  236.146776][ T3286] Call Trace:
[  236.147222][ T3286]  hsr_add_node+0x314/0x490 [hsr]
[  236.153633][ T3286]  hsr_forward_skb+0x2b6/0x1bc0 [hsr]
[  236.154362][ T3286]  ? rcu_read_lock_sched_held+0x90/0xc0
[  236.155091][ T3286]  ? rcu_read_lock_bh_held+0xa0/0xa0
[  236.156607][ T3286]  hsr_dev_xmit+0x70/0xd0 [hsr]
[  236.157254][ T3286]  dev_hard_start_xmit+0x160/0x740
[  236.157941][ T3286]  __dev_queue_xmit+0x1961/0x2e10
[  236.158565][ T3286]  ? netdev_core_pick_tx+0x2e0/0x2e0
[ ... ]

Reported-by: syzbot+3924327f9ad5f4d2b343@syzkaller.appspotmail.com
Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Taehee Yoo 4c2d5e33dc hsr: rename debugfs file when interface name is changed
hsr interface has own debugfs file, which name is same with interface name.
So, interface name is changed, debugfs file name should be changed too.

Fixes: fc4ecaeebd ("net: hsr: add debugfs support for display node list")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Taehee Yoo c6c4ccd7f9 hsr: add hsr root debugfs directory
In current hsr code, when hsr interface is created, it creates debugfs
directory /sys/kernel/debug/<interface name>.
If there is same directory or file name in there, it fails.
In order to reduce possibility of failure of creation of debugfs,
this patch adds root directory.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1

Before this patch:
    /sys/kernel/debug/hsr0/node_table

After this patch:
    /sys/kernel/debug/hsr/hsr0/node_table

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Taehee Yoo 1d19e2d53e hsr: fix error handling routine in hsr_dev_finalize()
hsr_dev_finalize() is called to create new hsr interface.
There are some wrong error handling codes.

1. wrong checking return value of debugfs_create_{dir/file}.
These function doesn't return NULL. If error occurs in there,
it returns error pointer.
So, it should check error pointer instead of NULL.

2. It doesn't unregister interface if it fails to setup hsr interface.
If it fails to initialize hsr interface after register_netdevice(),
it should call unregister_netdevice().

3. Ignore failure of creation of debugfs
If creating of debugfs dir and file is failed, creating hsr interface
will be failed. But debugfs doesn't affect actual logic of hsr module.
So, ignoring this is more correct and this behavior is more general.

Fixes: c5a7591172 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Taehee Yoo 84bb59d773 hsr: avoid debugfs warning message when module is remove
When hsr module is being removed, debugfs_remove() is called to remove
both debugfs directory and file.

When module is being removed, module state is changed to
MODULE_STATE_GOING then exit() is called.
At this moment, module couldn't be held so try_module_get()
will be failed.

debugfs's open() callback tries to hold the module if .owner is existing.
If it fails, warning message is printed.

CPU0				CPU1
delete_module()
    try_stop_module()
    hsr_exit()			open() <-- WARNING
        debugfs_remove()

In order to avoid the warning message, this patch makes hsr module does
not set .owner. Unsetting .owner is safe because these are protected by
inode_lock().

Test commands:
    #SHELL1
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    while :
    do
        ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
	modprobe -rv hsr
    done

    #SHELL2
    while :
    do
        cat /sys/kernel/debug/hsr0/node_table
    done

Splat looks like:
[  101.223783][ T1271] ------------[ cut here ]------------
[  101.230309][ T1271] debugfs file owner did not clean up at exit: node_table
[  101.230380][ T1271] WARNING: CPU: 3 PID: 1271 at fs/debugfs/file.c:309 full_proxy_open+0x10f/0x650
[  101.233153][ T1271] Modules linked in: hsr(-) dummy veth openvswitch nsh nf_conncount nf_nat nf_conntrack nf_d]
[  101.237112][ T1271] CPU: 3 PID: 1271 Comm: cat Tainted: G        W         5.5.0-rc1+ #204
[  101.238270][ T1271] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  101.240379][ T1271] RIP: 0010:full_proxy_open+0x10f/0x650
[  101.241166][ T1271] Code: 48 c1 ea 03 80 3c 02 00 0f 85 c1 04 00 00 49 8b 3c 24 e8 04 86 7e ff 84 c0 75 2d 4c 8
[  101.251985][ T1271] RSP: 0018:ffff8880ca22fa38 EFLAGS: 00010286
[  101.273355][ T1271] RAX: dffffc0000000008 RBX: ffff8880cc6e6200 RCX: 0000000000000000
[  101.274466][ T1271] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8880c4dd5c14
[  101.275581][ T1271] RBP: 0000000000000000 R08: fffffbfff2922f5d R09: 0000000000000000
[  101.276733][ T1271] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffc0551bc0
[  101.277853][ T1271] R13: ffff8880c4059a48 R14: ffff8880be50a5e0 R15: ffffffff941adaa0
[  101.278956][ T1271] FS:  00007f8871cda540(0000) GS:ffff8880da800000(0000) knlGS:0000000000000000
[  101.280216][ T1271] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.282832][ T1271] CR2: 00007f88717cfd10 CR3: 00000000b9440005 CR4: 00000000000606e0
[  101.283974][ T1271] Call Trace:
[  101.285328][ T1271]  do_dentry_open+0x63c/0xf50
[  101.286077][ T1271]  ? open_proxy_open+0x270/0x270
[  101.288271][ T1271]  ? __x64_sys_fchdir+0x180/0x180
[  101.288987][ T1271]  ? inode_permission+0x65/0x390
[  101.289682][ T1271]  path_openat+0x701/0x2810
[  101.290294][ T1271]  ? path_lookupat+0x880/0x880
[  101.290957][ T1271]  ? check_chain_key+0x236/0x5d0
[  101.291676][ T1271]  ? __lock_acquire+0xdfe/0x3de0
[  101.292358][ T1271]  ? sched_clock+0x5/0x10
[  101.292962][ T1271]  ? sched_clock_cpu+0x18/0x170
[  101.293644][ T1271]  ? find_held_lock+0x39/0x1d0
[  101.305616][ T1271]  do_filp_open+0x17a/0x270
[  101.306061][ T1271]  ? may_open_dev+0xc0/0xc0
[ ... ]

Fixes: fc4ecaeebd ("net: hsr: add debugfs support for display node list")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-25 16:35:35 -08:00
Hangbin Liu 4d42df46d6 sit: do not confirm neighbor when do pmtu update
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:55 -08:00
Hangbin Liu 8247a79efa vti: do not confirm neighbor when do pmtu update
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

Although vti and vti6 are immune to this problem because they are IFF_NOARP
interfaces, as Guillaume pointed. There is still no sense to confirm neighbour
here.

v5: Update commit description.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:55 -08:00
Hangbin Liu 7a1592bcb1 tunnel: do not confirm neighbor when do pmtu update
When do tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

v5: No Change.
v4: Update commit description
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Fixes: 0dec879f63 ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Tested-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:55 -08:00
Hangbin Liu 675d76ad0a ip6_gre: do not confirm neighbor when do pmtu update
When we do ipv6 gre pmtu update, we will also do neigh confirm currently.
This will cause the neigh cache be refreshed and set to REACHABLE before
xmit.

But if the remote mac address changed, e.g. device is deleted and recreated,
we will not able to notice this and still use the old mac address as the neigh
cache is REACHABLE.

Fix this by disable neigh confirm when do pmtu update

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:54 -08:00
Hangbin Liu bd085ef678 net: add bool confirm_neigh parameter for dst_ops.update_pmtu
The MTU update code is supposed to be invoked in response to real
networking events that update the PMTU. In IPv6 PMTU update function
__ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
confirmed time.

But for tunnel code, it will call pmtu before xmit, like:
  - tnl_update_pmtu()
    - skb_dst_update_pmtu()
      - ip6_rt_update_pmtu()
        - __ip6_rt_update_pmtu()
          - dst_confirm_neigh()

If the tunnel remote dst mac address changed and we still do the neigh
confirm, we will not be able to update neigh cache and ping6 remote
will failed.

So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
should not be invoking dst_confirm_neigh() as we have no evidence
of successful two-way communication at this point.

On the other hand it is also important to keep the neigh reachability fresh
for TCP flows, so we cannot remove this dst_confirm_neigh() call.

To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
to choose whether we should do neigh update or not. I will add the parameter
in this patch and set all the callers to true to comply with the previous
way, and fix the tunnel code one by one on later patches.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Suggested-by: David Miller <davem@davemloft.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:54 -08:00
David S. Miller ff43ae4bd5 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl39UCAACgkQ+7dXa6fL
 C2sZhw//S7+XNXpA9HfwUO/Vic9jUFi4FRPkS+zqHUCJQpbC32+Gj1L13J++EEJm
 aIaf8yo5DSNIKRaja+d4UTmZFgXOOg64Jq/p9ZOltdomS+ZihGWoigKHxYQtLfin
 fsBVbPqMO+PSZbSqcwy7inMKfJ1EkFXiu5lhI5o4Swc3gw9w9HTui7u5c02Wol8L
 2ooI2hiuali9FSJ592UfvHoxK3auNUJ6oB2aA+EFuLhMLz+8hhnVCNELsaWoLDOh
 qZkI0D+X5w61MsYiND3jOB85pl+d8ADKPlMw/WM1hIwn/ODPvkqnGHVhOslB7/fh
 dBsNUcGhAsdGz9jBDRXcUv5ak4G/wL1IiV5CRJ0xSX6I5lTm+byIRE2rqtO5mkQg
 ESXCGGodbxFh7EIjjWskdTtusQN+O+v78GLue7h9KVLrsTvzQRT99BIAJLvT6rGM
 F6I/Zirj1ZbRhn7uwkr+hwoD4CJvtF/mvImg0P2iPQmUrhEUJkE5VRVIDf/OnBd0
 vGQJG7IKr5tUK+x3ztk34H3pqIakfQ03FX0mqdTFuoVYmaBlooSk9Do7xNTUHi2o
 2rC3R5S1alRBqFBOKIz5dL6SK8L1wcLCgSzC1pI+q9bcusX6yjW0b3vMn6tzCUHA
 eb1NM0jp7VXx5yN88pfi7U7JFxafcCMoKKBodvdLgnL06mfHnJc=
 =vwQ0
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20191220' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes

Here are a couple of bugfixes plus a patch that makes one of the bugfixes
easier:

 (1) Move the ping and mutex unlock on a new call from rxrpc_input_packet()
     into rxrpc_new_incoming_call(), which it calls.  This means the
     lock-unlock section is entirely within the latter function.  This
     simplifies patch (2).

 (2) Don't take the call->user_mutex at all in the softirq path.  Mutexes
     aren't allowed to be taken or released there and a patch was merged
     that caused a warning to be emitted every time this happened.  Looking
     at the code again, it looks like that taking the mutex isn't actually
     necessary, as the value of call->state will block access to the call.

 (3) Fix the incoming call path to check incoming calls earlier to reject
     calls to RPC services for which we don't have a security key of the
     appropriate class.  This avoids an assertion failure if YFS tries
     making a secure call to the kafs cache manager RPC service.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 16:12:47 -08:00
Marcelo Ricardo Leitner 61d5d40628 sctp: fix err handling of stream initialization
The fix on 951c6db954 fixed the issued reported there but introduced
another. When the allocation fails within sctp_stream_init() it is
okay/necessary to free the genradix. But it is also called when adding
new streams, from sctp_send_add_streams() and
sctp_process_strreset_addstrm_in() and in those situations it cannot
just free the genradix because by then it is a fully operational
association.

The fix here then is to only free the genradix in sctp_stream_init()
and on those other call sites  move on with what it already had and let
the subsequent error handling to handle it.

Tested with the reproducers from this report and the previous one,
with lksctp-tools and sctp-tests.

Reported-by: syzbot+9a1bc632e78a1a98488b@syzkaller.appspotmail.com
Fixes: 951c6db954 ("sctp: fix memleak on err handling of stream initialization")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 16:07:10 -08:00
Antonio Messina feed8a4fc9 udp: fix integer overflow while computing available space in sk_rcvbuf
When the size of the receive buffer for a socket is close to 2^31 when
computing if we have enough space in the buffer to copy a packet from
the queue to the buffer we might hit an integer overflow.

When an user set net.core.rmem_default to a value close to 2^31 UDP
packets are dropped because of this overflow. This can be visible, for
instance, with failure to resolve hostnames.

This can be fixed by casting sk_rcvbuf (which is an int) to unsigned
int, similarly to how it is done in TCP.

Signed-off-by: Antonio Messina <amessina@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 14:52:39 -08:00
Linus Torvalds 78bac77b52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

 7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

 8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
    RCU grace period. From Eric Dumazet.

 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

16) Reject invalid MTU values in stmmac, from Jose Abreu.

17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

20) Disable hardware GRO when XDP is attached to qede, frm Manish
    Chopra.

21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
  sfc: Include XDP packet headroom in buffer step size.
  sfc: fix channel allocation with brute force
  net: dst: Force 4-byte alignment of dst_metrics
  selftests: pmtu: fix init mtu value in description
  hv_netvsc: Fix unwanted rx_table reset
  net: phy: ensure that phy IDs are correctly typed
  mod_devicetable: fix PHY module format
  qede: Disable hardware gro when xdp prog is installed
  net: ena: fix issues in setting interrupt moderation params in ethtool
  net: ena: fix default tx interrupt moderation interval
  net/smc: unregister ib devices in reboot_event
  net: stmmac: platform: Fix MDIO init for platforms without PHY
  llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
  net: hisilicon: Fix a BUG trigered by wrong bytes_compl
  net: dsa: ksz: use common define for tag len
  s390/qeth: don't return -ENOTSUPP to userspace
  s390/qeth: fix promiscuous mode after reset
  s390/qeth: handle error due to unsupported transport mode
  cxgb4: fix refcount init for TC-MQPRIO offload
  tc-testing: initial tdc selftests for cls_u32
  ...
2019-12-22 09:54:33 -08:00
Karsten Graul 28a3b8408f net/smc: unregister ib devices in reboot_event
In the reboot_event handler, unregister the ib devices and enable
the IB layer to release the devices before the reboot.

Fixes: a33a803cfe ("net/smc: guarantee removal of link groups in reboot")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 21:31:19 -08:00
Chan Shu Tak, Alex af1c0e4e00 llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
When a frame with NULL DSAP is received, llc_station_rcv is called.
In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL
XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1
when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a
NULL XID response is returned unexpectedly, e.g. when the incoming frame is
a NULL TEST command.

To fix the error, simply remove the conditional operator.

A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed.

Signed-off-by: Chan Shu Tak, Alex <alexchan@task.com.hk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 21:19:36 -08:00
Michael Grzeschik 4249c507f4 net: dsa: ksz: use common define for tag len
Remove special taglen define KSZ8795_INGRESS_TAG_LEN
and use generic KSZ_INGRESS_TAG_LEN instead.

Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 21:06:49 -08:00
David Howells 063c60d391 rxrpc: Fix missing security check on incoming calls
Fix rxrpc_new_incoming_call() to check that we have a suitable service key
available for the combination of service ID and security class of a new
incoming call - and to reject calls for which we don't.

This causes an assertion like the following to appear:

	rxrpc: Assertion failed - 6(0x6) == 12(0xc) is false
	kernel BUG at net/rxrpc/call_object.c:456!

Where call->state is RXRPC_CALL_SERVER_SECURING (6) rather than
RXRPC_CALL_COMPLETE (12).

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-12-20 16:21:32 +00:00
David Howells 13b7955a02 rxrpc: Don't take call->user_mutex in rxrpc_new_incoming_call()
Standard kernel mutexes cannot be used in any way from interrupt or softirq
context, so the user_mutex which manages access to a call cannot be a mutex
since on a new call the mutex must start off locked and be unlocked within
the softirq handler to prevent userspace interfering with a call we're
setting up.

Commit a0855d24fc ("locking/mutex: Complain
upon mutex API misuse in IRQ contexts") causes big warnings to be splashed
in dmesg for each a new call that comes in from the server.  Whilst it
*seems* like it should be okay, since the accept path uses trylock, there
are issues with PI boosting and marking the wrong task as the owner.

Fix this by not taking the mutex in the softirq path at all.  It's not
obvious that there should be any need for it as the state is set before the
first notification is generated for the new call.

There's also no particular reason why the link-assessing ping should be
triggered inside the mutex.  It's not actually transmitted there anyway,
but rather it has to be deferred to a workqueue.

Further, I don't think that there's any particular reason that the socket
notification needs to be done from within rx->incoming_lock, so the amount
of time that lock is held can be shortened too and the ping prepared before
the new call notification is sent.

Fixes: 540b1c48c3 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Peter Zijlstra (Intel) <peterz@infradead.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Will Deacon <will@kernel.org>
cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-20 16:20:56 +00:00
David Howells f33121cbe9 rxrpc: Unlock new call in rxrpc_new_incoming_call() rather than the caller
Move the unlock and the ping transmission for a new incoming call into
rxrpc_new_incoming_call() rather than doing it in the caller.  This makes
it clearer to see what's going on.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
cc: Ingo Molnar <mingo@redhat.com>
cc: Will Deacon <will@kernel.org>
cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-20 16:20:48 +00:00
Davide Caratti 275c44aa19 net/sched: cls_u32: fix refcount leak in the error path of u32_change()
when users replace cls_u32 filters with new ones having wrong parameters,
so that u32_change() fails to validate them, the kernel doesn't roll-back
correctly, and leaves semi-configured rules.

Fix this in u32_walk(), avoiding a call to the walker function on filters
that don't have a match rule connected. The side effect is, these "empty"
filters are not even dumped when present; but that shouldn't be a problem
as long as we are restoring the original behaviour, where semi-configured
filters were not even added in the error path of u32_change().

Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-19 17:53:05 -08:00
Phil Sutter 8cb4ec44de netfilter: nft_tproxy: Fix port selector on Big Endian
On Big Endian architectures, u16 port value was extracted from the wrong
parts of u32 sreg_port, just like commit 10596608c4 ("netfilter:
nf_tables: fix mismatch in big-endian system") describes.

Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-20 02:12:28 +01:00
Florian Westphal e608f631f0 netfilter: ebtables: compat: reject all padding in matches/watchers
syzbot reported following splat:

BUG: KASAN: vmalloc-out-of-bounds in size_entry_mwt net/bridge/netfilter/ebtables.c:2063 [inline]
BUG: KASAN: vmalloc-out-of-bounds in compat_copy_entries+0x128b/0x1380 net/bridge/netfilter/ebtables.c:2155
Read of size 4 at addr ffffc900004461f4 by task syz-executor267/7937

CPU: 1 PID: 7937 Comm: syz-executor267 Not tainted 5.5.0-rc1-syzkaller #0
 size_entry_mwt net/bridge/netfilter/ebtables.c:2063 [inline]
 compat_copy_entries+0x128b/0x1380 net/bridge/netfilter/ebtables.c:2155
 compat_do_replace+0x344/0x720 net/bridge/netfilter/ebtables.c:2249
 compat_do_ebt_set_ctl+0x22f/0x27e net/bridge/netfilter/ebtables.c:2333
 [..]

Because padding isn't considered during computation of ->buf_user_offset,
"total" is decremented by fewer bytes than it should.

Therefore, the first part of

if (*total < sizeof(*entry) || entry->next_offset < sizeof(*entry))

will pass, -- it should not have.  This causes oob access:
entry->next_offset is past the vmalloced size.

Reject padding and check that computed user offset (sum of ebt_entry
structure plus all individual matches/watchers/targets) is same
value that userspace gave us as the offset of the next entry.

Reported-by: syzbot+f68108fed972453a0ad4@syzkaller.appspotmail.com
Fixes: 81e675c227 ("netfilter: ebtables: add CONFIG_COMPAT support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-20 02:12:27 +01:00
Arnd Bergmann c9b3b8207b netfilter: nf_flow_table: fix big-endian integer overflow
In some configurations, gcc reports an integer overflow:

net/netfilter/nf_flow_table_offload.c: In function 'nf_flow_rule_match':
net/netfilter/nf_flow_table_offload.c:80:21: error: unsigned conversion from 'int' to '__be16' {aka 'short unsigned int'} changes value from '327680' to '0' [-Werror=overflow]
   mask->tcp.flags = TCP_FLAG_RST | TCP_FLAG_FIN;
                     ^~~~~~~~~~~~

From what I can tell, we want the upper 16 bits of these constants,
so they need to be shifted in cpu-endian mode.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-20 02:12:18 +01:00
David S. Miller 0fd260056e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-12-19

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 269 insertions(+), 108 deletions(-).

The main changes are:

1) Fix lack of synchronization between xsk wakeup and destroying resources
   used by xsk wakeup, from Maxim Mikityanskiy.

2) Fix pruning with tail call patching, untrack programs in case of verifier
   error and fix a cgroup local storage tracking bug, from Daniel Borkmann.

3) Fix clearing skb->tstamp in bpf_redirect() when going from ingress to
   egress which otherwise cause issues e.g. on fq qdisc, from Lorenz Bauer.

4) Fix compile warning of unused proc_dointvec_minmax_bpf_restricted() when
   only cBPF is present, from Alexander Lobakin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-19 14:20:47 -08:00
Alexander Lobakin 1148f9adbe net, sysctl: Fix compiler warning when only cBPF is present
proc_dointvec_minmax_bpf_restricted() has been firstly introduced
in commit 2e4a30983b ("bpf: restrict access to core bpf sysctls")
under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in
ede95a63b5 ("bpf: add bpf_jit_limit knob to restrict unpriv
allocations"), because a new sysctl, bpf_jit_limit, made use of it.
Finally, this parameter has become long instead of integer with
fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
and thus, a new proc_dolongvec_minmax_bpf_restricted() has been
added.

With this last change, we got back to that
proc_dointvec_minmax_bpf_restricted() is used only under
CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been
brought back.

So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n
since v4.20 we have:

  CC      net/core/sysctl_net_core.o
net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
  292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.

Fixes: fdadd04931 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
2019-12-19 17:17:51 +01:00
Maxim Mikityanskiy 0687068208 xsk: Add rcu_read_lock around the XSK wakeup
The XSK wakeup callback in drivers makes some sanity checks before
triggering NAPI. However, some configuration changes may occur during
this function that affect the result of those checks. For example, the
interface can go down, and all the resources will be destroyed after the
checks in the wakeup function, but before it attempts to use these
resources. Wrap this callback in rcu_read_lock to allow driver to
synchronize_rcu before actually destroying the resources.

xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
wrapped into the RCU lock. After this commit, xsk_poll starts using
xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
ndo_xsk_wakeup should be called. It also fixes a bug introduced with the
need_wakeup feature: a non-zero-copy socket may be used with a driver
supporting zero-copy, and in this case ndo_xsk_wakeup should not be
called, so the xs->zc check is the correct one.

Fixes: 77cd0d7b3f ("xsk: add support for need_wakeup flag in AF_XDP rings")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com
2019-12-19 16:20:48 +01:00
Jia-Ju Bai b7ac893652 net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive()
The kernel may sleep while holding a spinlock.
The function call path (from bottom to top) in Linux 4.19 is:

net/nfc/nci/uart.c, 349:
	nci_skb_alloc in nci_uart_default_recv_buf
net/nfc/nci/uart.c, 255:
	(FUNC_PTR)nci_uart_default_recv_buf in nci_uart_tty_receive
net/nfc/nci/uart.c, 254:
	spin_lock in nci_uart_tty_receive

nci_skb_alloc(GFP_KERNEL) can sleep at runtime.
(FUNC_PTR) means a function pointer is called.

To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC for
nci_skb_alloc().

This bug is found by a static analysis tool STCheck written by myself.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18 11:57:33 -08:00
Jouni Hogander ddd9b5e3e7 net-sysfs: Call dev_hold always in rx_queue_add_kobject
Dev_hold has to be called always in rx_queue_add_kobject.
Otherwise usage count drops below 0 in case of failure in
kobject_init_and_add.

Fixes: b8eb718348 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
Reported-by: syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Miller <davem@davemloft.net>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17 22:57:11 -08:00
Ben Dooks (Codethink) 4e2ce6e550 net: dsa: make unexported dsa_link_touch() static
dsa_link_touch() is not exported, or defined outside of the
file it is in so make it static to avoid the following warning:

net/dsa/dsa2.c:127:17: warning: symbol 'dsa_link_touch' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17 22:40:39 -08:00
Eric Dumazet 7c68fa2bdd net: annotate lockless accesses to sk->sk_pacing_shift
sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future syzbot complains.

This might also avoid unexpected false sharing
in sk_pacing_shift_update(), as the compiler
could remove the conditional check and always
write over sk->sk_pacing_shift :

if (sk->sk_pacing_shift != val)
	sk->sk_pacing_shift = val;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17 22:09:52 -08:00
Marcelo Ricardo Leitner 951c6db954 sctp: fix memleak on err handling of stream initialization
syzbot reported a memory leak when an allocation fails within
genradix_prealloc() for output streams. That's because
genradix_prealloc() leaves initialized members initialized when the
issue happens and SCTP stack will abort the current initialization but
without cleaning up such members.

The fix here is to always call genradix_free() when genradix_prealloc()
fails, for output and also input streams, as it suffers from the same
issue.

Reported-by: syzbot+772d9e36c490b18d51d1@syzkaller.appspotmail.com
Fixes: 2075e50caf ("sctp: convert to genradix")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17 21:58:37 -08:00
David S. Miller ad125c6c05 A handful of fixes:
* disable AQL on most drivers, addressing the iwlwifi issues
  * fix double-free on network namespace changes
  * fix TID field in frames injected through monitor interfaces
  * fix ieee80211_calc_rx_airtime()
  * fix NULL pointer dereference in rfkill (and remove BUG_ON)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl33THYACgkQB8qZga/f
 l8QSsw/+MW4NiXgjFNIaTfyPocFCLi45Efdizef3D+T2MsAD279DFILLFc4Q08Tt
 kL2CxOaD2lrrrPXTa++vkcaSBBQtgRRFvPQPycwoju9QkTuEQ31wFtXSzeCdSFmx
 vLVaY+gMvPjw6HejWqouPlm1hBaA0jqZOCjwq3IWj0spDR/FwJ0HwXSzzEhUs7FV
 1097Q9i7kLDZjdMvUUVnKi8SyWPL8TMPfXxyGPOsSbMPG5QAYj3odfb7FtsZLYgD
 SwWafp6nroUfEDi3jk+QNEuJB4on6iAVEJxbltDQWqsBXO76CWVAezh9KqiDtzIt
 Ay2YtTyOUTZUTPR8lZnoiTvR0GLzeNybwT5BQ9COO1tCD4yB8y1/cpa8oJxv/YRB
 xekCcNMPDFqIwtrY4UKbTEuyCbf78uVO8cUYdlb4ZUUvLKFP2LiD63InyWAtvCdu
 N21mzausAUzy65j5AJ7IIut7iFrcNEQ2qQtQuECGEqmu9uqHD4S3e9MQYd6Qx429
 uXdWbtyoqnYXMaEhF4Zy+DNz5vELNUkP0Lv9sJ6ihALQSXWXwkiz+p7+eYsdBdZy
 mJslJs+sYCgt0y46xzUfAcwvEdAOGkBeF91gCHSJ7q9g8UCrnZ4SN+OCx93YGnvW
 w1AQgZ7VW2aL4i8lnfu6bd86K04M/AbwdRn2OL6Ug4LmmFvHUT8=
 =tvTq
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2019-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A handful of fixes:
 * disable AQL on most drivers, addressing the iwlwifi issues
 * fix double-free on network namespace changes
 * fix TID field in frames injected through monitor interfaces
 * fix ieee80211_calc_rx_airtime()
 * fix NULL pointer dereference in rfkill (and remove BUG_ON)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16 19:26:11 -08:00
Stefano Garzarella 4aaf596148 vsock/virtio: add WARN_ON check on virtio_transport_get_ops()
virtio_transport_get_ops() and virtio_transport_send_pkt_info()
can only be used on connecting/connected sockets, since a socket
assigned to a transport is required.

This patch adds a WARN_ON() on virtio_transport_get_ops() to check
this requirement, a comment and a returned error on
virtio_transport_send_pkt_info(),

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16 16:07:12 -08:00