1
0
Fork 0
Commit Graph

707207 Commits (0a90e251988ceedc528c8db98f25b051cf190f44)

Author SHA1 Message Date
Gavin Shan 0a90e25198 net/ncsi: Fix length of GVI response packet
The length of GVI (GetVersionInfo) response packet should be 40 instead
of 36. This issue was found from /sys/kernel/debug/ncsi/eth0/stats.

 # ethtool --ncsi eth0 swstats
     :
 RESPONSE     OK       TIMEOUT  ERROR
 =======================================
 GVI          0        0        2

With this applied, no error reported on GVI response packets:

 # ethtool --ncsi eth0 swstats
     :
 RESPONSE     OK       TIMEOUT  ERROR
 =======================================
 GVI          2        0        0

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:56:38 +01:00
Gavin Shan 52b4c8627f net/ncsi: Enforce failover on link monitor timeout
The NCSI channel has been configured to provide service if its link
monitor timer is enabled, regardless of its state (inactive or active).
So the timeout event on the link monitor indicates the out-of-service
on that channel, for which a failover is needed.

This sets NCSI_DEV_RESHUFFLE flag to enforce failover on link monitor
timeout, regardless the channel's original state (inactive or active).
Also, the link is put into "down" state to give the failing channel
lowest priority when selecting for the active channel. The state of
failing channel should be set to active in order for deinitialization
and failover to be done.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:56:38 +01:00
Gavin Shan 100ef01f3e net/ncsi: Disable HWA mode when no channels are found
When there are no NCSI channels probed, HWA (Hardware Arbitration)
mode is enabled. It's not correct because HWA depends on the fact:
NCSI channels exist and all of them support HWA mode. This disables
HWA when no channels are probed.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:56:38 +01:00
Samuel Mendoza-Jonas 0795fb2021 net/ncsi: Stop monitor if channel times out or is inactive
ncsi_channel_monitor() misses stopping the channel monitor in several
places that it should, causing a WARN_ON_ONCE() to trigger when the
monitor is re-started later, eg:

[  459.040000] WARNING: CPU: 0 PID: 1093 at net/ncsi/ncsi-manage.c:269 ncsi_start_channel_monitor+0x7c/0x90
[  459.040000] CPU: 0 PID: 1093 Comm: kworker/0:3 Not tainted 4.10.17-gaca2fdd #140
[  459.040000] Hardware name: ASpeed SoC
[  459.040000] Workqueue: events ncsi_dev_work
[  459.040000] [<80010094>] (unwind_backtrace) from [<8000d950>] (show_stack+0x20/0x24)
[  459.040000] [<8000d950>] (show_stack) from [<801dbf70>] (dump_stack+0x20/0x28)
[  459.040000] [<801dbf70>] (dump_stack) from [<80018d7c>] (__warn+0xe0/0x108)
[  459.040000] [<80018d7c>] (__warn) from [<80018e70>] (warn_slowpath_null+0x30/0x38)
[  459.040000] [<80018e70>] (warn_slowpath_null) from [<803f6a08>] (ncsi_start_channel_monitor+0x7c/0x90)
[  459.040000] [<803f6a08>] (ncsi_start_channel_monitor) from [<803f7664>] (ncsi_configure_channel+0xdc/0x5fc)
[  459.040000] [<803f7664>] (ncsi_configure_channel) from [<803f8160>] (ncsi_dev_work+0xac/0x474)
[  459.040000] [<803f8160>] (ncsi_dev_work) from [<8002d244>] (process_one_work+0x1e0/0x450)
[  459.040000] [<8002d244>] (process_one_work) from [<8002d510>] (worker_thread+0x5c/0x570)
[  459.040000] [<8002d510>] (worker_thread) from [<80033614>] (kthread+0x124/0x164)
[  459.040000] [<80033614>] (kthread) from [<8000a5e8>] (ret_from_fork+0x14/0x2c)

This also updates the monitor instead of just returning if
ncsi_xmit_cmd() fails to send the get-link-status command so that the
monitor properly times out.

Fixes: e6f44ed6d0 "net/ncsi: Package and channel management"

Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:56:38 +01:00
Samuel Mendoza-Jonas 6850d0f8b2 net/ncsi: Fix AEN HNCDSC packet length
Correct the value of the HNCDSC AEN packet.
Fixes: 7a82ecf4cf "net/ncsi: NCSI AEN packet handler"

Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:56:38 +01:00
Eric Dumazet 509c7a1ecc packet: avoid panic in packet_getsockopt()
syzkaller got crashes in packet_getsockopt() processing
PACKET_ROLLOVER_STATS command while another thread was managing
to change po->rollover

Using RCU will fix this bug. We might later add proper RCU annotations
for sparse sake.

In v2: I replaced kfree(rollover) in fanout_add() to kfree_rcu()
variant, as spotted by John.

Fixes: a9b6391814 ("packet: rollover statistics")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: John Sperbeck <jsperbeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:51:34 +01:00
Eric Dumazet c92e8c02fe tcp/dccp: fix ireq->opt races
syzkaller found another bug in DCCP/TCP stacks [1]

For the reasons explained in commit ce1050089c ("tcp/dccp: fix
ireq->pktopts race"), we need to make sure we do not access
ireq->opt unless we own the request sock.

Note the opt field is renamed to ireq_opt to ease grep games.

[1]
BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
 ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
 tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
 tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
 tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
 __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
 tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
 tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
 tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
 tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x40c341
RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

Allocated by task 3295:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 __do_kmalloc mm/slab.c:3725 [inline]
 __kmalloc+0x162/0x760 mm/slab.c:3734
 kmalloc include/linux/slab.h:498 [inline]
 tcp_v4_save_options include/net/tcp.h:1962 [inline]
 tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
 tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
 tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
 tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
 tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
 tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 3306:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3503 [inline]
 kfree+0xca/0x250 mm/slab.c:3820
 inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
 __sk_destruct+0xfd/0x910 net/core/sock.c:1560
 sk_destruct+0x47/0x80 net/core/sock.c:1595
 __sk_free+0x57/0x230 net/core/sock.c:1603
 sk_free+0x2a/0x40 net/core/sock.c:1614
 sock_put include/net/sock.h:1652 [inline]
 inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
 tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
 tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:464 [inline]
 ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
 NF_HOOK include/linux/netfilter.h:249 [inline]
 ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
 netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
 netif_receive_skb+0xae/0x390 net/core/dev.c:4611
 tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
 tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
 tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
 call_write_iter include/linux/fs.h:1770 [inline]
 new_sync_write fs/read_write.c:468 [inline]
 __vfs_write+0x68a/0x970 fs/read_write.c:481
 vfs_write+0x18f/0x510 fs/read_write.c:543
 SYSC_write fs/read_write.c:588 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:580
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 01:33:19 +01:00
David S. Miller e95c6cf447 Merge branch 'sockmap-fixes'
John Fastabend says:

====================
sockmap fixes for net

The following implements a set of fixes for sockmap and changes the
API slightly in a few places to reduce preempt_disable/enable scope.
We do this here in net because it requires an API change and this
avoids getting stuck with legacy API going forward.

The short description:

Access to skb mark is removed, it is problematic when we add
features in the future because mark is a union and used by the
TCP/socket code internally. We don't want to expose this to the
BPF programs or let programs change the values.

The other change is caching metadata in the skb itself between
when the BPF program returns a redirect code and the core code
implements the redirect. This avoids having per cpu metadata.

Finally, tighten restriction on using sockmap to CAP_NET_ADMIN and
only SOCK_STREAM sockets.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:30 +01:00
John Fastabend 9ef2a8cd5c bpf: require CAP_NET_ADMIN when using devmap
Devmap is used with XDP which requires CAP_NET_ADMIN so lets also
make CAP_NET_ADMIN required to use the map.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
John Fastabend fb50df8d32 bpf: require CAP_NET_ADMIN when using sockmap maps
Restrict sockmap to CAP_NET_ADMIN.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
John Fastabend f7e9cb1ecb bpf: remove mark access for SK_SKB program types
The skb->mark field is a union with reserved_tailroom which is used
in the TCP code paths from stream memory allocation. Allowing SK_SKB
programs to set this field creates a conflict with future code
optimizations, such as "gifting" the skb to the egress path instead
of creating a new skb and doing a memcpy.

Because we do not have a released version of SK_SKB yet lets just
remove it for now. A more appropriate scratch pad to use at the
socket layer is dev_scratch, but lets add that in future kernels
when needed.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
John Fastabend 34f79502bb bpf: avoid preempt enable/disable in sockmap using tcp_skb_cb region
SK_SKB BPF programs are run from the socket/tcp context but early in
the stack before much of the TCP metadata is needed in tcp_skb_cb. So
we can use some unused fields to place BPF metadata needed for SK_SKB
programs when implementing the redirect function.

This allows us to drop the preempt disable logic. It does however
require an API change so sk_redirect_map() has been updated to
additionally provide ctx_ptr to skb. Note, we do however continue to
disable/enable preemption around actual BPF program running to account
for map updates.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
John Fastabend 435bf0d3f9 bpf: enforce TCP only support for sockmap
Only TCP sockets have been tested and at the moment the state change
callback only handles TCP sockets. This adds a check to ensure that
sockets actually being added are TCP sockets.

For net-next we can consider UDP support.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:01:29 +01:00
Xin Long 1cc276cec9 sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
Now sctp processes icmp redirect packet in sctp_icmp_redirect where
it calls sctp_transport_dst_check in which tp->dst can be released.

The problem is before calling sctp_transport_dst_check, it doesn't
check sock_owned_by_user, which means tp->dst could be freed while
a process is accessing it with owning the socket.

An use-after-free issue could be triggered by this.

This patch is to fix it by checking sock_owned_by_user before calling
sctp_transport_dst_check in sctp_icmp_redirect, so that it would not
release tp->dst if users still hold sock lock.

Besides, the same issue fixed in commit 45caeaa5ac ("dccp/tcp: fix
routing redirect race") on sctp also needs this check.

Fixes: 55be7a9c60 ("ipv4: Add redirect support to all protocol icmp error handlers")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 12:53:45 +01:00
Xin Long df80cd9b28 sctp: do not peel off an assoc from one netns to another one
Now when peeling off an association to the sock in another netns, all
transports in this assoc are not to be rehashed and keep use the old
key in hashtable.

As a transport uses sk->net as the hash key to insert into hashtable,
it would miss removing these transports from hashtable due to the new
netns when closing the sock and all transports are being freeed, then
later an use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.

This is a very old issue since very beginning, ChunYu found it with
syzkaller fuzz testing with this series:

  socket$inet6_sctp()
  bind$inet6()
  sendto$inet6()
  unshare(0x40000000)
  getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST()
  getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF()

This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transport would not
go out-sync with the key in hashtable.

Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.

Reported-by: ChunYu Wang <chunwang@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:16:07 +01:00
David S. Miller 4bbb508348 Merge branch 'bpf-Fix-for-BPF-devmap-percpu-allocation-splat'
Daniel Borkmann says:

====================
bpf: Fix for BPF devmap percpu allocation splat

The set fixes a splat in devmap percpu allocation when we alloc
the flush bitmap. Patch 1 is a prerequisite for the fix in patch 2,
patch 1 is rather small, so if this could be routed via -net, for
example, with Tejun's Ack that would be good. Patch 3 gets rid of
remaining PCPU_MIN_UNIT_SIZE checks, which are percpu allocator
internals and should not be used.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:14:11 +01:00
Daniel Borkmann bc6d5031b4 bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations
PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
allocator. Given we support __GFP_NOWARN now, lets just let
the allocation request fail naturally instead. The two call
sites from BPF mistakenly assumed __GFP_NOWARN would work, so
no changes needed to their actual __alloc_percpu_gfp() calls
which use the flag already.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:13:50 +01:00
Daniel Borkmann 82f8dd28bd bpf: fix splat for illegal devmap percpu allocation
It was reported that syzkaller was able to trigger a splat on
devmap percpu allocation due to illegal/unsupported allocation
request size passed to __alloc_percpu():

  [   70.094249] illegal size (32776) or align (8) for percpu allocation
  [   70.094256] ------------[ cut here ]------------
  [   70.094259] WARNING: CPU: 3 PID: 3451 at mm/percpu.c:1365 pcpu_alloc+0x96/0x630
  [...]
  [   70.094325] Call Trace:
  [   70.094328]  __alloc_percpu_gfp+0x12/0x20
  [   70.094330]  dev_map_alloc+0x134/0x1e0
  [   70.094331]  SyS_bpf+0x9bc/0x1610
  [   70.094333]  ? selinux_task_setrlimit+0x5a/0x60
  [   70.094334]  ? security_task_setrlimit+0x43/0x60
  [   70.094336]  entry_SYSCALL_64_fastpath+0x1a/0xa5

This was due to too large max_entries for the map such that we
surpassed the upper limit of PCPU_MIN_UNIT_SIZE. It's fine to
fail naturally here, so switch to __alloc_percpu_gfp() and pass
__GFP_NOWARN instead.

Fixes: 11393cc9b9 ("xdp: Add batching support to redirect map")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Shankara Pailoor <sp3485@columbia.edu>
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:13:50 +01:00
Daniel Borkmann 0ea7eeec24 mm, percpu: add support for __GFP_NOWARN flag
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.

The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.

The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:13:49 +01:00
David S. Miller 3fd3b03b43 Merge branch 'ena-fixes'
Netanel Belgazal says:

====================
ENA ethernet driver bug fixes

Some fixes for ENA ethernet driver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 12:49:16 +01:00
Netanel Belgazal a59df39676 net: ena: fix wrong max Tx/Rx queues on ethtool
ethtool ena_get_channels() expose the max number of queues as the max
number of queues ENA supports (128 queues) and not the actual number
of created queues.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 12:49:15 +01:00
Netanel Belgazal 411838e7b4 net: ena: fix rare kernel crash when bar memory remap fails
This failure is rare and only found on testing where deliberately fail
devm_ioremap()

[  451.170464] ena 0000:04:00.0: failed to remap regs bar
451.170549] Workqueue: pciehp-1 pciehp_power_thread
[  451.170551] task: ffff88085a5f2d00 task.stack: ffffc9000756c000
[  451.170552] RIP: 0010:devm_iounmap+0x2d/0x40
[  451.170553] RSP: 0018:ffffc9000756fac0 EFLAGS: 00010282
[  451.170554] RAX: 00000000fffffffe RBX: 0000000000000000 RCX:
0000000000000000
[  451.170555] RDX: ffffffff813a7e00 RSI: 0000000000000282 RDI:
0000000000000282
[  451.170556] RBP: ffffc9000756fac8 R08: 00000000fffffffe R09:
00000000000009b7
[  451.170557] R10: 0000000000000005 R11: 00000000000009b6 R12:
ffff880856c9d0a0
[  451.170558] R13: ffffc9000f5c90c0 R14: ffff880856c9d0a0 R15:
0000000000000028
[  451.170559] FS:  0000000000000000(0000) GS:ffff88085f400000(0000)
knlGS:0000000000000000
[  451.170560] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  451.170561] CR2: 00007f169038b000 CR3: 0000000001c09000 CR4:
00000000003406f0
[  451.170562] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  451.170562] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  451.170563] Call Trace:
[  451.170572]  ena_release_bars.isra.48+0x34/0x60 [ena]
[  451.170574]  ena_probe+0x144/0xd90 [ena]
[  451.170579]  ? ida_simple_get+0x98/0x100
[  451.170585]  ? kernfs_next_descendant_post+0x40/0x50
[  451.170591]  local_pci_probe+0x45/0xa0
[  451.170592]  pci_device_probe+0x157/0x180
[  451.170599]  driver_probe_device+0x2a8/0x460
[  451.170600]  __device_attach_driver+0x7e/0xe0
[  451.170602]  ? driver_allows_async_probing+0x30/0x30
[  451.170603]  bus_for_each_drv+0x68/0xb0
[  451.170605]  __device_attach+0xdd/0x160
[  451.170607]  device_attach+0x10/0x20
[  451.170610]  pci_bus_add_device+0x4f/0xa0
[  451.170611]  pci_bus_add_devices+0x39/0x70
[  451.170613]  pciehp_configure_device+0x96/0x120
[  451.170614]  pciehp_enable_slot+0x1b3/0x290
[  451.170616]  pciehp_power_thread+0x3b/0xb0
[  451.170622]  process_one_work+0x149/0x360
[  451.170623]  worker_thread+0x4d/0x3c0
[  451.170626]  kthread+0x109/0x140
[  451.170627]  ? rescuer_thread+0x380/0x380
[  451.170628]  ? kthread_park+0x60/0x60
[  451.170632]  ret_from_fork+0x25/0x30

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 12:49:15 +01:00
Netanel Belgazal cd7aea1875 net: ena: reduce the severity of some printouts
Decrease log level of checksum errors as these messages can be
triggered remotely by bad packets.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 12:49:15 +01:00
Jakub Kicinski 28e33f9d78 bpf: disallow arithmetic operations on context pointer
Commit f1174f77b5 ("bpf/verifier: rework value tracking")
removed the crafty selection of which pointer types are
allowed to be modified.  This is OK for most pointer types
since adjust_ptr_min_max_vals() will catch operations on
immutable pointers.  One exception is PTR_TO_CTX which is
now allowed to be offseted freely.

The intent of aforementioned commit was to allow context
access via modified registers.  The offset passed to
->is_valid_access() verifier callback has been adjusted
by the value of the variable offset.

What is missing, however, is taking the variable offset
into account when the context register is used.  Or in terms
of the code adding the offset to the value passed to the
->convert_ctx_access() callback.  This leads to the following
eBPF user code:

     r1 += 68
     r0 = *(u32 *)(r1 + 8)
     exit

being translated to this in kernel space:

   0: (07) r1 += 68
   1: (61) r0 = *(u32 *)(r1 +180)
   2: (95) exit

Offset 8 is corresponding to 180 in the kernel, but offset
76 is valid too.  Verifier will "accept" access to offset
68+8=76 but then "convert" access to offset 8 as 180.
Effective access to offset 248 is beyond the kernel context.
(This is a __sk_buff example on a debug-heavy kernel -
packet mark is 8 -> 180, 76 would be data.)

Dereferencing the modified context pointer is not as easy
as dereferencing other types, because we have to translate
the access to reading a field in kernel structures which is
usually at a different offset and often of a different size.
To allow modifying the pointer we would have to make sure
that given eBPF instruction will always access the same
field or the fields accessed are "compatible" in terms of
offset and size...

Disallow dereferencing modified context pointers and add
to selftests the test case described here.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 13:21:13 +01:00
Johannes Berg 48044eb490 netlink: fix netlink_ack() extack race
It seems that it's possible to toggle NETLINK_F_EXT_ACK
through setsockopt() while another thread/CPU is building
a message inside netlink_ack(), which could then trigger
the WARN_ON()s I added since if it goes from being turned
off to being turned on between allocating and filling the
message, the skb could end up being too small.

Avoid this whole situation by storing the value of this
flag in a separate variable and using that throughout the
function instead.

Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:22:28 +01:00
Thomas Falcon 2de09681e4 ibmvnic: Fix calculation of number of TX header descriptors
This patch correctly sets the number of additional header descriptors
that will be sent in an indirect SCRQ entry.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:20:39 +01:00
Ido Schimmel d965465b60 mlxsw: core: Fix possible deadlock
When an EMAD is transmitted, a timeout work item is scheduled with a
delay of 200ms, so that another EMAD will be retried until a maximum of
five retries.

In certain situations, it's possible for the function waiting on the
EMAD to be associated with a work item that is queued on the same
workqueue (`mlxsw_core`) as the timeout work item. This results in
flushing a work item on the same workqueue.

According to commit e159489baa ("workqueue: relax lockdep annotation
on flush_work()") the above may lead to a deadlock in case the workqueue
has only one worker active or if the system in under memory pressure and
the rescue worker is in use. The latter explains the very rare and
random nature of the lockdep splats we have been seeing:

[   52.730240] ============================================
[   52.736179] WARNING: possible recursive locking detected
[   52.742119] 4.14.0-rc3jiri+ #4 Not tainted
[   52.746697] --------------------------------------------
[   52.752635] kworker/1:3/599 is trying to acquire lock:
[   52.758378]  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c4fa4>] flush_work+0x3a4/0x5e0
[   52.767837]
               but task is already holding lock:
[   52.774360]  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[   52.784495]
               other info that might help us debug this:
[   52.791794]  Possible unsafe locking scenario:
[   52.798413]        CPU0
[   52.801144]        ----
[   52.803875]   lock(mlxsw_core_driver_name);
[   52.808556]   lock(mlxsw_core_driver_name);
[   52.813236]
                *** DEADLOCK ***
[   52.819857]  May be due to missing lock nesting notation
[   52.827450] 3 locks held by kworker/1:3/599:
[   52.832221]  #0:  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[   52.842846]  #1:  ((&(&bridge->fdb_notify.dw)->work)){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
[   52.854537]  #2:  (rtnl_mutex){+.+.}, at: [<ffffffff822ad8e7>] rtnl_lock+0x17/0x20
[   52.863021]
               stack backtrace:
[   52.867890] CPU: 1 PID: 599 Comm: kworker/1:3 Not tainted 4.14.0-rc3jiri+ #4
[   52.875773] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[   52.886267] Workqueue: mlxsw_core mlxsw_sp_fdb_notify_work [mlxsw_spectrum]
[   52.894060] Call Trace:
[   52.909122]  __lock_acquire+0xf6f/0x2a10
[   53.025412]  lock_acquire+0x158/0x440
[   53.047557]  flush_work+0x3c4/0x5e0
[   53.087571]  __cancel_work_timer+0x3ca/0x5e0
[   53.177051]  cancel_delayed_work_sync+0x13/0x20
[   53.182142]  mlxsw_reg_trans_bulk_wait+0x12d/0x7a0 [mlxsw_core]
[   53.194571]  mlxsw_core_reg_access+0x586/0x990 [mlxsw_core]
[   53.225365]  mlxsw_reg_query+0x10/0x20 [mlxsw_core]
[   53.230882]  mlxsw_sp_fdb_notify_work+0x2a3/0x9d0 [mlxsw_spectrum]
[   53.237801]  process_one_work+0x8f1/0x12f0
[   53.321804]  worker_thread+0x1fd/0x10c0
[   53.435158]  kthread+0x28e/0x370
[   53.448703]  ret_from_fork+0x2a/0x40
[   53.453017] mlxsw_spectrum 0000:01:00.0: EMAD retries (2/5) (tid=bf4549b100000774)
[   53.453119] mlxsw_spectrum 0000:01:00.0: EMAD retries (5/5) (tid=bf4549b100000770)
[   53.453132] mlxsw_spectrum 0000:01:00.0: EMAD reg access failed (tid=bf4549b100000770,reg_id=200b(sfn),type=query,status=0(operation performed))
[   53.453143] mlxsw_spectrum 0000:01:00.0: Failed to get FDB notifications

Fix this by creating another workqueue for EMAD timeouts, thereby
preventing the situation of a work item trying to flush a work item
queued on the same workqueue.

Fixes: caf7297e7a ("mlxsw: core: Introduce support for asynchronous EMAD register access")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:19:15 +01:00
David S. Miller 1b72bf5a07 Just a single fix, for a WoWLAN-related part of CVE-2017-13080.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEExu3sM/nZ1eRSfR9Ha3t4Rpy0AB0FAlnkt7YACgkQa3t4Rpy0
 AB3UZg/8CsZr3SeDM1CxLTmDxJ734qq2kcoBo8nJgJqSwIWE5goGXCJSIGQw0o9O
 X8IF/xKj+Vgqfmxs73dnZM15klkVc9HBOdMdnjayWZZa5h6wKFFVNdjkMsrFxdV8
 iugcCaUYCg1/Sc/qq4hBWLe5k3mQa5fB1clr7qykYEfIuY2ORpEJ+P2Uqjb5/RKx
 ugLO1qdCVMk7kqB/Fc8XKrPdiCgmfaMdz9lTqm1yXDThQe0rNxmWR339YpleqEiA
 lWIOZ53OlHZ3LTz8YDPcNA7mf5KtK1qQc60kZLygDcy5swvKaSIUaLfYOIjsPLIi
 5tJ4tztEPpi5RTMPHkOLCcdGCT2bQ9InCmgSmEwymMbn36vaLuyH+9pfkqgbunUq
 M1C07VNG9HUi+Qz6mhtbY6ZyG0s1M09rAouCOGo4T9x7YHA73BetvcxYcxhdTm6t
 vOTDd/xiWM1Dsgc2tu41A+jB3cShi5so/OLE1KcLe4r1cso6+FdR+p5VW6XyRV73
 Kbdh39azsdlyK6L+BW6cWCm/DUstourVmOz7mPebE0NW7N8cEhTtvclu0dz1gdKL
 nBMha+lOBmy2+acU5Z58f6/jG8RCMcyiD+0z9yrZmOmO4OgVyxHXP2GrsQ1d/Qr5
 vhBLT1LFxcvJpIlKZy3dqv1lZQ10suwcsFzJG3NsQbczV7Vk9aM=
 =Iwyf
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2017-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a single fix, for a WoWLAN-related part of CVE-2017-13080.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:27:16 +01:00
Xin Long 823038ca03 dev_ioctl: add missing NETDEV_CHANGE_TX_QUEUE_LEN event notification
When changing dev tx_queue_len via netlink or net-sysfs,
a NETDEV_CHANGE_TX_QUEUE_LEN event notification will be
called.

But dev_ioctl missed this event notification, which could
cause no userspace notification would be sent.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:23:10 +01:00
Or Gerlitz c019b5166e net/sched: cls_flower: Set egress_dev mark when calling into the HW driver
Commit 7091d8c '(net/sched: cls_flower: Add offload support using egress
Hardware device') made sure (when fl_hw_replace_filter is called) to put
the egress_dev mark on persisent structure instance. Hence, following calls
into the HW driver for stats and deletion will note it and act accordingly.

With commit de4784ca03 this property is lost and hence when called,
the HW driver failes to operate (stats, delete) on the offloaded flow.

Fix it by setting the egress_dev flag whenever the ingress device is
different from the hw device since this is exactly the condition under
which we're calling into the HW driver through the egress port net-device.

Fixes: de4784ca03 ('net: sched: get rid of struct tc_to_netdev')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:18:59 +01:00
Cong Wang 0ad646c81b tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e4242 ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:02:54 +01:00
Nicolas Dichtel 2459b4c635 net: enable interface alias removal via rtnl
IFLA_IFALIAS is defined as NLA_STRING. It means that the minimal length of
the attribute is 1 ("\0"). However, to remove an alias, the attribute
length must be 0 (see dev_set_alias()).

Let's define the type to NLA_BINARY to allow 0-length string, so that the
alias can be removed.

Example:
$ ip l s dummy0 alias foo
$ ip l l dev dummy0
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff
    alias foo

Before the patch:
$ ip l s dummy0 alias ""
RTNETLINK answers: Numerical result out of range

After the patch:
$ ip l s dummy0 alias ""
$ ip l l dev dummy0
5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ae:20:30:4f:a7:f3 brd ff:ff:ff:ff:ff:ff

CC: Oliver Hartkopp <oliver@hartkopp.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 96ca4a2cc1 ("net: remove ifalias on empty given alias")
Reported-by: Julien FLoret <julien.floret@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:52:43 +01:00
David S. Miller 2fd7c5abb8 Merge branch 'rtnetlink-dev-notification-fixes'
Xin Long says:

====================
rtnetlink: a bunch of fixes for userspace notifications in changing dev properties

Whenever any property of a link, address, route, etc. changes by whatever way,
kernel should notify the programs that listen for such events in userspace.

The patchet "rtnetlink: Cleanup user notifications for netdev events" tried to
fix a redundant notifications issue, but it also introduced a side effect.

After that, user notifications could only be sent when changing dev properties
via netlink api. As it removed some events process in rtnetlink_event where
the notifications was sent to users.

It resulted in no notification generated when dev properties are changed via
other ways, like ioctl, sysfs, etc. It may cause some user programs doesn't
work as expected because of the missing notifications.

This patchset will fix it by bringing some of these netdev events back and
also fix the old redundant notifications issue with a proper way.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:45 +01:00
Xin Long 2d7f669b42 rtnetlink: do not set notification for tx_queue_len in do_setlink
NETDEV_CHANGE_TX_QUEUE_LEN event process in rtnetlink_event would
send a notification for userspace and tx_queue_len's setting in
do_setlink would trigger NETDEV_CHANGE_TX_QUEUE_LEN.

So it shouldn't set DO_SETLINK_NOTIFY status for this change to
send a notification any more.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:45 +01:00
Xin Long 64ff90cc2e rtnetlink: check DO_SETLINK_NOTIFY correctly in do_setlink
The check 'status & DO_SETLINK_NOTIFY' in do_setlink doesn't really
work after status & DO_SETLINK_MODIFIED, as:

  DO_SETLINK_MODIFIED 0x1
  DO_SETLINK_NOTIFY 0x3

Considering that notifications are suppposed to be sent only when
status have the flag DO_SETLINK_NOTIFY, the right check would be:

  (status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY

This would avoid lots of duplicated notifications when setting some
properties of a link.

Fixes: ba9989069f ("rtnl/do_setlink(): notify when a netdev is modified")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:45 +01:00
Xin Long dc709f3757 rtnetlink: bring NETDEV_CHANGEUPPER event process back in rtnetlink_event
libteam needs this event notification in userspace when dev's master
dev has been changed. After this, the redundant notifications issue
would be fixed in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY
correctly in do_setlink'.

Fixes: b6b36eb23a ("rtnetlink: Do not generate notifications for NETDEV_CHANGEUPPER event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:45 +01:00
Xin Long e6e6659446 rtnetlink: bring NETDEV_POST_TYPE_CHANGE event process back in rtnetlink_event
As I said in patch 'rtnetlink: bring NETDEV_CHANGEMTU event process back
in rtnetlink_event', removing NETDEV_POST_TYPE_CHANGE event was not the
right fix for the redundant notifications issue.

So bring this event process back to rtnetlink_event and the old redundant
notifications issue would be fixed in the later patch 'rtnetlink: check
DO_SETLINK_NOTIFY correctly in do_setlink'.

Fixes: aef091ae58 ("rtnetlink: Do not generate notifications for POST_TYPE_CHANGE event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:45 +01:00
Xin Long ebdcf0450b rtnetlink: bring NETDEV_CHANGE_TX_QUEUE_LEN event process back in rtnetlink_event
The same fix for changing mtu in the patch 'rtnetlink: bring
NETDEV_CHANGEMTU event process back in rtnetlink_event' is
needed for changing tx_queue_len.

Note that the redundant notifications issue for tx_queue_len
will be fixed in the later patch 'rtnetlink: do not send
notification for tx_queue_len in do_setlink'.

Fixes: 27b3b551d8 ("rtnetlink: Do not generate notifications for NETDEV_CHANGE_TX_QUEUE_LEN event")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:44 +01:00
Xin Long 8a212589fe rtnetlink: bring NETDEV_CHANGEMTU event process back in rtnetlink_event
Commit 085e1a65f0 ("rtnetlink: Do not generate notifications for MTU
events") tried to fix the redundant notifications issue when ip link
set mtu by removing NETDEV_CHANGEMTU event process in rtnetlink_event.

But it also resulted in no notification generated when dev's mtu is
changed via other methods, like:
  'ifconfig eth1 mtu 1400' or 'echo 1400 > /sys/class/net/eth1/mtu'
It would cause users not to be notified by this change.

This patch is to fix it by bringing NETDEV_CHANGEMTU event back into
rtnetlink_event, and the redundant notifications issue will be fixed
in the later patch 'rtnetlink: check DO_SETLINK_NOTIFY correctly in
do_setlink'.

Fixes: 085e1a65f0 ("rtnetlink: Do not generate notifications for MTU events")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 20:48:44 +01:00
Johannes Berg fdf7cb4185 mac80211: accept key reinstall without changing anything
When a key is reinstalled we can reset the replay counters
etc. which can lead to nonce reuse and/or replay detection
being impossible, breaking security properties, as described
in the "KRACK attacks".

In particular, CVE-2017-13080 applies to GTK rekeying that
happened in firmware while the host is in D3, with the second
part of the attack being done after the host wakes up. In
this case, the wpa_supplicant mitigation isn't sufficient
since wpa_supplicant doesn't know the GTK material.

In case this happens, simply silently accept the new key
coming from userspace but don't take any action on it since
it's the same key; this keeps the PN replay counters intact.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-10-16 13:02:03 +02:00
David S. Miller 8f718fa6bb Merge branch 'bnxt_en-fixes'
Michael Chan says:

====================
bnxt_en: bug fixes.

Various bug fixes for the VF/PF link change logic, VF resource checking,
potential firmware response corruption on NVRAM and DCB parameters,
and reading the wrong register for PCIe link speed on the VF.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:52 -07:00
Sankar Patchineelam 5b1e1a9ce0 bnxt_en: Fix possible corruption in DCB parameters from firmware.
hwrm_send_message() is replaced with _hwrm_send_message(), and
hwrm_cmd_lock mutex lock is grabbed for the whole period of
firmware call until the firmware DCB parameters have been copied.
This will prevent possible corruption of the firmware data.

Fixes: 7df4ae9fe8 ("bnxt_en: Implement DCBNL to support host-based DCBX.")
Signed-off-by: Sankar Patchineelam <sankar.patchineelam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Michael Chan cc72f3b1fe bnxt_en: Fix possible corrupted NVRAM parameters from firmware response.
In bnxt_find_nvram_item(), it is copying firmware response data after
releasing the mutex.  This can cause the firmware response data
to be corrupted if the next firmware response overwrites the response
buffer.  The rare problem shows up when running ethtool -i repeatedly.

Fix it by calling the new variant _hwrm_send_message_silent() that requires
the caller to take the mutex and to release it after the response data has
been copied.

Fixes: 3ebf6f0a09 ("bnxt_en: Add installed-package version reporting via Ethtool GDRVINFO")
Reported-by: Sarveswara Rao Mygapula <sarveswararao.mygapula@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Michael Chan 021570793d bnxt_en: Fix VF resource checking.
In bnxt_sriov_enable(), we calculate to see if we have enough hardware
resources to enable the requested number of VFs.  The logic to check
for minimum completion rings and statistics contexts is missing.  Add
the required checks so that VF configuration won't fail.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Vasundhara Volam 7ab0760f51 bnxt_en: Fix VF PCIe link speed and width logic.
PCIE PCIE_EP_REG_LINK_STATUS_CONTROL register is only defined in PF
config space, so we must read it from the PF.

Fixes: 90c4f788f6 ("bnxt_en: Report PCIe link speed and width during driver load")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Michael Chan e2dc9b6e38 bnxt_en: Don't use rtnl lock to protect link change logic in workqueue.
As a further improvement to the PF/VF link change logic, use a private
mutex instead of the rtnl lock to protect link change logic.  With the
new mutex, we don't have to take the rtnl lock in the workqueue when
we have to handle link related functions.  If the VF and PF drivers
are running on the same host and both take the rtnl lock and one is
waiting for the other, it will cause timeout.  This patch fixes these
timeouts.

Fixes: 90c694bb71 ("bnxt_en: Fix RTNL lock usage on bnxt_update_link().")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Michael Chan c213eae8d3 bnxt_en: Improve VF/PF link change logic.
Link status query firmware messages originating from the VFs are forwarded
to the PF.  The driver handles these interactions in a workqueue for the
VF and PF.  The VF driver waits for the response from the PF in the
workqueue.  If the PF and VF driver are running on the same host and the
work for both PF and VF are queued on the same workqueue, the VF driver
may not get the response if the PF work item is queued behind it on the
same workqueue.  This will lead to the VF link query message timing out.

To prevent this, we create a private workqueue for PFs instead of using
the common workqueue.  The VF query and PF response will never be on
the same workqueue.

Fixes: c0c050c58d ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:51:51 -07:00
Vivien Didelot 3efc93c2bc net: dsa: mv88e6060: fix switch MAC address
The 88E6060 Ethernet switch always transmits the multicast bit of the
switch MAC address as a zero. It re-uses the corresponding bit 8 of the
register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".

If the "DiffAddr" bit is 0, then all ports transmit the same source
address. If it is set to 1, then bit 2:0 are used for the port number.

The mv88e6060 driver is currently wrongly shifting the MAC address byte
0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:40:03 -07:00
Guillaume Nault 5903f59493 l2tp: check ps->sock before running pppol2tp_session_ioctl()
When pppol2tp_session_ioctl() is called by pppol2tp_tunnel_ioctl(),
the session may be unconnected. That is, it was created by
pppol2tp_session_create() and hasn't been connected with
pppol2tp_connect(). In this case, ps->sock is NULL, so we need to check
for this case in order to avoid dereferencing a NULL pointer.

Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:38:56 -07:00
Wenhua Shi 09001b03f7 net: fix typo in skbuff.c
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-14 18:23:43 -07:00