1
0
Fork 0
Commit Graph

1590 Commits (13ba17bee18e321b073b49a88dcab10881f757da)

Author SHA1 Message Date
Björn Töpel 02671e23e7 xsk: wire up XDP_SKB side of AF_XDP
This commit wires up the xskmap to XDP_SKB layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03 15:55:24 -07:00
Florian Fainelli e283de3a4f net: core: Inline netdev_features_size_check()
We do not require this inline function to be used in multiple different
locations, just inline it where it gets used in register_netdevice().

Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 14:24:19 -04:00
Ilya Lesokhin ebf4e808fa net: Add Software fallback infrastructure for socket dependent offloads
With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01 09:42:46 -04:00
Florian Fainelli 3ac305c386 net: core: Assert the size of netdev_featres_t
We have about 53 netdev_features_t bits defined and counting, add a
build time check to catch when an u64 type will not be enough and we
will have to convert that to a bitmap. This is done in
register_netdevice() for convenience.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 22:50:36 -04:00
Alexander Duyck 1b837d489e net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash
I am dropping the export of __skb_tx_hash as after my patches nobody is
using it outside of the net/core/dev.c file. In addition I am renaming and
repurposing it to just be a static declaration of skb_tx_hash since that
was the only user for it at this point. By doing this the compiler can
inline it into __netdev_pick_tx as that will improve performance.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29 22:01:33 -04:00
Kirill Tkhai 3f5ecd8a90 net: Fix coccinelle warning
kbuild test robot says:

  >coccinelle warnings: (new ones prefixed by >>)
  >>> net/core/dev.c:1588:2-3: Unneeded semicolon

So, let's remove it.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 13:53:14 -04:00
Nikita V. Shirokov f761312023 bpf: fix xdp_generic for bpf_adjust_tail usecase
When bpf_adjust_tail was introduced for generic xdp, it changed skb's tail
pointer, so it was pointing to the new "end of the packet". However skb's
len field wasn't properly modified, so on the wire ethernet frame had
original (or even bigger, if adjust_head was used) size. This diff is
fixing this.

Fixes: 198d83bb3 (" bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail")
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-26 22:56:40 +02:00
David S. Miller e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David S. Miller 1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Stephen Hemminger 0fe554a46a hv_netvsc: propogate Hyper-V friendly name into interface alias
This patch implement the 'Device Naming' feature of the Hyper-V
network device API. In Hyper-V on the host through the GUI or PowerShell
it is possible to enable the device naming feature which causes
the host to make available to the guest the name of the device.
This shows up in the RNDIS protocol as the friendly name.

The name has no particular meaning and is limited to 256 characters.
The value can only be set via PowerShell on the host, but could
be scripted for mass deployments. The default value is the
string 'Network Adapter' and since that is the same for all devices
and useless, the driver ignores it.

In Windows, the value goes into a registry key for use in SNMP
ifAlias. For Linux, this patch puts the value in the network
device alias property; where it is visible in ip tools and SNMP.

The host provided ifAlias is just a suggestion, and can be
overridden by later ip commands.

Also requires exporting dev_set_alias in netdev core.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 21:19:46 -04:00
Nikita V. Shirokov 198d83bb3b bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for generic XDP we need to reflect this packet's length change by
adjusting skb's tail pointer

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-18 23:34:16 +02:00
Toshiaki Makita 7ce2367254 vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi
Syzkaller spotted an old bug which leads to reading skb beyond tail by 4
bytes on vlan tagged packets.
This is caused because skb_vlan_tagged_multi() did not check
skb_headlen.

BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline]
BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline]
BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
  eth_type_vlan include/linux/if_vlan.h:283 [inline]
  skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
  vlan_features_check include/linux/if_vlan.h:672 [inline]
  dflt_features_check net/core/dev.c:2949 [inline]
  netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
  validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
  __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
  packet_snd net/packet/af_packet.c:2944 [inline]
  packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  sock_write_iter+0x3b9/0x470 net/socket.c:909
  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
  do_iter_write+0x30d/0xd40 fs/read_write.c:932
  vfs_writev fs/read_write.c:977 [inline]
  do_writev+0x3c9/0x830 fs/read_write.c:1012
  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
  SyS_writev+0x56/0x80 fs/read_write.c:1082
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43ffa9
RSP: 002b:00007fff2cff3948 EFLAGS: 00000217 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ffa9
RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006cb018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004018d0
R13: 0000000000401960 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
  slab_post_alloc_hook mm/slab.h:445 [inline]
  slab_alloc_node mm/slub.c:2737 [inline]
  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
  __kmalloc_reserve net/core/skbuff.c:138 [inline]
  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
  alloc_skb include/linux/skbuff.h:984 [inline]
  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
  packet_snd net/packet/af_packet.c:2894 [inline]
  packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  sock_write_iter+0x3b9/0x470 net/socket.c:909
  do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
  do_iter_write+0x30d/0xd40 fs/read_write.c:932
  vfs_writev fs/read_write.c:977 [inline]
  do_writev+0x3c9/0x830 fs/read_write.c:1012
  SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
  SyS_writev+0x56/0x80 fs/read_write.c:1082
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Reported-and-tested-by: syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:59:28 -04:00
Eric Dumazet a9d48205d0 net: fool proof dev_valid_name()
We want to use dev_valid_name() to validate tunnel names,
so better use strnlen(name, IFNAMSIZ) than strlen(name) to make
sure to not upset KASAN.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-05 15:16:15 -04:00
David S. Miller c0b458a946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:

1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
   MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE

2) In 'net-next' params->log_rq_size is renamed to be
   params->log_rq_mtu_frames.

3) In 'net-next' params->hard_mtu is added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 19:49:34 -04:00
Kirill Tkhai fc1dd36992 net: Remove net_rwsem from {, un}register_netdevice_notifier()
These functions take net_rwsem, while wireless_nlevent_flush()
also takes it. But down_read() can't be taken recursive,
because of rw_semaphore design, which prevents it to be occupied
by only readers forever.

Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(),
net list can't change, so these down_read()/up_read() can be removed.

Fixes: f0b07bb151 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 22:24:58 -04:00
Kirill Tkhai 328fbe747a net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()
{un,}register_netdevice_notifier() iterate over all net namespaces
hashed to net_namespace_list. But pernet_operations register and
unregister netdevices in unhashed net namespace, and they are not
seen for netdevice notifiers. This results in asymmetry:

1)Race with register_netdevice_notifier()
  pernet_operations::init(net)	...
   register_netdevice()		...
    call_netdevice_notifiers()  ...
      ... nb is not called ...
  ...				register_netdevice_notifier(nb) -> net skipped
  ...				...
  list_add_tail(&net->list, ..) ...

  Then, userspace stops using net, and it's destructed:

  pernet_operations::exit(net)
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

This always happens with net::loopback_dev, but it may be not the only device.

2)Race with unregister_netdevice_notifier()
  pernet_operations::init(net)
   register_netdevice()
    call_netdevice_notifiers()
      ... nb is called ...

  Then, userspace stops using net, and it's destructed:

  list_del_rcu(&net->list)	...
  pernet_operations::exit(net)  unregister_netdevice_notifier(nb) -> net skipped
   dev_change_net_namespace()	...
    call_netdevice_notifiers()
      ... nb is not called ...
   unregister_netdevice()
    call_netdevice_notifiers()
      ... nb is not called ...

This race is more danger, since dev_change_net_namespace() moves real
network devices, which use not trivial netdevice notifiers, and if this
will happen, the system will be left in unpredictable state.

The patch closes the race. During the testing I found two places,
where register_netdevice_notifier() is called from pernet init/exit
methods (which led to deadlock) and fixed them (see previous patches).

The review moved me to one more unusual registration place:
raw_init() (can driver). It may be a reason of problems,
if someone creates in-kernel CAN_RAW sockets, since they
will be destroyed in exit method and raw_release()
will call unregister_netdevice_notifier(). But grep over
kernel tree does not show, someone creates such sockets
from kernel space.

Theoretically, there can be more places like this, and which are
hidden from review, but we found them on the first bumping there
(since there is no a race, it will be 100% reproducible).

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 10:59:35 -04:00
Gal Pressman 9daae9bd47 net: Call add/kill vid ndo on vlan filter feature toggling
NETIF_F_HW_VLAN_[CS]TAG_FILTER features require more than just a bit
flip in dev->features in order to keep the driver in a consistent state.
These features notify the driver of each added/removed vlan, but toggling
of vlan-filter does not notify the driver accordingly for each of the
existing vlans.

This patch implements a similar solution to NETIF_F_RX_UDP_TUNNEL_PORT
behavior (which notifies the driver about UDP ports in the same manner
that vids are reported).

Each toggling of the features propagates to the 8021q module, which
iterates over the vlans and call add/kill ndo accordingly.

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-30 09:58:59 -04:00
Kirill Tkhai f0b07bb151 net: Introduce net_rwsem to protect net_namespace_list
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.

This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.

Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29 13:47:53 -04:00
Kirill Tkhai 2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Eric Dumazet 1dfe82ebd7 net: fix possible out-of-bound read in skb_network_protocol()
skb mac header is not necessarily set at the time skb_network_protocol()
is called. Use skb->data instead.

BUG: KASAN: slab-out-of-bounds in skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
Read of size 2 at addr ffff8801b3097a0b by task syz-executor5/14242

CPU: 1 PID: 14242 Comm: syz-executor5 Not tainted 4.16.0-rc6+ #280
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23c/0x360 mm/kasan/report.c:412
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:443
 skb_network_protocol+0x46b/0x4b0 net/core/dev.c:2739
 harmonize_features net/core/dev.c:2924 [inline]
 netif_skb_features+0x509/0x9b0 net/core/dev.c:3011
 validate_xmit_skb+0x81/0xb00 net/core/dev.c:3084
 validate_xmit_skb_list+0xbf/0x120 net/core/dev.c:3142
 packet_direct_xmit+0x117/0x790 net/packet/af_packet.c:256
 packet_snd net/packet/af_packet.c:2944 [inline]
 packet_sendmsg+0x3aed/0x60b0 net/packet/af_packet.c:2969
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:639
 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
 __sys_sendmsg+0xe5/0x210 net/socket.c:2081

Fixes: 19acc32725 ("gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Reported-by: Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 10:53:27 -04:00
Kirill Tkhai 070f2d7e26 net: Drop NETDEV_UNREGISTER_FINAL
Last user is gone after bdf5bd7f21 "rds: tcp: remove
register_netdevice_notifier infrastructure.", so we can
remove this netdevice command. This allows to delete
rtnl_lock() in netdev_run_todo(), which is hot path for
net namespace unregistration.

dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 11:34:00 -04:00
Kirill Tkhai ede2762d93 net: Make NETDEV_XXX commands enum { }
This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
Since the cmd is used in usnic_ib_netdev_event_to_string()
to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL
from everywhere, we'd have holes in event2str[] in this
function.

Instead of that, let's make NETDEV_XXX commands names
available for everyone, and to define netdev_cmd_to_name()
in the way we won't have to shaffle names after their
numbers are changed.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 11:33:26 -04:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Kirill Tkhai b0f3debc9a net: Use rtnl_lock_killable() in register_netdev()
This patch adds rtnl_lock_killable() to one of hot path
using rtnl_lock().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16 12:31:19 -04:00
Eric Dumazet 4dcb31d464 net: use skb_to_full_sk() in skb_update_prio()
Andrei Vagin reported a KASAN: slab-out-of-bounds error in
skb_update_prio()

Since SYNACK might be attached to a request socket, we need to
get back to the listener socket.
Since this listener is manipulated without locks, add const
qualifiers to sock_cgroup_prioidx() so that the const can also
be used in skb_update_prio()

Also add the const qualifier to sock_cgroup_classid() for consistency.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-14 12:53:23 -04:00
Gal Pressman de8d5ab2ff net: Make RX-FCS and HW GRO mutually exclusive
Same as LRO, hardware GRO cannot be enabled with RX-FCS.
When both are requested, hardware GRO will be dropped.

Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 11:15:16 -04:00
Paolo Abeni f5426250a6 net: introduce IFF_NO_RX_HANDLER
Some network devices - notably ipvlan slave - are not compatible with
any kind of rx_handler. Currently the hook can be installed but any
configuration (bridge, bond, macsec, ...) is nonfunctional.

This change allocates a priv_flag bit to mark such devices and explicitly
forbid installing a rx_handler if such bit is set. The new bit is used
by ipvlan slave device.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:00:08 -05:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Gal Pressman e6c6a92905 net: Make RX-FCS and LRO mutually exclusive
LRO and RX-FCS offloads cannot be enabled at the same time since it is
not clear what should happen to the FCS of each coalesced packet.
The FCS is not really part of the TCP payload, hence cannot be merged
into one big packet. On the other hand, providing one big LRO packet
with one FCS contradicts the RX-FCS feature goal.

Use the fix features mechanism in order to prevent intersection of the
features and drop LRO in case RX-FCS is requested.

Enabling RX-FCS while LRO is enabled will result in:
$ ethtool -K ens6 rx-fcs on
Actual changes:
large-receive-offload: off [requested on]
rx-fcs: on

Signed-off-by: Gal Pressman <galp@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05 10:25:51 -05:00
Mike Manning 50d629e7a8 net: allow interface to be set into VRF if VLAN interface in same VRF
Setting an interface into a VRF fails with 'RTNETLINK answers: File
exists' if one of its VLAN interfaces is already in the same VRF.
As the VRF is an upper device of the VLAN interface, it is also showing
up as an upper device of the interface itself. The solution is to
restrict this check to devices other than master. As only one master
device can be linked to a device, the check in this case is that the
upper device (VRF) being linked to is not the same as the master device
instead of it not being any one of the upper devices.

The following example shows an interface ens12 (with a VLAN interface
ens12.10) being set into VRF green, which behaves as expected:

  # ip link add link ens12 ens12.10 type vlan id 10
  # ip link set dev ens12 master vrfgreen
  # ip link show dev ens12
    3: ens12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
       master vrfgreen state UP mode DEFAULT group default qlen 1000
       link/ether 52:54:00:4c:a0:45 brd ff:ff:ff:ff:ff:ff

But if the VLAN interface has previously been set into the same VRF,
then setting the interface into the VRF fails:

  # ip link set dev ens12 nomaster
  # ip link set dev ens12.10 master vrfgreen
  # ip link show dev ens12.10
    39: ens12.10@ens12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    qdisc noqueue master vrfgreen state UP mode DEFAULT group default
    qlen 1000 link/ether 52:54:00:4c:a0:45 brd ff:ff:ff:ff:ff:ff
  # ip link set dev ens12 master vrfgreen
    RTNETLINK answers: File exists

The workaround is to move the VLAN interface back into the default VRF
beforehand, but it has to be shut first so as to avoid the risk of
traffic leaking from the VRF. This fix avoids needing this workaround.

Signed-off-by: Mike Manning <mmanning@att.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 21:25:29 -05:00
Gal Pressman 3a053b1a30 net: Fix spelling mistake "greater then" -> "greater than"
Fix trivial spelling mistake "greater then" -> "greater than".

Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:39:39 -05:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Jakub Kicinski ac5b70198a net: fix race on decreasing number of TX queues
netif_set_real_num_tx_queues() can be called when netdev is up.
That usually happens when user requests change of number of
channels/rings with ethtool -L.  The procedure for changing
the number of queues involves resetting the qdiscs and setting
dev->num_tx_queues to the new value.  When the new value is
lower than the old one, extra care has to be taken to ensure
ordering of accesses to the number of queues vs qdisc reset.

Currently the queues are reset before new dev->num_tx_queues
is assigned, leaving a window of time where packets can be
enqueued onto the queues going down, leading to a likely
crash in the drivers, since most drivers don't check if TX
skbs are assigned to an active queue.

Fixes: e6484930d7 ("net: allocate tx queues in register_netdevice")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:12:55 -05:00
David Ahern 330c7272c4 net: Make dn_ptr depend on CONFIG_DECNET
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 11:55:33 -05:00
Kirill Tkhai 2608e6b7ad net: Convert default_device_ops
These pernet operations consist of exit() and exit_batch() methods.

default_device_exit() moves not-local and virtual devices to init_net.
There is nothing exciting, because this may happen in any time
on a working system, and rtnl_lock() and synchronize_net() protect
us from all cases of external dereference.

The same for default_device_exit_batch(). Similar unregisteration
may happen in any time on a system. Here several lists (like todo_list),
which are accessed under rtnl_lock(). After rtnl_unlock() and
netdev_run_todo() all the devices are flushed.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:09 -05:00
Kirill Tkhai 88b8ffebdb net: Convert pernet_subsys ops, registered via net_dev_init()
There are:
1)dev_proc_ops and dev_mc_net_ops, which create and destroy
pernet proc file and not interesting for another net namespaces;
2)netdev_net_ops, which creates pernet hashes, which are not
touched by another pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:07 -05:00
Cong Wang 48bfd55e7e net_sched: plug in qdisc ops change_tx_queue_len
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Cong Wang 6a643ddb56 net: introduce helper dev_change_tx_queue_len()
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:42:15 -05:00
Nicolas Dichtel 38e01b3056 dev: advertise the new ifindex when the netns iface changes
The goal is to let the user follow an interface that moves to another
netns.

CC: Jiri Benc <jbenc@redhat.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:23:52 -05:00
Nicolas Dichtel c36ac8e230 dev: always advertise the new nsid when the netns iface changes
The user should be able to follow any interface that moves to another
netns.  There is no reason to hide physical interfaces.

CC: Jiri Benc <jbenc@redhat.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29 12:23:51 -05:00
David S. Miller 5ca114400d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.

The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 13:51:56 -05:00
Florian Fainelli 7a006d5988 net: core: Fix kernel-doc for netdev_upper_link()
Fixes the following warnings:
./net/core/dev.c:6438: warning: No description found for parameter 'extack'
./net/core/dev.c:6461: warning: No description found for parameter 'extack'

Fixes: 42ab19ee90 ("net: Add extack to upper device linking")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 11:06:50 -05:00
Florian Fainelli 5de30d5df9 net: core: Fix kernel-doc for call_netdevice_notifiers_info()
Remove the @dev comment, since we do not have a net_device argument, fixes the
following kernel doc warning: /net/core/dev.c:1707: warning: Excess function
parameter 'dev' description in 'call_netdevice_notifiers_info'

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 11:06:50 -05:00
Eric Dumazet 7c68d1a6b4 net: qdisc_pkt_len_init() should be more robust
Without proper validation of DODGY packets, we might very well
feed qdisc_pkt_len_init() with invalid GSO packets.

tcp_hdrlen() might access out-of-bound data, so let's use
skb_header_pointer() and proper checks.

Whole story is described in commit d0c081b491 ("flow_dissector:
properly cap thoff field")

We have the goal of validating DODGY packets earlier in the stack,
so we might very well revert this fix in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Reported-by: syzbot+9da69ebac7dddd804552@syzkaller.appspotmail.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:00:05 -05:00
Alexander Duyck d584527c70 net: Cap number of queues even with accel_priv
With the recent fix to ixgbe we can cap the number of queues always
regardless of if accel_priv is being used or not since the actual number of
queues are being reported via real_num_tx_queues.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-12 08:20:36 -08:00
David S. Miller 8c2e6c904f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-01-11

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various BPF related improvements and fixes to nfp driver: i) do
   not register XDP RXQ structure to control queues, ii) round up
   program stack size to word size for nfp, iii) restrict MTU changes
   when BPF offload is active, iv) add more fully featured relocation
   support to JIT, v) add support for signed compare instructions to
   the nfp JIT, vi) export and reuse verfier log routine for nfp, and
   many more, from Jakub, Quentin and Nic.

2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when
   we hit kmalloc failure path, from Alexei.

3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc()
   allocated memory was only kfree()'ed, and ii) fix a memory leak where
   RX queue was not freed in netif_free_rx_queues(), from Jakub.

4) Add a sample for transferring XDP meta data into the skb, here it
   is used for setting skb->mark with the buffer from XDP, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11 13:59:41 -05:00
Jakub Kicinski 82aaff2f63 net: free RX queue structures
Looks like commit e817f85652 ("xdp: generic XDP handling of
xdp_rxq_info") replaced kvfree(dev->_rx) in free_netdev() with
a call to netif_free_rx_queues() which doesn't actually free
the rings?

While at it remove the unnecessary temporary variable.

Fixes: e817f85652 ("xdp: generic XDP handling of xdp_rxq_info")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 12:06:17 +01:00
Jakub Kicinski 141b52a98a net: use the right variant of kfree
kvzalloc'ed memory should be kvfree'd.

Fixes: e817f85652 ("xdp: generic XDP handling of xdp_rxq_info")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 12:06:17 +01:00
David S. Miller a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Jesper Dangaard Brouer e817f85652 xdp: generic XDP handling of xdp_rxq_info
Hook points for xdp_rxq_info:
 * reg  : netif_alloc_rx_queues
 * unreg: netif_free_rx_queues

The net_device have some members (num_rx_queues + real_num_rx_queues)
and data-area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).

For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05 15:21:22 -08:00
David S. Miller 55a5ec9b77 Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"
This reverts commit 87c320e515.

Changing the error return code in some situations turns out to
be harmful in practice.  In particular Michael Ellerman reports
that DHCP fails on his powerpc machines, and this revert gets
things working again.

Johannes Berg agrees that this revert is the best course of
action for now.

Fixes: 029b6d1405 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 11:50:12 -05:00
David S. Miller 9f30e5c5c2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-12-22

1) Separate ESP handling from segmentation for GRO packets.
   This unifies the IPsec GSO and non GSO codepath.

2) Add asynchronous callbacks for xfrm on layer 2. This
   adds the necessary infrastructure to core networking.

3) Allow to use the layer2 IPsec GSO codepath for software
   crypto, all infrastructure is there now.

4) Also allow IPsec GSO with software crypto for local sockets.

5) Don't require synchronous crypto fallback on IPsec offloading,
   it is not needed anymore.

6) Check for xdo_dev_state_free and only call it if implemented.
   From Shannon Nelson.

7) Check for the required add and delete functions when a driver
   registers xdo_dev_ops. From Shannon Nelson.

8) Define xfrmdev_ops only with offload config.
   From Shannon Nelson.

9) Update the xfrm stats documentation.
   From Shannon Nelson.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 11:15:14 -05:00
David S. Miller fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Steffen Klassert f53c723902 net: Add asynchronous callbacks for xfrm on layer 2.
This patch implements asynchronous crypto callbacks
and a backlog handler that can be used when IPsec
is done at layer 2 in the TX path. It also extends
the skb validate functions so that we can update
the driver transmit return codes based on async
crypto operation or to indicate that we queued the
packet in a backlog queue.

Joint work with: Aviv Heller <avivh@mellanox.com>

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-20 10:41:36 +01:00
Steffen Klassert 3dca3f38cf xfrm: Separate ESP handling from segmentation for GRO packets.
We change the ESP GSO handlers to only segment the packets.
The ESP handling and encryption is defered to validate_xmit_xfrm()
where this is done for non GRO packets too. This makes the code
more robust and prepares for asynchronous crypto handling.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-20 10:41:31 +01:00
Michael Chan 56f5aa77cd net: Disable GRO_HW when generic XDP is installed on a device.
Hardware should not aggregate any packets when generic XDP is installed.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 10:38:36 -05:00
Michael Chan fb1f5f79ae net: Introduce NETIF_F_GRO_HW.
Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO.  With this flag, we can now independently turn on or off hardware
GRO when GRO is on.  Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it cannot be independently turned on or
off without affecting GRO.

Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream.  Logically,
GRO_HW should depend on GRO since it a subset, but we will let
individual drivers enforce this dependency as they see fit.

Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO.  In other
words, a lower device can independent have GRO/GRO_HW enabled or disabled
and no feature propagation is required.  This will preserve the current
GRO behavior.  This can be changed later if we decide to propagate GRO/
GRO_HW/RXCSUM from upper to lower devices.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 10:38:36 -05:00
Song Liu 2d17d8d79e xdp: linearize skb in netif_receive_generic_xdp()
In netif_receive_generic_xdp(), it is necessary to linearize all
nonlinear skb. However, in current implementation, skb with
troom <= 0 are not linearized. This patch fixes this by calling
skb_linearize() for all nonlinear skb.

Fixes: de8f3a83b0 ("bpf: add meta pointer for direct access")
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-15 14:34:36 +01:00
Willem de Bruijn 8d74e9f88d net: avoid skb_warn_bad_offload on IS_ERR
skb_warn_bad_offload warns when packets enter the GSO stack that
require skb_checksum_help or vice versa. Do not warn on arbitrary
bad packets. Packet sockets can craft many. Syzkaller was able to
demonstrate another one with eth_type games.

In particular, suppress the warning when segmentation returns an
error, which is for reasons other than checksum offload.

See also commit 36c9247449 ("net: WARN if skb_checksum_help() is
called on skb requiring segmentation") for context on this warning.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 15:14:10 -05:00
David S. Miller 51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
John Fastabend 6b3ba9146f net: sched: allow qdiscs to handle locking
This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improving
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
John Fastabend 6c148184b5 net: sched: cleanup qdisc_run and __qdisc_run semantics
Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To simplify reading these code paths this patch moves begin/end calls
into qdisc_run().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 13:32:25 -05:00
Johannes Berg 029b6d1405 Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
This reverts commit d6f295e9def0; some userspace (in the case
we noticed it's wpa_supplicant), is relying on the current
error code to determine that a fixed name interface already
exists.

Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 11:29:48 -05:00
Jakub Kicinski bd0b2e7fe6 net: xdp: make the stack take care of the tear down
Since day one of XDP drivers had to remember to free the program
on the remove path.  This leads to code duplication and is error
prone.  Make the stack query the installed programs on unregister
and if something is installed, remove the program.  Freeing of
program attached to XDP generic is moved from free_netdev() as well.

Because the remove will now be called before notifiers are
invoked, BPF offload state of the program will not get destroyed
before uninstall.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-03 00:27:57 +01:00
Jakub Kicinski 118b4aa25d net: xdp: avoid output parameters when querying XDP prog
The output parameters will get unwieldy if we want to add more
information about the program.  Simply pass the entire
struct netdev_bpf in.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-03 00:27:57 +01:00
David S. Miller e4be7baba8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2017-11-23

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several BPF offloading fixes, from Jakub. Among others:

    - Limit offload to cls_bpf and XDP program types only.
    - Move device validation into the driver and don't make
      any assumptions about the device in the classifier due
      to shared blocks semantics.
    - Don't pass offloaded XDP program into the driver when
      it should be run in native XDP instead. Offloaded ones
      are not JITed for the host in such cases.
    - Don't destroy device offload state when moved to
      another namespace.
    - Revert dumping offload info into user space for now,
      since ifindex alone is not sufficient. This will be
      redone properly for bpf-next tree.

2) Fix test_verifier to avoid using bpf_probe_write_user()
   helper in test cases, since it's dumping a warning into
   kernel log which may confuse users when only running tests.
   Switch to use bpf_trace_printk() instead, from Yonghong.

3) Several fixes for correcting ARG_CONST_SIZE_OR_ZERO semantics
   before it becomes uabi, from Gianluca. More specifically:

    - Add a type ARG_PTR_TO_MEM_OR_NULL that is used only
      by bpf_csum_diff(), where the argument is either a
      valid pointer or NULL. The subsequent ARG_CONST_SIZE_OR_ZERO
      then enforces a valid pointer in case of non-0 size
      or a valid pointer or NULL in case of size 0. Given
      that, the semantics for ARG_PTR_TO_MEM in combination
      with ARG_CONST_SIZE_OR_ZERO are now such that in case
      of size 0, the pointer must always be valid and cannot
      be NULL. This fix in semantics allows for bpf_probe_read()
      to drop the recently added size == 0 check in the helper
      that would become part of uabi otherwise once released.
      At the same time we can then fix bpf_probe_read_str() and
      bpf_perf_event_output() to use ARG_CONST_SIZE_OR_ZERO
      instead of ARG_CONST_SIZE in order to fix recently
      reported issues by Arnaldo et al, where LLVM optimizes
      two boundary checks into a single one for unknown
      variables where the verifier looses track of the variable
      bounds and thus rejects valid programs otherwise.

4) A fix for the verifier for the case when it detects
   comparison of two constants where the branch is guaranteed
   to not be taken at runtime. Verifier will rightfully prune
   the exploration of such paths, but we still pass the program
   to JITs, where they would complain about using reserved
   fields, etc. Track such dead instructions and sanitize
   them with mov r0,r0. Rejection is not possible since LLVM
   may generate them for valid C code and doesn't do as much
   data flow analysis as verifier. For bpf-next we might
   implement removal of such dead code and adjust branches
   instead. Fix from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 02:33:01 +09:00
Willem de Bruijn 0c19f846d5 net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:37:35 +09:00
Jakub Kicinski 441a33031f net: xdp: don't allow device-bound programs in driver mode
Currently device-bound programs are not able to run on the host
to save resources (host JIT is not invoked).  Don't allow XDP
programs to be attached without the HW_MODE flag.  In theory
if program is already translated for device offload the driver
should choose to offload it instead of loading it in the driver.
However, offloading translated program may still fail resulting
in device-bound program being run on the host.

Prevent this by refusing to attach device bound programs if
XDP_FLAGS_HW_MODE is not set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-11-21 00:37:35 +01:00
Jakub Kicinski 288b3de55a bpf: offload: move offload device validation out to the drivers
With TC shared block changes we can't depend on correct netdev
pointer being available in cls_bpf.  Move the device validation
to the driver.  Core will only make sure that offloaded programs
are always attached in the driver (or in HW by the driver).  We
trust that drivers which implement offload callbacks will perform
necessary checks.

Moving the checks to the driver is generally a useful thing,
in practice the check should be against a switchdev instance,
not a netdev, given that most ASICs will probably allow using
the same program on many ports.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-11-21 00:37:35 +01:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Rasmus Villemoes 87c320e515 net: core: dev_get_valid_name is now the same as dev_alloc_name_ns
If name contains a %, it's easy to see that this patch doesn't change
anything (other than eliminate the duplicate dev_valid_name
call). Otherwise, we'll now just spend a little time in snprintf()
copying name to the stack buffer allocated in dev_alloc_name_ns, and do
the __dev_get_by_name using that buffer rather than name.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes d6f295e9de net: core: maybe return -EEXIST in __dev_alloc_name
If we're given format string with no %d, -EEXIST is a saner error code.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 93809105cf net: core: check dev_valid_name in __dev_alloc_name
We currently only exclude non-sysfs-friendly names via
dev_get_valid_name; there doesn't seem to be a reason to allow such
names when we're called via dev_alloc_name.

This does duplicate the dev_valid_name check in the dev_get_valid_name()
case; we'll fix that shortly.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 6224abda0d net: core: drop pointless check in __dev_alloc_name
The only caller passes a stack buffer as buf, so it won't equal the
passed-in name. Moreover, we're already using buf as a scratch buffer
inside the if (p) {} block, so if buf and name were the same, that
snprintf() call would be overwriting its own format string.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes c46d7642e9 net: core: eliminate dev_alloc_name{,_ns} code duplication
dev_alloc_name contained a BUG_ON(), which I moved to dev_alloc_name_ns;
the only other caller of that already has the same BUG_ON.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 2c88b85598 net: core: move dev_alloc_name_ns a little higher
No functional change.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 51f299dd94 net: core: improve sanity checking in __dev_alloc_name
__dev_alloc_name is called from the public (and exported)
dev_alloc_name(), so we don't have a guarantee that strlen(name) is at
most IFNAMSIZ. If somebody manages to get __dev_alloc_name called with a
% char beyond the 31st character, we'd be making a snprintf() call that
will very easily crash the kernel (using an appropriate %p extension,
we'll likely dereference some completely bogus pointer).

In the normal case where strlen() is sane, we don't even save anything
by limiting to IFNAMSIZ, so just use strchr().

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:45 +09:00
Vasily Averin ee21b18b6b netdev: exit_net cleanup check added
Be sure that dev_base_head list initialized in net_init hook was return
to initial state

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 15:45:53 +09:00
Jakub Kicinski 248f346ffe xdp: allow attaching programs loaded for specific device
Pass the netdev pointer to bpf_prog_get_type().  This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jiri Pirko 46209401f8 net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath
In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 21:57:24 +09:00
Mark Rutland 6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
David S. Miller f8ddadc4db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 13:39:14 +01:00
Jesper Dangaard Brouer 1c601d829a bpf: cpumap xdp_buff to skb conversion and allocation
This patch makes cpumap functional, by adding SKB allocation and
invoking the network stack on the dequeuing CPU.

For constructing the SKB on the remote CPU, the xdp_buff in converted
into a struct xdp_pkt, and it mapped into the top headroom of the
packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
just a cpumap internal data structure, with info carried between
enqueue to dequeue.

If a driver doesn't have enough headroom it is simply dropped, with
return code -EOVERFLOW.  This will be picked up the xdp tracepoint
infrastructure, to allow users to catch this.

V2: take into account xdp->data_meta

V4:
 - Drop busypoll tricks, keeping it more simple.
 - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei

V5: correct RCU read protection around __netif_receive_skb_core.

V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:12:18 +01:00
Henrik Austad 8a5f2166a6 net: export netdev_txq_to_tc to allow sch_mqprio to compile as module
In commit 32302902ff ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[]

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Henrik Austad <haustad@cisco.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 17:00:08 +01:00
Cong Wang 0ad646c81b tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e4242 ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:02:54 +01:00
David Ahern 42ab19ee90 net: Add extack to upper device linking
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
David Ahern 51d0c04795 net: Add extack to netdev_notifier_info
Add netlink_ext_ack to netdev_notifier_info to allow notifier
handlers to return errors to userspace.

Clean up the initialization in dev.c such that extack is easily
added in subsequent patches where relevant. Specifically, remove
the init call in call_netdevice_notifiers_info and have callers
initalize on stack when info is declared.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
Nicolas Dichtel 6621dd29eb dev: advertise the new nsid when the netns iface changes
x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interfaces is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where is the link part.

There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is a x-netns interface (ie
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 18:04:41 -07:00
Florian Westphal 20e883204f net: core: fix kerneldoc comment
net/core/dev.c:1306: warning: No description found for parameter 'name'
net/core/dev.c:1306: warning: Excess function parameter 'alias' description in 'dev_get_alias'

Fixes: 6c5570016b ("net: core: decouple ifalias get/set from rtnl lock")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 10:28:22 -07:00
Florian Westphal 6c5570016b net: core: decouple ifalias get/set from rtnl lock
Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03 15:56:01 -07:00
Daniel Borkmann de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
David S. Miller 1f8d31d189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-23 10:16:53 -07:00
Alexey Dobriyan 52a59bd509 net: use 32-bit arithmetic while allocating net device
Private part of allocation is never big enough to warrant size_t.

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
	function                                     old     new   delta
	alloc_netdev_mqs                            1120    1110     -10

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-22 21:25:52 -07:00
Willem de Bruijn 581fe0ea61 net: orphan frags on stand-alone ptype in dev_queue_xmit_nit
Zerocopy skbs frags are copied when the skb is looped to a local sock.
Commit 1080e512d4 ("net: orphan frags on receive") introduced calls
to skb_orphan_frags to deliver_skb and __netif_receive_skb for this.

With msg_zerocopy, these skbs can also exist in the tx path and thus
loop from dev_queue_xmit_nit. This already calls deliver_skb in its
loop. But it does not orphan before a separate pt_prev->func().

Add the missing skb_orphan_frags_rx.

Changes
  v1->v2: handle skb_orphan_frags_rx failure

Fixes: 1f8b977ab3 ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-22 20:31:29 -07:00
Edward Cree 92dd5452c1 net: change skb->mac_header when Generic XDP calls adjust_head
Since XDP's view of the packet includes the MAC header, moving the start-
 of-packet with bpf_xdp_adjust_head needs to also update the offset of the
 MAC header (which is relative to skb->head, not to the skb->data that was
 changed).
Without this, tcpdump sees packets starting from the old MAC header rather
 than the new one, at least in my tests on the loopback device.

Fixes: b5cdae3291 ("net: Generic XDP")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-20 15:44:22 -07:00
John Fastabend bbbe211c29 net: rcu lock and preempt disable missing around generic xdp
do_xdp_generic must be called inside rcu critical section with preempt
disabled to ensure BPF programs are valid and per-cpu variables used
for redirect operations are consistent. This patch ensures this is true
and fixes the splat below.

The netif_receive_skb_internal() code path is now broken into two rcu
critical sections. I decided it was better to limit the preempt_enable/disable
block to just the xdp static key portion and the fallout is more
rcu_read_lock/unlock calls. Seems like the best option to me.

[  607.596901] =============================
[  607.596906] WARNING: suspicious RCU usage
[  607.596912] 4.13.0-rc4+ #570 Not tainted
[  607.596917] -----------------------------
[  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
[  607.596927]
[  607.596927] other info that might help us debug this:
[  607.596927]
[  607.596933]
[  607.596933] rcu_scheduler_active = 2, debug_locks = 1
[  607.596938] 2 locks held by pool/14624:
[  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
[  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
[  607.597000]
[  607.597000] stack backtrace:
[  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
[  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
[  607.597016] Call Trace:
[  607.597027]  dump_stack+0x67/0x92
[  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
[  607.597054]  do_xdp_generic+0x313/0xa50
[  607.597068]  ? time_hardirqs_on+0x5b/0x150
[  607.597076]  ? mark_held_locks+0x6b/0xc0
[  607.597088]  ? netdev_pick_tx+0x150/0x150
[  607.597117]  netif_rx_internal+0x205/0x3f0
[  607.597127]  ? do_xdp_generic+0xa50/0xa50
[  607.597144]  ? lock_downgrade+0x2b0/0x2b0
[  607.597158]  ? __lock_is_held+0x93/0x100
[  607.597187]  netif_rx+0x119/0x190
[  607.597202]  loopback_xmit+0xfd/0x1b0
[  607.597214]  dev_hard_start_xmit+0x127/0x4e0

Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Fixes: b5cdae3291 ("net: Generic XDP")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-08 21:11:00 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Ido Schimmel 25cc72a338 mlxsw: spectrum: Forbid linking to devices that have uppers
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
device in case a port is enslaved to a master netdev such as bridge or
bond.

Since the driver ignores events unrelated to its ports and their
uppers, it's possible to engineer situations in which the device's data
path differs from the kernel's.

One example to such a situation is when a port is enslaved to a bond
that is already enslaved to a bridge. When the bond was enslaved the
driver ignored the event - as the bond wasn't one of its uppers - and
therefore a bridge port instance isn't created in the device.

Until such configurations are supported forbid them by checking that the
upper device doesn't have uppers of its own.

Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 09:59:41 -07:00
Jesper Dangaard Brouer 1e22391e8f net: missing call of trace_napi_poll in busy_poll_stop
Noticed that busy_poll_stop() also invoke the drivers napi->poll()
function pointer, but didn't have an associated call to trace_napi_poll()
like all other call sites.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28 11:22:21 -07:00