Commit graph

4662 commits

Author SHA1 Message Date
Robert Shearman 84a8cbe46a ila: autoload module
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:00:28 -05:00
Daniel Borkmann 6b83d28a55 net: use skb_postpush_rcsum instead of own implementations
Replace individual implementations with the recently introduced
skb_postpush_rcsum() helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 23:43:10 -05:00
Wei Wang e0d8c1b738 ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6
In ipv4,  when  the machine receives a ICMP_FRAG_NEEDED message,  the
connected UDP socket will get EMSGSIZE message on its next read from the
socket.
However, this is not the case for ipv6.
This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to
make it similar to ipv4 behavior. That is when the machine gets an
ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE
message on its next read from the socket.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:46:24 -05:00
Anton Protopopov a97eb33ff2 rtnl: RTM_GETNETCONF: fix wrong return value
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr that can mislead userspace.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:33:46 -05:00
Jiri Benc d13b161c2c gre: clear IFF_TX_SKB_SHARING
ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
as it modifies the skb on xmit.

Also, clean up whitespace in ipgre_tap_setup when we're already touching it.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:43:48 -05:00
Jiri Benc 7f290c9435 iptunnel: scrub packet in iptunnel_pull_header
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:34:54 -05:00
Eric Dumazet 7716682cc5 tcp/dccp: fix another race at listener dismantle
Ilya reported following lockdep splat:

kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0

To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.

We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.

Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:35:51 -05:00
Nikolay Borisov e21145a987 ipv4: namespacify ip_early_demux sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:42:54 -05:00
Paolo Abeni e09acddf87 ip_tunnel: replace dst_cache with generic implementation
The current ip_tunnel cache implementation is prone to a race
that will cause the wrong dst to be cached on cuncurrent dst cache
miss and ip tunnel update via netlink.

Replacing with the generic implementation fix the issue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
Paolo Abeni 607f725f6f net: replace dst_cache ip6_tunnel implementation with the generic one
This also fix a potential race into the existing tunnel code, which
could lead to the wrong dst to be permanenty cached:

CPU1:					CPU2:
  <xmit on ip6_tunnel>
  <cache lookup fails>
  dst = ip6_route_output(...)
					<tunnel params are changed via nl>
					dst_cache_reset() // no effect,
							// the cache is empty
  dst_cache_set() // the wrong dst
	// is permanenty stored
	// into the cache

With the new dst implementation the above race is not possible
since the first cache lookup after dst_cache_reset will fail due
to the timestamp check

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 20:21:48 -05:00
David S. Miller dba36b382e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contain a rather large batch for your net that
includes accumulated bugfixes, they are:

1) Run conntrack cleanup from workqueue process context to avoid hitting
   soft lockup via watchdog for large tables. This is required by the
   IPv6 masquerading extension. From Florian Westphal.

2) Use original skbuff from nfnetlink batch when calling netlink_ack()
   on error since this needs to access the skb->sk pointer.

3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack
   resizing.

4) Fix several problems in nfnetlink batch message header sanitization
   and error handling, from Phil Turnbull.

5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann.

6) Fix wrong signess in return values on nf_tables counter expression,
   from Anton Protopopov.

Due to the NetDev 1.1 organization burden, I had no chance to pass up
this to you any sooner in this release cycle, sorry about that.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 12:56:00 -05:00
Edward Cree 6fa79666e2 net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads
All users now pass false, so we can remove it, and remove the code that
 was conditional upon it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:16 -05:00
Edward Cree d75f1306d9 net: udp: always set up for CHECKSUM_PARTIAL offload
If the dst device doesn't support it, it'll get fixed up later anyway
 by validate_xmit_skb().  Also, this allows us to take advantage of LCO
 to avoid summing the payload multiple times.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:15 -05:00
Edward Cree 179bc67f69 net: local checksum offload for encapsulation
The arithmetic properties of the ones-complement checksum mean that a
 correctly checksummed inner packet, including its checksum, has a ones
 complement sum depending only on whatever value was used to initialise
 the checksum field before checksumming (in the case of TCP and UDP,
 this is the ones complement sum of the pseudo header, complemented).
Consequently, if we are going to offload the inner checksum with
 CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
 packed data not covered by the inner checksum, and the initial value of
 the inner checksum field.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12 05:52:15 -05:00
Johannes Berg 7a02bf892d ipv6: add option to drop unsolicited neighbor advertisements
In certain 802.11 wireless deployments, there will be NA proxies
that use knowledge of the network to correctly answer requests.
To prevent unsolicitd advertisements on the shared medium from
being a problem, on such deployments wireless needs to drop them.

Enable this by providing an option called "drop_unsolicited_na".

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 04:27:36 -05:00
Johannes Berg abbc30436d ipv6: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.

Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 04:27:36 -05:00
Craig Gallek c125e80b88 soreuseport: fast reuseport TCP socket selection
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access.  This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet.  Previously, every socket in the group
needed to be considered before handing off the incoming packet.

This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:15 -05:00
Craig Gallek a583636a83 inet: refactor inet[6]_lookup functions to take skb
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups.  Doing so with a BPF filter will require access to the
skb in question.  This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:14 -05:00
Craig Gallek 496611d7b5 inet: create IPv6-equivalent inet_hash function
In order to support fast lookups for TCP sockets with SO_REUSEPORT,
the function that adds sockets to the listening hash set needs
to be able to check receive address equality.  Since this equality
check is different for IPv4 and IPv6, we will need two different
socket hashing functions.

This patch adds inet6_hash identical to the existing inet_hash function
and updates the appropriate references.  A following patch will
differentiate the two by passing different comparison functions to
__inet_hash.

Additionally, in order to use the IPv6 address equality function from
inet6_hashtables (which is compiled as a built-in object when IPv6 is
enabled) it also needs to be in a built-in object file as well.  This
moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:14 -05:00
Craig Gallek 086c653f58 sock: struct proto hash function may error
In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 03:54:14 -05:00
Eric Dumazet 9cf7490360 tcp: do not drop syn_recv on all icmp reports
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.

This is of course incorrect.

A specific list of ICMP messages should be able to drop a SYN_RECV.

For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 04:15:37 -05:00
Eric Dumazet 44c3d0c1c0 ipv6: fix a lockdep splat
Silence lockdep false positive about rcu_dereference() being
used in the wrong context.

First one should use rcu_dereference_protected() as we own the spinlock.

Second one should be a normal assignation, as no barrier is needed.

Fixes: 18367681a1 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08 10:33:32 -05:00
Nikolay Borisov 12ed8244ed ipv4: Namespaceify tcp syncookies sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
subashab@codeaurora.org 16186a82de ipv6: addrconf: Fix recursive spin lock call
A rcu stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.

This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -

INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
 g=3992 c=3991 q=4471)
<6> Task dump for CPU 1:
<2> kworker/1:0     R  running task    12168    15   2 0x00000002
<2> Workqueue: ipv6_addrconf addrconf_dad_work
<6> Call trace:
<2> [<ffffffc000084da8>] el1_irq+0x68/0xdc
<2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
<2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
<2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
<2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
<2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
<2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
<2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
<2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418
<2> [<ffffffc0000bb40c>] kthread+0xe0/0xec

v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric

Fixes: 7fd2561e4e ("net: ipv6: Add a sysctl to make optimistic
addresses useful candidates")

Cc: Eric Dumazet <edumazet@google.com>
Cc: Erik Kline <ek@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:08:15 -05:00
Linus Torvalds 34229b2774 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "This looks like a lot but it's a mixture of regression fixes as well
  as fixes for longer standing issues.

   1) Fix on-channel cancellation in mac80211, from Johannes Berg.

   2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
      module, from Eric Dumazet.

   3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
      Dumazet.

   4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
      bound, from Craig Gallek.

   5) GRO key comparisons don't take lightweight tunnels into account,
      from Jesse Gross.

   6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
      Dumazet.

   7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
      register them, otherwise the NEWLINK netlink message is missing
      the proper attributes.  From Thadeu Lima de Souza Cascardo.

   8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
      Schimmel

   9) Handle fragments properly in ipv4 easly socket demux, from Eric
      Dumazet.

  10) Don't ignore the ifindex key specifier on ipv6 output route
      lookups, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
  tcp: avoid cwnd undo after receiving ECN
  irda: fix a potential use-after-free in ircomm_param_request
  net: tg3: avoid uninitialized variable warning
  net: nb8800: avoid uninitialized variable warning
  net: vxge: avoid unused function warnings
  net: bgmac: clarify CONFIG_BCMA dependency
  net: hp100: remove unnecessary #ifdefs
  net: davinci_cpdma: use dma_addr_t for DMA address
  ipv6/udp: use sticky pktinfo egress ifindex on connect()
  ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
  netlink: not trim skb for mmaped socket when dump
  vxlan: fix a out of bounds access in __vxlan_find_mac
  net: dsa: mv88e6xxx: fix port VLAN maps
  fib_trie: Fix shift by 32 in fib_table_lookup
  net: moxart: use correct accessors for DMA memory
  ipv4: ipconfig: avoid unused ic_proto_used symbol
  bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
  bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
  bnxt_en: Ring free response from close path should use completion ring
  net_sched: drr: check for NULL pointer in drr_dequeue
  ...
2016-02-01 15:56:08 -08:00
Florian Westphal d93c6258ee netfilter: conntrack: resched in nf_ct_iterate_cleanup
Ulrich reports soft lockup with following (shortened) callchain:

NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
__netif_receive_skb_core+0x6e4/0x774
process_backlog+0x94/0x160
net_rx_action+0x88/0x178
call_do_softirq+0x24/0x3c
do_softirq+0x54/0x6c
__local_bh_enable_ip+0x7c/0xbc
nf_ct_iterate_cleanup+0x11c/0x22c [nf_conntrack]
masq_inet_event+0x20/0x30 [nf_nat_masquerade_ipv6]
atomic_notifier_call_chain+0x1c/0x2c
ipv6_del_addr+0x1bc/0x220 [ipv6]

Problem is that nf_ct_iterate_cleanup can run for a very long time
since it can be interrupted by softirq processing.
Moreover, atomic_notifier_call_chain runs with rcu readlock held.

So lets call cond_resched() in nf_ct_iterate_cleanup and defer
the call to a work queue for the atomic_notifier_call_chain case.

We also need another cond_resched in get_next_corpse, since we
have to deal with iter() always returning false, in that case
get_next_corpse will walk entire conntrack table.

Reported-by: Ulrich Weber <uw@ocedo.com>
Tested-by: Ulrich Weber <uw@ocedo.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-01 00:15:26 +01:00
Paolo Abeni 1cdda91871 ipv6/udp: use sticky pktinfo egress ifindex on connect()
Currently, the egress interface index specified via IPV6_PKTINFO
is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
can be subverted when the user space application calls connect()
before sendmsg().
Fix it by initializing properly flowi6_oif in connect() before
performing the route lookup.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 20:31:26 -08:00
Paolo Abeni 6f21c96a78 ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
The current implementation of ip6_dst_lookup_tail basically
ignore the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit d46a9d678e ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.

This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.

ip6_route_output() becames now a static inline function build on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules need now a GPL license to access the output route lookup
functionality.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 20:31:26 -08:00
Herbert Xu cf80e0e47e tcp: Use ahash
This patch replaces uses of the long obsolete hash interface with
ahash.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
2016-01-27 20:36:18 +08:00
Thadeu Lima de Souza Cascardo 87e57399e9 sit: set rtnl_link_ops before calling register_netdevice
When creating a SIT tunnel with ip tunnel, rtnl_link_ops is not set before
ipip6_tunnel_create is called. When register_netdevice is called, there is
no linkinfo attribute in the NEWLINK message because of that.

Setting rtnl_link_ops before calling register_netdevice fixes that.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-25 10:51:53 -08:00
Thomas Egerer 32b6170ca5 ipv4+ipv6: Make INET*_ESP select CRYPTO_ECHAINIV
The ESP algorithms using CBC mode require echainiv. Hence INET*_ESP have
to select CRYPTO_ECHAINIV in order to work properly. This solves the
issues caused by a misconfiguration as described in [1].
The original approach, patching crypto/Kconfig was turned down by
Herbert Xu [2].

[1] https://lists.strongswan.org/pipermail/users/2015-December/009074.html
[2] http://marc.info/?l=linux-crypto-vger&m=145224655809562&w=2

Signed-off-by: Thomas Egerer <hakke_007@gmx.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-25 10:45:41 -08:00
Vladimir Davydov d55f90bfab net: drop tcp_memcontrol.c
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff.  This doesn't belong to
network subsys.  Let's move it to memcontrol.c.  This also allows us to
reuse generic code for handling legacy memcg files.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Eric Dumazet ed0dfffd7d udp: fix potential infinite loop in SO_REUSEPORT logic
Using a combination of connected and un-connected sockets, Dmitry
was able to trigger soft lockups with his fuzzer.

The problem is that sockets in the SO_REUSEPORT array might have
different scores.

Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and
bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into
the soreuseport array, with a score which is smaller than the score of
first socket sk1 found in hash table (I am speaking of the regular UDP
hash table), if sk1 had the connect() done, giving a +8 to its score.

hash bucket [X] -> sk1 -> sk2 -> NULL

sk1 score = 14  (because it did a connect())
sk2 score = 6

SO_REUSEPORT fast selection is an optimization. If it turns out the
score of the selected socket does not match score of first socket, just
fallback to old SO_REUSEPORT logic instead of trying to be too smart.

Normal SO_REUSEPORT users do not mix different kind of sockets, as this
mechanism is used for load balance traffic.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Craig Gallek <kraigatgoog@gmail.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-19 13:52:25 -05:00
Linus Torvalds 4e5448a31d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A quick set of bug fixes after there initial networking merge:

  1) Netlink multicast group storage allocator only was tested with
     nr_groups equal to 1, make it work for other values too.  From
     Matti Vaittinen.

  2) Check build_skb() return value in macb and hip04_eth drivers, from
     Weidong Wang.

  3) Don't leak x25_asy on x25_asy_open() failure.

  4) More DMA map/unmap fixes in 3c59x from Neil Horman.

  5) Don't clobber IP skb control block during GSO segmentation, from
     Konstantin Khlebnikov.

  6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.

  7) Fix SKB segment utilization estimation in xen-netback, from David
     Vrabel.

  8) Fix lockdep splat in bridge addrlist handling, from Nikolay
     Aleksandrov"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  bgmac: Fix reversed test of build_skb() return value.
  bridge: fix lockdep addr_list_lock false positive splat
  net: smsc: Add support h8300
  xen-netback: free queues after freeing the net device
  xen-netback: delete NAPI instance when queue fails to initialize
  xen-netback: use skb to determine number of required guest Rx requests
  net: sctp: Move sequence start handling into sctp_transport_get_idx()
  ipv6: update skb->csum when CE mark is propagated
  net: phy: turn carrier off on phy attach
  net: macb: clear interrupts when disabling them
  sctp: support to lookup with ep+paddr in transport rhashtable
  net: hns: fixes no syscon error when init mdio
  dts: hisi: fixes no syscon fault when init mdio
  net: preserve IP control block during GSO segmentation
  fsl/fman: Delete one function call "put_device" in dtsec_config()
  hip04_eth: fix missing error handle for build_skb failed
  3c59x: fix another page map/single unmap imbalance
  3c59x: balance page maps and unmaps
  x25_asy: Free x25_asy on x25_asy_open() failure.
  mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
  ...
2016-01-15 13:33:12 -08:00
Eric Dumazet 34ae6a1aa0 ipv6: update skb->csum when CE mark is propagated
When a tunnel decapsulates the outer header, it has to comply
with RFC 6080 and eventually propagate CE mark into inner header.

It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-15 15:07:23 -05:00
Johannes Weiner baac50bbc3 net: tcp_memcontrol: simplify linkage between socket and page counter
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future.  Remove the indirection and link
sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
David S. Miller 9d367eddf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/ethernet/mellanox/mlxsw/spectrum.h
	drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 23:55:43 -05:00
Michal Kubeček 40ba330227 udp: disallow UFO for sockets with SO_NO_CHECK option
Commit acf8dd0a9d ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:40:57 -05:00
Eric Dumazet 3e4006f0b8 ipv6: tcp: add rcu locking in tcp_v6_send_synack()
When first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock()

Fixes: 45f6fad84c ("ipv6: add complete rcu protection around np->opt")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:58:03 -05:00
Lubomir Rintel 3d171f3907 ipv6: always add flag an address that failed DAD with DADFAILED
The userspace needs to know why is the address being removed so that it can
perhaps obtain a new address.

Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:54:27 -05:00
David S. Miller 9b59377b75 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Release nf_tables objects on netns destructions via
   nft_release_afinfo().

2) Destroy basechain and rules on netdevice removal in the new netdev
   family.

3) Get rid of defensive check against removal of inactive objects in
   nf_tables.

4) Pass down netns pointer to our existing nfnetlink callbacks, as well
   as commit() and abort() nfnetlink callbacks.

5) Allow to invert limit expression in nf_tables, so we can throttle
   overlimit traffic.

6) Add packet duplication for the netdev family.

7) Add forward expression for the netdev family.

8) Define pr_fmt() in conntrack helpers.

9) Don't leave nfqueue configuration on inconsistent state in case of
   errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
   him.

10) Skip queue option handling after unbind.

11) Return error on unknown both in nfqueue and nflog command.

12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.

13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
    from Carlos Falgueras.

14) Add support for 64 bit byteordering changes nf_tables, from Florian
    Westphal.

15) Add conntrack byte/packet counter matching support to nf_tables,
    also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 20:53:16 -05:00
Craig Gallek 1134158ba3 soreuseport: pass skb to secondary UDP socket lookup
This socket-lookup path did not pass along the skb in question
in my original BPF-based socket selection patch.  The skb in the
udpN_lib_lookup2 path can be used for BPF-based socket selection just
like it is in the 'traditional' udpN_lib_lookup path.

udpN_lib_lookup2 kicks in when there are greater than 10 sockets in
the same hlist slot.  Coincidentally, I chose 10 sockets per
reuseport group in my functional test, so the lookup2 path was not
excersised. This adds an additional set of tests with 20 sockets.

Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Fixes: 3ca8e40299 ("soreuseport: BPF selection functional test")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:28:04 -05:00
Florian Westphal a72a5e2d34 inet: kill unused skb_free op
The only user was removed in commit
029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations").

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 22:25:57 -05:00
Craig Gallek 538950a1b7 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group.  These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.

This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:49:59 -05:00
Craig Gallek e32ea7e747 soreuseport: fast reuseport UDP socket selection
Include a struct sock_reuseport instance when a UDP socket binds to
a specific address for the first time with the reuseport flag set.
When selecting a socket for an incoming UDP packet, use the information
available in sock_reuseport if present.

This required adding an additional field to the UDP source address
equality function to differentiate between exact and wildcard matches.
The original use case allowed wildcard matches when checking for
existing port uses during bind.  The new use case of adding a socket
to a reuseport group requires exact address matching.

Performance test (using a machine with 2 CPU sockets and a total of
48 cores):  Create reuseport groups of varying size.  Use one socket
from this group per user thread (pinning each thread to a different
core) calling recvmmsg in a tight loop.  Record number of messages
received per second while saturating a 10G link.
  10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
  20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
  40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)

This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:49:58 -05:00
Eric Dumazet 197c949e77 udp: properly support MSG_PEEK with truncated buffers
Backport of this upstream commit into stable kernels :
89c22d8c3b ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.

In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
                                 msg->msg_iov);
returns -EFAULT.

This bug does not happen in upstream kernels since Al Viro did a great
job to replace this into :
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.

For the time being, instead reverting Herbert Xu patch and add back
skb->ip_summed invalid changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.

This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 17:23:36 -05:00
David S. Miller c07f30ad68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-31 18:20:10 -05:00
Pablo Neira Ayuso df05ef874b netfilter: nf_tables: release objects on netns destruction
We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.

The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:34:35 +01:00
Pravin B Shelar 039f50629b ip_tunnel: Move stats update to iptunnel_xmit()
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-25 23:32:23 -05:00
Hannes Frederic Sowa c1a9a291ce ipv6: honor ifindex in case we receive ll addresses in router advertisements
Marc Haber reported we don't honor interface indexes when we receive link
local router addresses in router advertisements. Luckily the non-strict
version of ipv6_chk_addr already does the correct job here, so we can
simply use it to lighten the checks and use those addresses by default
without any configuration change.

Link: <http://permalink.gmane.org/gmane.linux.network/391348>
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Cc: Marc Haber <mh+netdev@zugschlus.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-23 22:03:54 -05:00
Florian Westphal 271c3b9b7b tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().

Make it cope with inet tw sockets and then provide tw sk.

This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.

Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0

0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1

0.210 close(3) = 0

0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46

// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46

// fails without this change -- default route is used
1.301 > R 1:1(0) win 0

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:03:05 -05:00
Florian Westphal e46787f0dd tcp: send_reset: test for non-NULL sk first
tcp_md5_do_lookup requires a full socket, so once we extend
_send_reset() to also accept timewait socket we would have to change

if (!sk && hash_location)

to something like

if ((!sk || !sk_fullsock(sk)) && hash_location) {
  ...
} else {
  (sk && sk_fullsock(sk)) tcp_md5_do_lookup()
}

Switch the two branches: check if we have a socket first, then
fall back to a listener lookup if we saw a md5 option (hash_location).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:03:05 -05:00
WANG Cong 5449a5ca9b addrconf: always initialize sysctl table data
When sysctl performs restrict writes, it allows to write from
a middle position of a sysctl file, which requires us to initialize
the table data before calling proc_dostring() for the write case.

Fixes: 3d1bec9932 ("ipv6: introduce secret_stable to ipv6_devconf")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:00:58 -05:00
David S. Miller 024f35c552 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-12-22

Just one patch to fix dst_entries_init with multiple namespaces.
From Dan Streetman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 16:26:31 -05:00
Andrey Ryabinin e459dfeeb6 ipv6/addrlabel: fix ip6addrlbl_get()
ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded,
ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed,
ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer.

Fix this by inverting ip6addrlbl_hold() check.

Fixes: 2a8cc6c890 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:57:54 -05:00
David S. Miller 59ce9670ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:

1) Two patches to include dependencies in our netfilter userspace
   headers to resolve compilation problems, from Mikko Rapeli.

2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.

3) Remove duplicate include in the netfilter reject infrastructure,
   from Stephen Hemminger.

4) Two patches to simplify the netfilter defragmentation code for IPv6,
   patch from Florian Westphal.

5) Fix root ownership of /proc/net netfilter for unpriviledged net
   namespaces, from Philip Whineray.

6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.

7) Add mangling support to our nf_tables payload expression, from
   Patrick McHardy.

8) Introduce a new netlink-based tracing infrastructure for nf_tables,
   from Florian Westphal.

9) Change setter functions in nfnetlink_log to be void, from
    Rami Rosen.

10) Add netns support to the cttimeout infrastructure.

11) Add cgroup2 support to iptables, from Tejun Heo.

12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.

13) Add support for mangling pkttype in the nf_tables meta expression,
    also from Florian.

BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 15:37:42 -05:00
David Ahern 6dd9a14e92 net: Allow accepted sockets to be bound to l3mdev domain
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.

This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:43:38 -05:00
Bjørn Mork cc9da6cc4f ipv6: addrconf: use stable address generator for ARPHRD_NONE
Add a new address generator mode, using the stable address generator
with an automatically generated secret. This is intended as a default
address generator mode for device types with no EUI64 implementation.
The new generator is used for ARPHRD_NONE interfaces initially, adding
default IPv6 autoconf support to e.g. tun interfaces.

If the addrgenmode is set to 'random', either by default or manually,
and no stable secret is available, then a random secret is used as
input for the stable-privacy address generator.  The secret can be
read and modified like manually configured secrets, using the proc
interface.  Modifying the secret will change the addrgen mode to
'stable-privacy' to indicate that it operates on a known secret.

Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
a known secret is available when the device is created, then the mode
will default to 'stable-privacy' as before.  The mode can be manually
set to 'random' but it will behave exactly like 'stable-privacy' in
this case. The secret will not change.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:41:07 -05:00
Arnd Bergmann 8cb964daeb ila: add NETFILTER dependency
The recently added generic ILA translation facility fails to
build when CONFIG_NETFILTER is disabled:

net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list
net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops'
 static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = {

This adds an explicit Kconfig dependency to avoid that case.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 7f00feaf10 ("ila: Add generic ILA translation facility")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:19:28 -05:00
David S. Miller b3e0d3d7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/geneve.c

Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 22:08:28 -05:00
Hannes Frederic Sowa 715f504b11 ipv6: add IPV6_HDRINCL option for raw sockets
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW.
The SOL_IP/IP_HDRINCL is not available for IPv6 sockets.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:12:28 -05:00
Xin Long 32bc201e19 ipv6: allow routes to be configured with expire values
Add the support for adding expire value to routes,  requested by
Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager
wants it too.

implement it by adding the new RTNETLINK attribute RTA_EXPIRES.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:08:51 -05:00
Hannes Frederic Sowa 9b29c6962b ipv6: automatically enable stable privacy mode if stable_secret set
Bjørn reported that while we switch all interfaces to privacy stable mode
when setting the secret, we don't set this mode for new interfaces. This
does not make sense, so change this behaviour.

Fixes: 622c81d57b ("ipv6: generation of stable privacy addresses for link-local and autoconf")
Reported-by: Bjørn Mork <bjorn@mork.no>
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:37:32 -05:00
Lorenzo Colitti c1e64e298b net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:26:52 -05:00
Tom Herbert 7f00feaf10 ila: Add generic ILA translation facility
This patch implements an ILA tanslation table. This table can be
configured with identifier to locator mappings, and can be be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outoing), and matching locator.  The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).

The table may be used as alternative means to do do ILA tanslations
other than the lw tunnels

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:25:20 -05:00
Tom Herbert 33f11d1614 ila: Create net/ipv6/ila directory
Create ila directory in preparation for supporting other hooks in the
kernel than LWT for doing ILA. This includes:
  - Moving ila.c to ila/ila_lwt.c
  - Splitting out some common functions into ila_common.c

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:25:20 -05:00
Tom Herbert c8cd0989bd net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM
These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.

This patch also:
    - Cleans up can_checksum_protocol
    - Simplifies netdev_intersect_features

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:20 -05:00
Eric Dumazet 5037e9ef94 net: fix IP early demux races
David Wilder reported crashes caused by dst reuse.

<quote David>
  I am seeing a crash on a distro V4.2.3 kernel caused by a double
  release of a dst_entry.  In ipv4_dst_destroy() the call to
  list_empty() finds a poisoned next pointer, indicating the dst_entry
  has already been removed from the list and freed. The crash occurs
  18 to 24 hours into a run of a network stress exerciser.
</quote>

Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.

This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.

It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels

Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 23:52:00 -05:00
Alexander Aring 5241c2d7c5 ipv6: addrconf: drop ieee802154 specific things
This patch removes ARPHRD_IEEE802154 from addrconf handling. In the
earlier days of 802.15.4 6LoWPAN, the interface type was ARPHRD_IEEE802154
which introduced several issues, because 802.15.4 interfaces used the
same type.

Since commit 965e613d29 ("ieee802154: 6lowpan: fix ARPHRD to
ARPHRD_6LOWPAN") we use ARPHRD_6LOWPAN for 6LoWPAN interfaces. This
patch will remove ARPHRD_IEEE802154 which is currently deadcode, because
ARPHRD_IEEE802154 doesn't reach the minimum 1280 MTU of IPv6.

Also we use 6LoWPAN EUI64 specific defines instead using link-layer
constanst from 802.15.4 link-layer header.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 16:10:34 -05:00
Hannes Frederic Sowa 79462ad02e net: add validation for the socket syscall protocol argument
郭永刚 reported that one could simply crash the kernel as root by
using a simple program:

	int socket_fd;
	struct sockaddr_in addr;
	addr.sin_port = 0;
	addr.sin_addr.s_addr = INADDR_ANY;
	addr.sin_family = 10;

	socket_fd = socket(10,3,0x40000000);
	connect(socket_fd , &addr,16);

AF_INET, AF_INET6 sockets actually only support 8-bit protocol
identifiers. inet_sock's skc_protocol field thus is sized accordingly,
thus larger protocol identifiers simply cut off the higher bits and
store a zero in the protocol fields.

This could lead to e.g. NULL function pointer because as a result of
the cut off inet_num is zero and we call down to inet_autobind, which
is NULL for raw sockets.

kernel: Call Trace:
kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
kernel:  [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
kernel:  [<ffffffff81645e0e>] SyS_connect+0xe/0x10
kernel:  [<ffffffff81779515>] tracesys_phase2+0x84/0x89

I found no particular commit which introduced this problem.

CVE: CVE-2015-8543
Cc: Cong Wang <cwang@twopensource.com>
Reported-by: 郭永刚 <guoyonggang@360.cn>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 16:09:30 -05:00
Pablo Neira Ayuso a4ec80082c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflict between commit 264640fc2c ("ipv6: distinguish frag
queues by device for multicast and link-local packets") from the net
tree and commit 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations") from the nf-next tree.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Conflicts:
	net/ipv6/netfilter/nf_conntrack_reasm.c
2015-12-14 20:31:16 +01:00
David S. Miller 9e5be5bd43 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree,
specifically for nf_tables and nfnetlink_queue, they are:

1) Avoid a compilation warning in nfnetlink_queue that was introduced
   in the previous merge window with the simplification of the conntrack
   integration, from Arnd Bergmann.

2) nfnetlink_queue is leaking the pernet subsystem registration from
   a failure path, patch from Nikolay Borisov.

3) Pass down netns pointer to batch callback in nfnetlink, this is the
   largest patch and it is not a bugfix but it is a dependency to
   resolve a splat in the correct way.

4) Fix a splat due to incorrect socket memory accounting with nfnetlink
   skbuff clones.

5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6.

6) Traverse the nftables commit list in reverse order from the commit
   path, otherwise we crash when the user applies an incremental update
   via 'nft -f' that deletes an object that was just introduced in this
   batch, from Xin Long.

Regarding the compilation warning fix, many people have sent us (and
keep sending us) patches to address this, that's why I'm including this
batch even if this is not critical.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 11:09:01 -05:00
Pablo Neira Ayuso d3340b79ec netfilter: nf_dup: add missing dependencies with NF_CONNTRACK
CONFIG_NF_CONNTRACK=m
CONFIG_NF_DUP_IPV4=y

results in:

   net/built-in.o: In function `nf_dup_ipv4':
>> (.text+0xd434f): undefined reference to `nf_conntrack_untracked'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-10 18:17:06 +01:00
Florian Westphal e97ac12859 netfilter: ipv6: nf_defrag: fix NULL deref panic
Valdis reports NULL deref in nf_ct_frag6_gather.
Problem is bogus use of skb_queue_walk() -- we miss first skb in the list
since we start with head->next instead of head.

In case the element we're looking for was head->next we won't find
a result and then trip over NULL iter.

(defrag uses plain NULL-terminated list rather than one terminated by
 head-of-list-pointer, which is what skb_queue_walk expects).

Fixes: 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations")
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-09 14:26:31 +01:00
Bjørn Mork 9a1ec4612c ipv6: keep existing flags when setting IFA_F_OPTIMISTIC
Commit 64236f3f3d ("ipv6: introduce IFA_F_STABLE_PRIVACY flag")
failed to update the setting of the IFA_F_OPTIMISTIC flag, causing
the IFA_F_STABLE_PRIVACY flag to be lost if IFA_F_OPTIMISTIC is set.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: 64236f3f3d ("ipv6: introduce IFA_F_STABLE_PRIVACY flag")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05 18:09:46 -05:00
Andrew Lunn 3ef0952ca8 ipv6: Only act upon NETDEV_*_TYPE_CHANGE if we have ipv6 addresses
An interface changing type may not have IPv6 addresses. Don't
call the address configuration type change in this case.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05 17:41:42 -05:00
Nicolas Dichtel 6a61d4dbf4 gre6: allow to update all parameters via rtnl
Parameters were updated only if the kernel was unable to find the tunnel
with the new parameters, ie only if core pamareters were updated (keys,
addr, link, type).
Now it's possible to update ttl, hoplimit, flowinfo and flags.

Fixes: c12b395a46 ("gre: Support GRE over IPv6")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-04 16:52:06 -05:00
David S. Miller f188b951f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/renesas/ravb_main.c
	kernel/bpf/syscall.c
	net/ipv4/ipmr.c

All three conflicts were cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 21:09:12 -05:00
Phil Sutter d6df198d92 net: ipv6: restrict hop_limit sysctl setting to range [1; 255]
Setting a value bigger than 255 resulted in using only the lower eight
bits of that value as it is assigned to the u8 header field. To avoid
this unexpected result, reject such values.

Setting a value of zero is technically possible, but hosts receiving
such a packet have to treat it like hop_limit was set to one, according
to RFC2460. Therefore I don't see a use-case for that.

Setting a route's hop_limit to zero in iproute2 means to use the sysctl
default, which is not the case here: Setting e.g.
net.conf.eth0.hop_limit=0 will not make the kernel use
net.conf.all.hop_limit for outgoing packets on eth0. To avoid these
kinds of confusion, reject zero.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 14:42:10 -05:00
Eric Dumazet 6bd4f355df ipv6: kill sk_dst_lock
While testing the np->opt RCU conversion, I found that UDP/IPv6 was
using a mixture of xchg() and sk_dst_lock to protect concurrent changes
to sk->sk_dst_cache, leading to possible corruptions and crashes.

ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
way to fix the mess is to remove sk_dst_lock completely, as we did for
IPv4.

__ip6_dst_store() and ip6_dst_store() share same implementation.

sk_setup_caps() being called with socket lock being held or not,
we have to use sk_dst_set() instead of __sk_dst_set()

Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
in ip6_dst_store() before the sk_setup_caps(sk, dst) call.

This is because ip6_dst_store() can be called from process context,
without any lock held.

As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
from another cpu doing a concurrent ip6_dst_store()

Doing the dst dereference before doing the install is needed to make
sure no use after free would trigger.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:32:06 -05:00
Eric Dumazet 7450aaf61f tcp: suppress too verbose messages in tcp_send_ack()
If tcp_send_ack() can not allocate skb, we properly handle this
and setup a timer to try later.

Use __GFP_NOWARN to avoid polluting syslog in the case host is
under memory pressure, so that pertinent messages are not lost under
a flood of useless information.

sk_gfp_atomic() can use its gfp_mask argument (all callers currently
were using GFP_ATOMIC before this patch)

We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this
function now takes into account its second argument (gfp_mask)

Note that when tcp_transmit_skb() is called with clone_it set to false,
we do not attempt memory allocations, so can pass a 0 gfp_mask, which
most compilers can emit faster than a non zero or constant value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02 23:44:32 -05:00
Eric Dumazet 45f6fad84c ipv6: add complete rcu protection around np->opt
This patch addresses multiple problems :

UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.

Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())

This patch adds full RCU protection to np->opt

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02 23:37:16 -05:00
Nicolas Dichtel 304d888b29 Revert "ipv6: ndisc: inherit metadata dst when creating ndisc requests"
This reverts commit ab450605b3.

In IPv6, we cannot inherit the dst of the original dst. ndisc packets
are IPv6 packets and may take another route than the original packet.

This patch breaks the following scenario: a packet comes from eth0 and
is forwarded through vxlan1. The encapsulated packet triggers an NS
which cannot be sent because of the wrong route.

CC: Jiri Benc <jbenc@redhat.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 15:07:59 -05:00
Nikolay Aleksandrov dfc3b0e891 net: remove unnecessary mroute.h includes
It looks like many files are including mroute.h unnecessarily, so remove
the include. Most importantly remove it from ipv6.

CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 15:26:21 -05:00
Nikolay Aleksandrov fbdd29bfd2 net: ipmr, ip6mr: fix vif/tunnel failure race condition
Since (at least) commit b17a7c179d ("[NET]: Do sysfs registration as
part of register_netdevice."), netdev_run_todo() deals only with
unregistration, so we don't need to do the rtnl_unlock/lock cycle to
finish registration when failing pimreg or dvmrp device creation. In
fact that opens a race condition where someone can delete the device
while rtnl is unlocked because it's fully registered. The problem gets
worse when netlink support is introduced as there are more points of entry
that can cause it and it also makes reusing that code correctly impossible.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 17:15:56 -05:00
Michal Kubeček 264640fc2c ipv6: distinguish frag queues by device for multicast and link-local packets
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.

To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.

Note: similar patch has been already submitted by Yoshifuji Hideaki in

  http://patchwork.ozlabs.org/patch/220979/

but got lost and forgotten for some reason.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 16:45:47 -05:00
Florian Westphal daaa7d647f netfilter: ipv6: avoid nf_iterate recursion
The previous patch changed nf_ct_frag6_gather() to morph reassembled skb
with the previous one.

This means that the return value is always NULL or the skb argument.
So change it to an err value.

Instead of invoking NF_HOOK recursively with threshold to skip already-called hooks
we can now just return NF_ACCEPT to move on to the next hook except for
-EINPROGRESS (which means skb has been queued for reassembly), in which case we
return NF_STOLEN.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-23 17:54:45 +01:00
Florian Westphal 029f7f3b87 netfilter: ipv6: nf_defrag: avoid/free clone operations
commit 6aafeef03b
("netfilter: push reasm skb through instead of original frag skbs")
changed ipv6 defrag to not use the original skbs anymore.

So rather than keeping the original skbs around just to discard them
afterwards just use the original skbs directly for the fraglist of
the newly assembled skb and remove the extra clone/free operations.

The skb that completes the fragment queue is morphed into a the
reassembled one instead, just like ipv4 defrag.

openvswitch doesn't need any additional skb_morph magic anymore to deal
with this situation so just remove that.

A followup patch can then also remove the NF_HOOK (re)invocation in
the ipv6 netfilter defrag hook.

Cc: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-23 17:54:44 +01:00
stephen hemminger a18fd970ce netfilter: remove duplicate include
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-23 17:54:43 +01:00
Nikolay Aleksandrov 4c6980462f net: ip6mr: fix static mfc/dev leaks on table destruction
Similar to ipv4, when destroying an mrt table the static mfc entries and
the static devices are kept, which leads to devices that can never be
destroyed (because of refcnt taken) and leaked memory. Make sure that
everything is cleaned up on netns destruction.

Fixes: 8229efdaef ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
CC: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-22 20:44:47 -05:00
David Ahern b811580d91 net: IPv6 fib lookup tracepoint
Add tracepoint to show fib6 table lookups and result.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-22 11:54:10 -05:00
Paolo Abeni 206b49500d net/ip6_tunnel: fix dst leak
the commit cdf3464e6c ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel")
introduced percpu storage for ip6_tunnel dst cache, but while clearing
such cache it used raw_cpu_ptr to walk the per cpu entries, so cached
dst on non current cpu are not actually reset.

This patch replaces raw_cpu_ptr with per_cpu_ptr, properly cleaning
such storage.

Fixes: cdf3464e6c ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:25:01 -05:00
Neil Horman 41033f029e snmp: Remove duplicate OUTMCAST stat increment
the OUTMCAST stat is double incremented, getting bumped once in the mcast code
itself, and again in the common ip output path.  Remove the mcast bump, as its
not needed

Validated by the reporter, with good results

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Claus Jensen <claus.jensen@microsemi.com>
CC: Claus Jensen <claus.jensen@microsemi.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 16:36:32 -05:00
Eric Dumazet 00fd38d938 tcp: ensure proper barriers in lockless contexts
Some functions access TCP sockets without holding a lock and
might output non consistent data, depending on compiler and or
architecture.

tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...

Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:36:38 -05:00
Martin KaFai Lau 02bcf4e082 ipv6: Check rt->dst.from for the DST_NOCACHE route
All DST_NOCACHE rt6_info used to have rt->dst.from set to
its parent.

After commit 8e3d5be736 ("ipv6: Avoid double dst_free"),
DST_NOCACHE is also set to rt6_info which does not have
a parent (i.e. rt->dst.from is NULL).

This patch catches the rt->dst.from == NULL case.

Fixes: 8e3d5be736 ("ipv6: Avoid double dst_free")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 17:12:37 -05:00
Martin KaFai Lau 5973fb1e24 ipv6: Check expire on DST_NOCACHE route
Since the expires of the DST_NOCACHE rt can be set during
the ip6_rt_update_pmtu(), we also need to consider the expires
value when doing ip6_dst_check().

This patches creates __rt6_check_expired() to only
check the expire value (if one exists) of the current rt.

In rt6_dst_from_check(), it adds __rt6_check_expired() as
one of the condition check.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 17:12:37 -05:00
Martin KaFai Lau 0d3f6d297b ipv6: Avoid creating RTF_CACHE from a rt that is not managed by fib6 tree
The original bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1272571

The setup has a IPv4 GRE tunnel running in a IPSec.  The bug
happens when ndisc starts sending router solicitation at the gre
interface.  The simplified oops stack is like:

__lock_acquire+0x1b2/0x1c30
lock_acquire+0xb9/0x140
_raw_write_lock_bh+0x3f/0x50
__ip6_ins_rt+0x2e/0x60
ip6_ins_rt+0x49/0x50
~~~~~~~~
__ip6_rt_update_pmtu.part.54+0x145/0x250
ip6_rt_update_pmtu+0x2e/0x40
~~~~~~~~
ip_tunnel_xmit+0x1f1/0xf40
__gre_xmit+0x7a/0x90
ipgre_xmit+0x15a/0x220
dev_hard_start_xmit+0x2bd/0x480
__dev_queue_xmit+0x696/0x730
dev_queue_xmit+0x10/0x20
neigh_direct_output+0x11/0x20
ip6_finish_output2+0x21f/0x770
ip6_finish_output+0xa7/0x1d0
ip6_output+0x56/0x190
~~~~~~~~
ndisc_send_skb+0x1d9/0x400
ndisc_send_rs+0x88/0xc0
~~~~~~~~

The rt passed to ip6_rt_update_pmtu() is created by
icmp6_dst_alloc() and it is not managed by the fib6 tree,
so its rt6i_table == NULL.  When __ip6_rt_update_pmtu() creates
a RTF_CACHE clone, the newly created clone also has rt6i_table == NULL
and it causes the ip6_ins_rt() oops.

During pmtu update, we only want to create a RTF_CACHE clone
from a rt which is currently managed (or owned) by the
fib6 tree.  It means either rt->rt6i_node != NULL or
rt is a RTF_PCPU clone.

It is worth to note that rt6i_table may not be NULL even it is
not (yet) managed by the fib6 tree (e.g. addrconf_dst_alloc()).
Hence, rt6i_node is a better check instead of rt6i_table.

Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu>
Cc: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 17:12:37 -05:00
Eric Dumazet 49a496c97d tcp: use correct req pointer in tcp_move_syn() calls
I mistakenly took wrong request sock pointer when calling tcp_move_syn()

@req_unhash is either a copy of @req, or a NULL value for
FastOpen connexions (as we do not expect to unhash the temporary
request sock from ehash table)

Fixes: 805c4bc057 ("tcp: fix req->saved_syn race")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05 15:57:51 -05:00
Eric Dumazet 805c4bc057 tcp: fix req->saved_syn race
For the reasons explained in commit ce1050089c ("tcp/dccp: fix
ireq->pktopts race"), we need to make sure we do not access
req->saved_syn unless we own the request sock.

This fixes races for listeners using TCP_SAVE_SYN option.

Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05 14:36:09 -05:00
Sabrina Dubroca 2a189f9e57 ipv6: clean up dev_snmp6 proc entry when we fail to initialize inet6_dev
In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up
the dev_snmp6 entry that we have already registered for this device.
Call snmp6_unregister_dev in this case.

Fixes: a317a2f19d ("ipv6: fail early when creating netdev named all or default")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04 23:49:48 -05:00