redonkable/alistair23-linux

Author	SHA1	Message	Date
Nicolas Dichtel	29c994e361	netconf: add a notif when settings are created All changes are notified, but the initial state was missing. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 15:18:08 -07:00
Nicolas Dichtel	d26c638c16	ipv6: add missing netconf notif when 'all' is updated The 'default' value was not advertised. Fixes: `f3a1bfb11c` ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 15:18:08 -07:00
stephen hemminger	6501f34ff7	ila: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:01 -07:00
Roopa Prabhu	14972cbd34	net: lwtunnel: Handle fragmentation Today mpls iptunnel lwtunnel_output redirect expects the tunnel output function to handle fragmentation. This is ok but can be avoided if we did not do the mpls output redirect too early. ie we could wait until ip fragmentation is done and then call mpls output for each ip fragment. To make this work we will need, 1) the lwtunnel state to carry encap headroom 2) and do the redirect to the encap output handler on the ip fragment (essentially do the output redirect after fragmentation) This patch adds tunnel headroom in lwtstate to make sure we account for tunnel data in mtu calculations during fragmentation and adds new xmit redirect handler to redirect to lwtunnel xmit func after ip fragmentation. This includes IPV6 and some mtu fixes and testing from David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 22:27:18 -07:00
Gao Feng	779994fa36	netfilter: log: Check param to avoid overflow in nf_log_set The nf_log_set is an interface function, so it should do the strict sanity check of parameters. Convert the return value of nf_log_set as int instead of void. When the pf is invalid, return -EOPNOTSUPP. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:52:32 +02:00
David S. Miller	6abdd5f593	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 00:54:02 -04:00
Eric Dumazet	c9c3321257	tcp: add tcp_add_backlog() When TCP operates in lossy environments (between 1 and 10 % packet losses), many SACK blocks can be exchanged, and I noticed we could drop them on busy senders, if these SACK blocks have to be queued into the socket backlog. While the main cause is the poor performance of RACK/SACK processing, we can try to avoid these drops of valuable information that can lead to spurious timeouts and retransmits. Cause of the drops is the skb->truesize overestimation caused by : - drivers allocating ~2048 (or more) bytes as a fragment to hold an Ethernet frame. - various pskb_may_pull() calls bringing the headers into skb->head might have pulled all the frame content, but skb->truesize could not be lowered, as the stack has no idea of each fragment truesize. The backlog drops are also more visible on bidirectional flows, since their sk_rmem_alloc can be quite big. Let's add some room for the backlog, as only the socket owner can selectively take action to lower memory needs, like collapsing receive queues or partial ofo pruning. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-29 00:20:24 -04:00
Tom Herbert	3203558589	tcp: Set read_sock and peek_len proto_ops In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to tcp_peek_len (which is just a stub function that calls tcp_inq). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-28 23:32:41 -04:00
Eric Dumazet	72145a68e4	tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter Adds SNMP counter for drops caused by MD5 mismatches. The current syslog might help, but a counter is more precise and helps monitoring. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-25 16:43:11 -07:00
Eric Dumazet	e65c332de8	tcp: md5: increment sk_drops on syn_recv state TCP MD5 mismatches do increment sk_drops counter in all states but SYN_RECV. This is very unlikely to happen in the real world, but worth adding to help diagnostics. We increase the parent (listener) sk_drops. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-25 16:43:11 -07:00
Liping Zhang	89e1f6d2b9	netfilter: nft_reject: restrict to INPUT/FORWARD/OUTPUT After I add the nft rule "nft add rule filter prerouting reject with tcp reset", kernel panic happened on my system: NULL pointer dereference at ... IP: [<ffffffff81b9db2f>] nf_send_reset+0xaf/0x400 Call Trace: [<ffffffff81b9da80>] ? nf_reject_ip_tcphdr_get+0x160/0x160 [<ffffffffa0928061>] nft_reject_ipv4_eval+0x61/0xb0 [nft_reject_ipv4] [<ffffffffa08e836a>] nft_do_chain+0x1fa/0x890 [nf_tables] [<ffffffffa08e8170>] ? __nft_trace_packet+0x170/0x170 [nf_tables] [<ffffffffa06e0900>] ? nf_ct_invert_tuple+0xb0/0xc0 [nf_conntrack] [<ffffffffa07224d4>] ? nf_nat_setup_info+0x5d4/0x650 [nf_nat] [...] Because in the PREROUTING chain, routing information is not exist, then we will dereference the NULL pointer and oops happen. So we restrict reject expression to INPUT, FORWARD and OUTPUT chain. This is consistent with iptables REJECT target. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-25 12:55:34 +02:00
Eric Dumazet	391bb6be65	ipv6: tcp: get rid of tcp_v6_clear_sk() Now RCU lookups of IPv6 TCP sockets no longer dereference pinet6, we do not need tcp_v6_clear_sk() anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:25:29 -07:00
Eric Dumazet	4cac820466	udp: get rid of sk_prot_clear_portaddr_nulls() Since we no longer use SLAB_DESTROY_BY_RCU for UDP, we do not need sk_prot_clear_portaddr_nulls() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:25:29 -07:00
Eric Dumazet	6a6ad2a4e5	ipv6: udp: remove udp_v6_clear_sk() Now RCU lookups of ipv6 udp sockets no longer dereference pinet6 field, we can get rid of udp_v6_clear_sk() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:23:50 -07:00
David Ahern	5d77dca828	net: diag: support SOCK_DESTROY for UDP sockets This implements SOCK_DESTROY for UDP sockets similar to what was done for TCP with commit `c1e64e298b` ("net: diag: Support destroying TCP sockets.") A process with a UDP socket targeted for destroy is awakened and recvmsg fails with ECONNABORTED. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 23:12:27 -07:00
Eric Dumazet	75d855a5e9	udp: get rid of SLAB_DESTROY_BY_RCU allocations After commit `ca065d0cf8` ("udp: no longer use SLAB_DESTROY_BY_RCU") we do not need this special allocation mode anymore, even if it is harmless. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 17:46:17 -07:00
Eric Dumazet	20a2b49fc5	tcp: properly scale window in tcp_v[46]_reqsk_send_ack() When sending an ack in SYN_RECV state, we must scale the offered window if wscale option was negotiated and accepted. Tested: Following packetdrill test demonstrates the issue : 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // Establish a connection. +0 < S 0:0(0) win 20000 <mss 1000,sackOK,wscale 7, nop, TS val 100 ecr 0> +0 > S. 0:0(0) ack 1 win 28960 <mss 1460,sackOK, TS val 100 ecr 100, nop, wscale 7> +0 < . 1:11(10) ack 1 win 156 <nop,nop,TS val 99 ecr 100> // check that window is properly scaled ! +0 > . 1:1(0) ack 1 win 226 <nop,nop,TS val 200 ecr 100> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-23 16:55:49 -07:00
Mike Manning	85b51b1211	net: ipv6: Remove addresses for failures with strict DAD If DAD fails with accept_dad set to 2, global addresses and host routes are incorrectly left in place. Even though disable_ipv6 is set, contrary to documentation, the addresses are not dynamically deleted from the interface. It is only on a subsequent link down/up that these are removed. The fix is not only to set the disable_ipv6 flag, but also to call addrconf_ifdown(), which is the action to carry out when disabling IPv6. This results in the addresses and routes being deleted immediately. The DAD failure for the LL addr is determined as before via netlink, or by the absence of the LL addr (which also previously would have had to be checked for in case of an intervening link down and up). As the call to addrconf_ifdown() requires an rtnl lock, the logic to disable IPv6 when DAD fails is moved to addrconf_dad_work(). Previous behavior: root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2 net.ipv6.conf.eth3.accept_dad = 2 root@vm1:/# ip -6 addr add 2000::10/64 dev eth3 root@vm1:/# ip link set up eth3 root@vm1:/# ip -6 addr show dev eth3 5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000 inet6 2000::10/64 scope global valid_lft forever preferred_lft forever inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed valid_lft forever preferred_lft forever root@vm1:/# ip -6 route show dev eth3 2000::/64 proto kernel metric 256 fe80::/64 proto kernel metric 256 root@vm1:/# ip link set down eth3 root@vm1:/# ip link set up eth3 root@vm1:/# ip -6 addr show dev eth3 root@vm1:/# ip -6 route show dev eth3 root@vm1:/# New behavior: root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2 net.ipv6.conf.eth3.accept_dad = 2 root@vm1:/# ip -6 addr add 2000::10/64 dev eth3 root@vm1:/# ip link set up eth3 root@vm1:/# ip -6 addr show dev eth3 root@vm1:/# ip -6 route show dev eth3 root@vm1:/# Signed-off-by: Mike Manning <mmanning@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-22 16:59:37 -07:00
David Ahern	11d7a0bb95	xfrm: Only add l3mdev oif to dst lookups Subash reported that commit `42a7b32b73` ("xfrm: Add oif to dst lookups") broke a wifi use case that uses fib rules and xfrms. The intent of `42a7b32b73` was driven by VRFs with IPsec. As a compromise relax the use of oif in xfrm lookups to L3 master devices only (ie., oif is either an L3 master device or is enslaved to a master device). Fixes: `42a7b32b73` ("xfrm: Add oif to dst lookups") Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-08-22 06:33:32 +02:00
David S. Miller	60747ef4d1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor overlapping changes for both merge conflicts. Resolution work done by Stephen Rothwell was used as a reference. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-18 01:17:32 -04:00
Tom Herbert	9b73896a81	kcm: Use stream parser Adapt KCM to use the stream parser. This mostly involves removing the RX handling and setting up the strparser using the interface. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-17 19:36:23 -04:00
Simon Horman	3d7b332092	gre: set inner_protocol on xmit Ensure that the inner_protocol is set on transmit so that GSO segmentation, which relies on that field, works correctly. This is achieved by setting the inner_protocol in gre_build_header rather than each caller of that function. It ensures that the inner_protocol is set when gre_fb_xmit() is used to transmit GRE which was not previously the case. I have observed this is not the case when OvS transmits GRE using lwtunnel metadata (which it always does). Fixes: `3872035241` ("gre: Use inner_proto to obtain inner header protocol") Cc: Pravin Shelar <pshelar@ovn.org> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-15 13:37:12 -07:00
Lorenzo Colitti	5e45789698	net: ipv6: Fix ping to link-local addresses. ping_v6_sendmsg does not set flowi6_oif in response to sin6_scope_id or sk_bound_dev_if, so it is not possible to use these APIs to ping an IPv6 address on a different interface. Instead, it sets flowi6_iif, which is incorrect but harmless. Stop setting flowi6_iif, and support various ways of setting oif in the same priority order used by udpv6_sendmsg. Tested: https://android-review.googlesource.com/#/c/254470/ Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-15 12:19:09 -07:00
Mike Manning	bc561632dd	net: ipv6: Do not keep IPv6 addresses when IPv6 is disabled If IPv6 is disabled when the option is set to keep IPv6 addresses on link down, userspace is unaware of this as there is no such indication via netlink. The solution is to remove the IPv6 addresses in this case, which results in netlink messages indicating removal of addresses in the usual manner. This fix also makes the behavior consistent with the case of having IPv6 disabled first, which stops IPv6 addresses from being added. Fixes: `f1705ec197` ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: Mike Manning <mmanning@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-13 15:14:00 -07:00
Colin Ian King	b4c0e0c61f	calipso: fix resource leak on calipso_genopt failure Currently, if calipso_genopt fails then the error exit path does not free the ipv6_opt_hdr new causing a memory leak. Fix this by kfree'ing new on the error exit path. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-13 14:56:17 -07:00
Wei Yongjun	03ff497934	sit: make function ipip6_valid_ip_proto() static Fixes the following sparse warning: net/ipv6/sit.c:1129:6: warning: symbol 'ipip6_valid_ip_proto' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-12 21:52:18 -07:00
Alexey Kodanev	1625f45299	net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key Running LTP 'icmp-uni-basic.sh -6 -p ipcomp -m tunnel' test over openvswitch + veth can trigger kernel panic: BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0 IP: [<ffffffff8169d1d2>] xfrm_input+0x82/0x750 ... [<ffffffff816d472e>] xfrm6_rcv_spi+0x1e/0x20 [<ffffffffa082c3c2>] xfrm6_tunnel_rcv+0x42/0x50 [xfrm6_tunnel] [<ffffffffa082727e>] tunnel6_rcv+0x3e/0x8c [tunnel6] [<ffffffff8169f365>] ip6_input_finish+0xd5/0x430 [<ffffffff8169fc53>] ip6_input+0x33/0x90 [<ffffffff8169f1d5>] ip6_rcv_finish+0xa5/0xb0 ... It seems that tunnel.ip6 can have garbage values and also dereferenced without a proper check, only tunnel.ip4 is being verified. Fix it by adding one more if block for AF_INET6 and initialize tunnel.ip6 with NULL inside xfrm6_rcv_spi() (which is similar to xfrm4_rcv_spi()). Fixes: `049f8e2` ("xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-08-11 13:15:57 +02:00
Jiri Kosina	e87a8f24c9	net: resolve symbol conflicts with generic hashtable.h This is a preparatory patch for converting qdisc linked list into a hashtable. As we'll need to include hashtable.h in netdevice.h, we first have to make sure that this will not introduce symbol conflicts for any of the netdevice.h users. Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-10 17:18:52 -07:00
Hangbin Liu	a052517a8f	net/multicast: should not send source list records when have filter mode change Based on RFC3376 5.1 and RFC3810 6.1 If the per-interface listening change that triggers the new report is a filter mode change, then the next [Robustness Variable] State Change Reports will include a Filter Mode Change Record. This applies even if any number of source list changes occur in that period. Old State New State State Change Record Sent --------- --------- ------------------------ INCLUDE (A) EXCLUDE (B) TO_EX (B) EXCLUDE (A) INCLUDE (B) TO_IN (B) So we should not send source-list change if there is a filter-mode change. Here are two scenarios: 1. Group deleted and filter mode is EXCLUDE, which means we need send a TO_IN { }. 2. Not group deleted, but has pcm->crcount, which means we need send a normal filter-mode-change. At the same time, if the type is ALLOW or BLOCK, and have psf->sf_crcount, we stop add records and decrease sf_crcount directly Reference: https://www.ietf.org/mail-archive/web/magma/current/msg01274.html Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-08 16:04:39 -07:00
Wei Yongjun	c882219ae4	net: ipv6: use list_move instead of list_del/list_add Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-30 20:41:59 -07:00
Linus Torvalds	7a1e8b80fb	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Highlights: - TPM core and driver updates/fixes - IPv6 security labeling (CALIPSO) - Lots of Apparmor fixes - Seccomp: remove 2-phase API, close hole where ptrace can change syscall #" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits) apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family) tpm: Factor out common startup code tpm: use devm_add_action_or_reset tpm2_i2c_nuvoton: add irq validity check tpm: read burstcount from TPM_STS in one 32-bit transaction tpm: fix byte-order for the value read by tpm2_get_tpm_pt tpm_tis_core: convert max timeouts from msec to jiffies apparmor: fix arg_size computation for when setprocattr is null terminated apparmor: fix oops, validate buffer size in apparmor_setprocattr() apparmor: do not expose kernel stack apparmor: fix module parameters can be changed after policy is locked apparmor: fix oops in profile_unpack() when policy_db is not present apparmor: don't check for vmalloc_addr if kvzalloc() failed apparmor: add missing id bounds check on dfa verification apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task apparmor: use list_next_entry instead of list_entry_next apparmor: fix refcount race when finding a child profile apparmor: fix ref count leak when profile sha1 hash is read apparmor: check that xindex is in trans_table bounds ...	2016-07-29 17:38:46 -07:00
Nikolay Aleksandrov	90b5ca1766	net: ipmr/ip6mr: update lastuse on entry change Currently lastuse is updated on entry creation and cache hit, but it should also be updated on entry change. Since both on add and update the ttl array is updated we can simply update the lastuse in ipmr_update_thresholds. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Donald Sharp <sharpd@cumulusnetworks.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-26 15:18:31 -07:00
Daniel Borkmann	ba66bbe548	udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skb After `a612769774` ("udp: prevent bugcheck if filter truncates packet too much"), there followed various other fixes for similar cases such as `f4979fcea7` ("rose: limit sk_filter trim to payload"). Latter introduced a new helper sk_filter_trim_cap(), where we can pass the trim limit directly to the socket filter handling. Make use of it here as well with sizeof(struct udphdr) as lower cap limit and drop the extra skb->len test in UDP's input path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-25 21:40:33 -07:00
Mike Manning	ea06f71764	net: ipv6: Always leave anycast and multicast groups on link down Default kernel behavior is to delete IPv6 addresses on link down, which entails deletion of the multicast and the subnet-router anycast addresses. These deletions do not happen with sysctl setting to keep global IPv6 addresses on link down, so every link down/up causes an increment of the anycast and multicast refcounts. These bogus refcounts may stop these addrs from being removed on subsequent calls to delete them. The solution is to leave the groups for the multicast and subnet anycast on link down for the callflow when global IPv6 addresses are kept. Fixes: `f1705ec197` ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: Mike Manning <mmanning@brocade.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-25 11:15:58 -07:00
David S. Miller	c42d7121fb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next, they are: 1) Count pre-established connections as active in "least connection" schedulers such that pre-established connections to avoid overloading backend servers on peak demands, from Michal Kubecek via Simon Horman. 2) Address a race condition when resizing the conntrack table by caching the bucket size when fulling iterating over the hashtable in these three possible scenarios: 1) dump via /proc/net/nf_conntrack, 2) unlinking userspace helper and 3) unlinking custom conntrack timeout. From Liping Zhang. 3) Revisit early_drop() path to perform lockless traversal on conntrack eviction under stress, use del_timer() as synchronization point to avoid two CPUs evicting the same entry, from Florian Westphal. 4) Move NAT hlist_head to nf_conn object, this simplifies the existing NAT extension and it doesn't increase size since recent patches to align nf_conn, from Florian. 5) Use rhashtable for the by-source NAT hashtable, also from Florian. 6) Don't allow --physdev-is-out from OUTPUT chain, just like --physdev-out is not either, from Hangbin Liu. 7) Automagically set on nf_conntrack counters if the user tries to match ct bytes/packets from nftables, from Liping Zhang. 8) Remove possible_net_t fields in nf_tables set objects since we just simply pass the net pointer to the backend set type implementations. 9) Fix possible off-by-one in h323, from Toby DiPasquale. 10) early_drop() may be called from ctnetlink patch, so we must hold rcu read size lock from them too, this amends Florian's patch #3 coming in this batch, from Liping Zhang. 11) Use binary search to validate jump offset in x_tables, this addresses the O(n!) validation that was introduced recently resolve security issues with unpriviledge namespaces, from Florian. 12) Fix reference leak to connlabel in error path of nft_ct, from Zhang. 13) Three updates for nft_log: Fix log prefix leak in error path. Bail out on loglevel larger than debug in nft_log and set on the new NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang. 14) Allow to filter rule dumps in nf_tables based on table and chain names. 15) Simplify connlabel to always use 128 bits to store labels and get rid of unused function in xt_connlabel, from Florian. 16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack helper, by Gao Feng. 17) Put back x_tables module reference in nft_compat on error, from Liping Zhang. 18) Add a reference count to the x_tables extensions cache in nft_compat, so we can remove them when unused and avoid a crash if the extensions are rmmod, again from Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-24 22:02:36 -07:00
David S. Miller	de0ba9a0d8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Just several instances of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-24 00:53:32 -04:00
Florian Westphal	f4dc77713f	netfilter: x_tables: speed up jump target validation The dummy ruleset I used to test the original validation change was broken, most rules were unreachable and were not tested by mark_source_chains(). In some cases rulesets that used to load in a few seconds now require several minutes. sample ruleset that shows the behaviour: echo "*filter" for i in $(seq 0 100000);do printf ":chain_%06x - [0:0]\n" $i done for i in $(seq 0 100000);do printf -- "-A INPUT -j chain_%06x\n" $i printf -- "-A INPUT -j chain_%06x\n" $i printf -- "-A INPUT -j chain_%06x\n" $i done echo COMMIT [ pipe result into iptables-restore ] This ruleset will be about 74mbyte in size, with ~500k searches though all 500k[1] rule entries. iptables-restore will take forever (gave up after 10 minutes) Instead of always searching the entire blob for a match, fill an array with the start offsets of every single ipt_entry struct, then do a binary search to check if the jump target is present or not. After this change ruleset restore times get again close to what one gets when reverting `3647234101` (~3 seconds on my workstation). [1] every user-defined rule gets an implicit RETURN, so we get 300k jumps + 100k userchains + 100k returns -> 500k rule entries Fixes: `3647234101` ("netfilter: x_tables: validate targets of jumps") Reported-by: Jeff Wu <wujiafu@gmail.com> Tested-by: Jeff Wu <wujiafu@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-07-18 21:35:23 +02:00
Nikolay Aleksandrov	43b9e12740	net: ipmr/ip6mr: add support for keeping an entry age In preparation for hardware offloading of ipmr/ip6mr we need an interface that allows to check (and later update) the age of entries. Relying on stats alone can show activity but not actual age of the entry, furthermore when there're tens of thousands of entries a lot of the hardware implementations only support "hit" bits which are cleared on read to denote that the entry was active and shouldn't be aged out, these can then be naturally translated into age timestamp and will be compatible with the software forwarding age. Using a lastuse entry doesn't affect performance because the members in that cache line are written to along with the age. Since all new users are encouraged to use ipmr via netlink, this is exported via the RTA_EXPIRES attribute. Also do a minor local variable declaration style adjustment - arrange them longest to shortest. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Shrijeet Mukherjee <shm@cumulusnetworks.com> CC: Satish Ashok <sashok@cumulusnetworks.com> CC: Donald Sharp <sharpd@cumulusnetworks.com> CC: David S. Miller <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-16 20:19:43 -07:00
Michal Kubeček	a612769774	udp: prevent bugcheck if filter truncates packet too much If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi <marco.gra@gmail.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-11 12:43:15 -07:00
Eric Dumazet	927265bc6c	ipv6: do not abuse GFP_ATOMIC in inet6_netconf_notify_devconf() All inet6_netconf_notify_devconf() callers are in process context, so we can use GFP_KERNEL allocations if we take care of not holding a rwlock while not needed in ip6mr (we hold RTNL there) Fixes: `d67b8c616b` ("netconf: advertise mc_forwarding status") Fixes: `f3a1bfb11c` ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-09 18:13:20 -04:00
Simon Horman	49dbe7ae21	sit: support MPLS over IPv4 Extend the SIT driver to support MPLS over IPv4. This implementation extends existing support for IPv6 over IPv4 and IPv4 over IPv4. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-09 17:45:56 -04:00
James Morris	d011a4d861	Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/selinux into next	2016-07-07 10:15:34 +10:00
David S. Miller	30d0844bdc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/mellanox/mlx5/core/en.h drivers/net/ethernet/mellanox/mlx5/core/en_main.c drivers/net/usb/r8152.c All three conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-06 10:35:22 -07:00
David S. Miller	ae3e4562e2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) Don't use userspace datatypes in bridge netfilter code, from Tobin Harding. 2) Iterate only once over the expectation table when removing the helper module, instead of once per-netns, from Florian Westphal. 3) Extra sanitization in xt_hook_ops_alloc() to return error in case we ever pass zero hooks, xt_hook_ops_alloc(): 4) Handle NFPROTO_INET from the logging core infrastructure, from Liping Zhang. 5) Autoload loggers when TRACE target is used from rules, this doesn't change the behaviour in case the user already selected nfnetlink_log as preferred way to print tracing logs, also from Liping Zhang. 6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields by cache lines, increases the size of entries in 11% per entry. From Florian Westphal. 7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian. 8) Remove useless defensive check in nf_logger_find_get() from Shivani Bhardwaj. 9) Remove zone extension as place it in the conntrack object, this is always include in the hashing and we expect more intensive use of zones since containers are in place. Also from Florian Westphal. 10) Owner match now works from any namespace, from Eric Bierdeman. 11) Make sure we only reply with TCP reset to TCP traffic from nf_reject_ipv4, patch from Liping Zhang. 12) Introduce --nflog-size to indicate amount of network packet bytes that are copied to userspace via log message, from Vishwanath Pai. This obsoletes --nflog-range that has never worked, it was designed to achieve this but it has never worked. 13) Introduce generic macros for nf_tables object generation masks. 14) Use generation mask in table, chain and set objects in nf_tables. This allows fixes interferences with ongoing preparation phase of the commit protocol and object listings going on at the same time. This update is introduced in three patches, one per object. 15) Check if the object is active in the next generation for element deactivation in the rbtree implementation, given that deactivation happens from the commit phase path we have to observe the future status of the object. 16) Support for deletion of just added elements in the hash set type. 17) Allow to resize hashtable from /proc entry, not only from the obscure /sys entry that maps to the module parameter, from Florian Westphal. 18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised anymore since we tear down the ruleset whenever the netdevice goes away. 19) Support for matching inverted set lookups, from Arturo Borrero. 20) Simplify the iptables_mangle_hook() by removing a superfluous extra branch. 21) Introduce ether_addr_equal_masked() and use it from the netfilter codebase, from Joe Perches. 22) Remove references to "Use netfilter MARK value as routing key" from the Netfilter Kconfig description given that this toggle doesn't exists already for 10 years, from Moritz Sichert. 23) Introduce generic NF_INVF() and use it from the xtables codebase, from Joe Perches. 24) Setting logger to NONE via /proc was not working unless explicit nul-termination was included in the string. This fixes seems to leave the former behaviour there, so we don't break backward. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-06 09:15:15 -07:00
Martin KaFai Lau	903ce4abdf	ipv6: Fix mem leak in rt6i_pcpu It was first reported and reproduced by Petr (thanks!) in https://bugzilla.kernel.org/show_bug.cgi?id=119581 free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy(). However, after fixing a deadlock bug in commit `9c7370a166` ("ipv6: Fix a potential deadlock when creating pcpu rt"), free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL. It is worth to note that rt6i_pcpu is protected by table->tb6_lock. kmemleak somehow did not report it. We nailed it down by observing the pcpu entries in /proc/vmallocinfo (first suggested by Hannes, thanks!). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Fixes: `9c7370a166` ("ipv6: Fix a potential deadlock when creating pcpu rt") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Tested-by: Petr Novopashenniy <pety@rusnet.ru> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-05 14:09:23 -07:00
Joe Perches	c37a2dfa67	netfilter: Convert FWINV<[foo]> macros and uses to NF_INVF netfilter uses multiple FWINV #defines with identical form that hide a specific structure variable and dereference it with a invflags member. $ git grep "#define FWINV" include/linux/netfilter_bridge/ebtables.h:#define FWINV(bool,invflg) ((bool) ^ !!(info->invflags & invflg)) net/bridge/netfilter/ebtables.c:#define FWINV2(bool, invflg) ((bool) ^ !!(e->invflags & invflg)) net/ipv4/netfilter/arp_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(arpinfo->invflags & (invflg))) net/ipv4/netfilter/ip_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ipinfo->invflags & (invflg))) net/ipv6/netfilter/ip6_tables.c:#define FWINV(bool, invflg) ((bool) ^ !!(ip6info->invflags & (invflg))) net/netfilter/xt_tcpudp.c:#define FWINVTCP(bool, invflg) ((bool) ^ !!(tcpinfo->invflags & (invflg))) Consolidate these macros into a single NF_INVF macro. Miscellanea: o Neaten the alignment around these uses o A few lines are > 80 columns for intelligibility Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-07-03 10:55:07 +02:00
Pablo Neira Ayuso	468b021b94	netfilter: x_tables: simplify ip{6}table_mangle_hook() No need for a special case to handle NF_INET_POST_ROUTING, this is basically the same handling as for prerouting, input, forward. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-07-01 16:37:02 +02:00
Eric Dumazet	19689e38ec	tcp: md5: use kmalloc() backed scratch areas Some arches have virtually mapped kernel stacks, or will soon have. tcp_md5_hash_header() uses an automatic variable to copy tcp header before mangling th->check and calling crypto function, which might be problematic on such arches. David says that using percpu storage is also problematic on non SMP builds. Just use kmalloc() to allocate scratch areas. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 04:02:55 -04:00
David S. Miller	ee58b57100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:03:36 -04:00
Tom Goff	70a0dec451	ipmr/ip6mr: Initialize the last assert time of mfc entries. This fixes wrong-interface signaling on 32-bit platforms for entries created when jiffies > 2^31 + MFC_ASSERT_THRESH. Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-28 04:14:09 -04:00
Huw Davies	4fee5242bf	calipso: Add a label cache. This works in exactly the same way as the CIPSO label cache. The idea is to allow the lsm to cache the result of a secattr lookup so that it doesn't need to perform the lookup for every skbuff. It introduces two sysctl controls: calipso_cache_enable - enables/disables the cache. calipso_cache_bucket_size - sets the size of a cache bucket. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:06:17 -04:00
Huw Davies	2e532b7028	calipso: Add validation of CALIPSO option. Lengths, checksum and the DOI are checked. Checking of the level and categories are left for the socket layer. CRC validation is performed in the calipso module to avoid unconditionally linking crc_ccitt() into ipv6. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:06:17 -04:00
Huw Davies	2917f57b6b	calipso: Allow the lsm to label the skbuff directly. In some cases, the lsm needs to add the label to the skbuff directly. A NF_INET_LOCAL_OUT IPv6 hook is added to selinux to match the IPv4 behaviour. This allows selinux to label the skbuffs that it requires. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:06:15 -04:00
Huw Davies	0868383b82	ipv6: constify the skb pointer of ipv6_find_tlv(). Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:06:15 -04:00
Huw Davies	e1adea9270	calipso: Allow request sockets to be relabelled by the lsm. Request sockets need to have a label that takes into account the incoming connection as well as their parent's label. This is used for the outgoing SYN-ACK and for their child full-socket. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:05:29 -04:00
Huw Davies	56ac42bc94	ipv6: Allow request socks to contain IPv6 options. If set, these will take precedence over the parent's options during both sending and child creation. If they're not set, the parent's options (if any) will be used. This is to allow the security_inet_conn_request() hook to modify the IPv6 options in just the same way that it already may do for IPv4. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:05:28 -04:00
Huw Davies	ceba1832b1	calipso: Set the calipso socket label to match the secattr. CALIPSO is a hop-by-hop IPv6 option. A lot of this patch is based on the equivalent CISPO code. The main difference is due to manipulating the options in the hop-by-hop header. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:51 -04:00
Huw Davies	e67ae213c7	ipv6: Add ipv6_renew_options_kern() that accepts a kernel mem pointer. The functionality is equivalent to ipv6_renew_options() except that the newopt pointer is in kernel, not user, memory The kernel memory implementation will be used by the CALIPSO network labelling engine, which needs to be able to set IPv6 hop-by-hop options. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:50 -04:00
Huw Davies	d7cce01504	netlabel: Add support for removing a CALIPSO DOI. Remove a specified DOI through the NLBL_CALIPSO_C_REMOVE command. It requires the attribute: NLBL_CALIPSO_A_DOI. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:49 -04:00
Huw Davies	e1ce69df7e	netlabel: Add support for enumerating the CALIPSO DOI list. Enumerate the DOI list through the NLBL_CALIPSO_C_LISTALL command. It takes no attributes. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:48 -04:00
Huw Davies	a5e34490c3	netlabel: Add support for querying a CALIPSO DOI. Query a specified DOI through the NLBL_CALIPSO_C_LIST command. It requires the attribute: NLBL_CALIPSO_A_DOI. The reply will contain: NLBL_CALIPSO_A_MTYPE Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:47 -04:00
Huw Davies	cb72d38211	netlabel: Initial support for the CALIPSO netlink protocol. CALIPSO is a packet labelling protocol for IPv6 which is very similar to CIPSO. It is specified in RFC 5570. Much of the code is based on the current CIPSO code. This adds support for adding passthrough-type CALIPSO DOIs through the NLBL_CALIPSO_C_ADD command. It requires attributes: NLBL_CALIPSO_A_TYPE which must be CALIPSO_MAP_PASS. NLBL_CALIPSO_A_DOI. In passthrough mode the CALIPSO engine will map MLS secattr levels and categories directly to the packet label. At this stage, the major difference between this and the CIPSO code is that IPv6 may be compiled as a module. To allow for this the CALIPSO functions are registered at module init time. Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-06-27 15:02:46 -04:00
Paolo Abeni	48f1dcb55a	ipv6: enforce egress device match in per table nexthop lookups with the commit `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups"), net hop lookup is first performed on route creation in the passed-in table. However device match is not enforced in table lookup, so the found route can be later discarded due to egress device mismatch and no global lookup will be performed. This cause the following to fail: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link set dummy1 up ip link set dummy2 up ip route add 2001:db8:8086::/48 dev dummy1 metric 20 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20 ip route add 2001:db8:8086::/48 dev dummy2 metric 21 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21 RTNETLINK answers: No route to host This change fixes the issue enforcing device lookup in ip6_nh_lookup_table() v1->v2: updated commit message title Fixes: `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups") Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-27 10:37:20 -04:00
Eric Dumazet	20e1954fe2	ipv6: RFC 4884 partial support for SIT/GRE tunnels When receiving an ICMPv4 message containing extensions as defined in RFC 4884, and translating it to ICMPv6 at SIT or GRE tunnel, we need some extra manipulation in order to properly forward the extensions. This patch only takes care of Time Exceeded messages as they are the ones that typically carry information from various routers in a fabric during a traceroute session. It also avoids complex skb logic if the data_len is not a multiple of 8. RFC states : The "original datagram" field MUST contain at least 128 octets. If the original datagram did not contain 128 octets, the "original datagram" field MUST be zero padded to 128 octets. In practice routers use 128 bytes of original datagram, not more. Initial translation was added in commit `ca15a078bd` ("sit: generate icmpv6 error when receiving icmpv4 error") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Oussama Ghorbel <ghorbel@pivasoftware.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-18 22:11:39 -07:00
Eric Dumazet	2d7a3b276b	ipv6: translate ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED For better traceroute/mtr support for SIT and GRE tunnels, we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED We also have to translate the IPv4 source IP address of ICMP message to IPv6 v4mapped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-18 22:11:39 -07:00
Eric Dumazet	5fbba8ac93	ip6: move ipip6_err_gen_icmpv6_unreach() We want to use this helper from GRE as well, so this is the time to move it in net/ipv6/icmp.c Also add a @nhs parameter, since SIT and GRE have different values for the header(s) to skip. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-18 22:11:39 -07:00
Eric Dumazet	b1cadc1a09	ipv6: icmp: add a force_saddr param to icmp6_send() SIT or GRE tunnels might want to translate an IPV4 address into a v4mapped one when translating ICMP to ICMPv6. This patch adds the parameter to icmp6_send() but does not change icmpv6_send() signature. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-18 22:11:38 -07:00
David Ahern	afbac6010a	net: ipv6: Address selection needs to consider L3 domains IPv6 version of `3f2fb9a834` ("net: l3mdev: address selection should only consider devices in L3 domain") and the follow up commit, a17b693cdd876 ("net: l3mdev: prefer VRF master for source address selection"). That is, if outbound device is given then the address preference order is an address from that device, an address from the master device if it is enslaved, and then an address from a device in the same L3 domain. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-17 21:25:29 -07:00
David Ahern	0d240e7811	net: vrf: Implement get_saddr for IPv6 IPv6 source address selection needs to consider the real egress route. Similar to IPv4 implement a get_saddr6 method which is called if source address has not been set. The get_saddr6 method does a full lookup which means pulling a route from the VRF FIB table and properly considering linklocal/multicast destination addresses. Lookup failures (eg., unreachable) then cause the source address selection to fail which gets propagated back to the caller. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-17 21:25:29 -07:00
David Ahern	a2e2ff560f	net: ipv6: Move ip6_route_get_saddr to inline VRF driver needs access to ip6_route_get_saddr code. Since it does little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already exported for modules move ip6_route_get_saddr to the header as an inline. Code move only; no functional change. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-17 21:25:29 -07:00
Arnd Bergmann	318d3cc04e	net: xfrm: fix old-style declaration Modern C standards expect the '__inline__' keyword to come before the return type in a declaration, and we get a couple of warnings for this with "make W=1" in the xfrm{4,6}_policy.c files: net/ipv6/xfrm6_policy.c:369:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static int inline xfrm6_net_sysctl_init(struct net net) net/ipv6/xfrm6_policy.c:374:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static void inline xfrm6_net_sysctl_exit(struct net net) net/ipv4/xfrm4_policy.c:339:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static int inline xfrm4_net_sysctl_init(struct net net) net/ipv4/xfrm4_policy.c:344:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration] static void inline xfrm4_net_sysctl_exit(struct net net) Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-16 22:06:30 -07:00
Simon Horman	d5d8760b78	sit: correct IP protocol used in ipip6_err Since `32b8a8e59c` ("sit: add IPv4 over IPv4 support") ipip6_err() may be called for packets whose IP protocol is IPPROTO_IPIP as well as those whose IP protocol is IPPROTO_IPV6. In the case of IPPROTO_IPIP packets the correct protocol value is not passed to ipv4_update_pmtu() or ipv4_redirect(). This patch resolves this problem by using the IP protocol of the packet rather than a hard-coded value. This appears to be consistent with the usage of the protocol of a packet by icmp_socket_deliver() the caller of ipip6_err(). I was able to exercise the redirect case by using a setup where an ICMP redirect was received for the destination of the encapsulated packet. However, it appears that although incorrect the protocol field is not used in this case and thus no problem manifests. On inspection it does not appear that a problem will manifest in the fragmentation needed/update pmtu case either. In short I believe this is a cosmetic fix. None the less, the use of IPPROTO_IPV6 seems wrong and confusing. Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-16 17:10:30 -07:00
Eric Dumazet	e582615ad3	gre: fix error handler 1) gre_parse_header() can be called from gre_err() At this point transport header points to ICMP header, not the inner header. 2) We can not really change transport header as ipgre_err() will later assume transport header still points to ICMP header (using icmp_hdr()) 3) pskb_may_pull() logic in gre_parse_header() really works if we are interested at zone pointed by skb->data 4) As Jiri explained in commit `b7f8fe251e` ("gre: do not pull header in ICMP error processing") we should not pull headers in error handler. So this fix : A) changes gre_parse_header() to use skb->data instead of skb_transport_header() B) Adds a nhs parameter to gre_parse_header() so that we can skip the not pulled IP header from error path. This offset is 0 for normal receive path. C) remove obsolete IPV6 includes Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 22:15:21 -07:00
Tom Herbert	0b797c8589	ila: Fix checksum neutral mapping The algorithm for checksum neutral mapping is incorrect. This problem was being hidden since we were previously always performing checksum offload on the translated addresses and only with IPv6 HW csum. Enabling an ILA router shows the issue. Corrected algorithm: old_loc is the original locator in the packet, new_loc is the value to overwrite with and is found in the lookup table. old_flag is the old flag value (zero of CSUM_NEUTRAL_FLAG) and new_flag is then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG. Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for checksum neutral translation. Solving for diff gives: diff = (old_id - new_id) + (old_flag - new_flag) compute_csum_diff8(new_id, old_id) gives old_id - new_id If old_flag is set old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG Else old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG Tested: - Implemented a user space program that creates random addresses and random locators to overwrite. Compares the checksum over the address before and after translation (must always be equal) - Enabled ILA router and showed proper operation. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 21:40:00 -07:00
Alexander Aring	cc84b3c6b4	ipv6: export several functions This patch exports some neighbour discovery functions which can be used by 6lowpan neighbour discovery ops functionality then. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 20:41:23 -07:00
Alexander Aring	f997c55c1d	ipv6: introduce neighbour discovery ops This patch introduces neighbour discovery ops callback structure. The idea is to separate the handling for 6LoWPAN into the 6lowpan module. These callback offers 6lowpan different handling, such as 802.15.4 short address handling or RFC6775 (Neighbor Discovery Optimization for IPv6 over 6LoWPANs). Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 20:41:23 -07:00
Alexander Aring	4f672235cb	addrconf: put prefix address add in an own function This patch moves the functionality to add a RA PIO prefix generated address in an own function. This move prepares to add a hook for adding a second address for a second link-layer address. E.g. short address for 802.15.4 6LoWPAN. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 20:41:23 -07:00
Alexander Aring	8ec5da4150	ndisc: add __ndisc_fill_addr_option function This patch adds __ndisc_fill_addr_option as low-level function for ndisc_fill_addr_option which doesn't depend on net_device parameter. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 20:41:23 -07:00
Alexander Aring	2ad3ed5919	6lowpan: add 802.15.4 short addr slaac This patch adds the autoconfiguration if a valid 802.15.4 short address is available for 802.15.4 6LoWPAN interfaces. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 20:41:22 -07:00
David Ahern	9ff7438460	net: vrf: Handle ipv6 multicast and link-local addresses IPv6 multicast and link-local addresses require special handling by the VRF driver: 1. Rather than using the VRF device index and full FIB lookups, packets to/from these addresses should use direct FIB lookups based on the VRF device table. 2. fail sends/receives on a VRF device to/from a multicast address (e.g, make ping6 ff02::1%<vrf> fail) 3. move the setting of the flow oif to the first dst lookup and revert the change in icmpv6_echo_reply made in `ca254490c8` ("net: Add VRF support to IPv6 stack"). Linklocal/mcast addresses require use of the skb->dev. With this change connections into and out of a VRF enslaved device work for multicast and link-local addresses work (icmp, tcp, and udp) e.g., 1. packets into VM with VRF config: ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1 ping6 -c3 ff02::1%br1 ssh -6 fe80::e0:f9ff:fe1c:b974%br1 2. packets going out a VRF enslaved device: ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1 ping6 -c3 ff02::1%eth1 ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1 Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 12:34:34 -07:00
David Ahern	ba46ee4c0e	net: ipv6: Do not add multicast route for l3 master devices L3 master devices are virtual devices similar to the loopback device. Link local and multicast routes for these devices do not make sense. The ipv6 addrconf code already skips adding a linklocal address; do the same for the mcast route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-15 12:34:34 -07:00
Su, Xuemin	d1e37288c9	udp reuseport: fix packet of same flow hashed to different socket There is a corner case in which udp packets belonging to a same flow are hashed to different socket when hslot->count changes from 10 to 11: 1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash, and always passes 'daddr' to udp_ehashfn(). 2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2, but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to INADDR_ANY instead of some specific addr. That means when hslot->count changes from 10 to 11, the hash calculated by udp_ehashfn() is also changed, and the udp packets belonging to a same flow will be hashed to different socket. This is easily reproduced: 1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000. 2) From the same host send udp packets to 127.0.0.1:40000, record the socket index which receives the packets. 3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096 is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the same hslot as the aformentioned 10 sockets, and makes the hslot->count change from 10 to 11. 4) From the same host send udp packets to 127.0.0.1:40000, and the socket index which receives the packets will be different from the one received in step 2. This should not happen as the socket bound to 0.0.0.0:44096 should not change the behavior of the sockets bound to 0.0.0.0:40000. It's the same case for IPv6, and this patch also fixes that. Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-14 17:23:09 -04:00
Hannes Frederic Sowa	c148d16369	ipv6: fix checksum annotation in udp6_csum_init Cc: Tom Herbert <tom@herbertland.com> Fixes: `4068579e1e` ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-14 15:26:42 -04:00
Hannes Frederic Sowa	5119bd1681	ipv6: tcp: fix endianness annotation in tcp_v6_send_response Cc: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Fixes: `1d13a96c74` ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-14 15:25:35 -04:00
Hannes Frederic Sowa	dcb94b88c0	ipv6: fix endianness error in icmpv6_err IPv6 ping socket error handler doesn't correctly convert the new 32 bit mtu to host endianness before using. Cc: Lorenzo Colitti <lorenzo@google.com> Fixes: `6d0bfe2261` ("net: ipv6: Add IPv6 support to the ping socket.") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-14 15:24:35 -04:00
Hannes Frederic Sowa	38b7097b55	ipv6: use TOS marks from sockets for routing decision In IPv6 the ToS values are part of the flowlabel in flowi6 and get extracted during fib rule lookup, but we forgot to correctly initialize the flowlabel before the routing lookup. Reported-by: <liam.mcbirnie@boeing.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-11 15:33:26 -07:00
David S. Miller	1578b0a5e9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/sched/act_police.c net/sched/sch_drr.c net/sched/sch_hfsc.c net/sched/sch_prio.c net/sched/sch_red.c net/sched/sch_tbf.c In net-next the drop methods of the packet schedulers got removed, so the bug fixes to them in 'net' are irrelevant. A packet action unload crash fix conflicts with the addition of the new firstuse timestamp. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-10 11:52:24 -07:00
David Ahern	e434863718	net: vrf: Fix crash when IPv6 is disabled at boot time Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is disabled at boot using the kernel option ipv6.disable=1. Using current net-next with the boot option: $ ip link add red type vrf table 1001 Generates: [12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748 [12210.921341] IP: [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.922537] PGD b79e3067 PUD bb32b067 PMD 0 [12210.923479] Oops: 0000 [#1] SMP [12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc [12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235 [12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000 [12210.929328] RIP: 0010:[<ffffffff814b30e3>] [<ffffffff814b30e3>] fib6_get_table+0x2c/0x5a [12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202 [12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28 [12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286 [12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b [12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9 [12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000 [12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000 [12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0 [12210.940278] Stack: [12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135 [12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800 [12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280 [12210.944288] Call Trace: [12210.944688] [<ffffffff814b3135>] fib6_new_table+0x24/0x8a [12210.945516] [<ffffffff81397c88>] vrf_dev_init+0xd4/0x162 [12210.946328] [<ffffffff814091e1>] register_netdevice+0x100/0x396 [12210.947209] [<ffffffff8139823d>] vrf_newlink+0x40/0xb3 [12210.948001] [<ffffffff814187f0>] rtnl_newlink+0x5d3/0x6d5 ... The problem above is due to the fact that the fib hash table is not allocated when IPv6 is disabled at boot. As for the VRF driver it should not do any IPv6 initializations if IPv6 is disabled, so it needs to know if IPv6 is disabled at boot. The disable parameter is private to the IPv6 module, so provide an accessor for modules to determine if IPv6 was disabled at boot time. Fixes: `35402e3136` ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-09 23:34:42 -07:00
Simon Horman	adba931fbc	sit: remove unnecessary protocol check in ipip6_tunnel_xmit() ipip6_tunnel_xmit() is called immediately after checking that skb->protocol is htons(ETH_P_IPV6) so there is no need to check it a second time. Found by inspection. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-09 11:23:37 -07:00
Shweta Choudaha	76e48f9fbe	ip6gre: Allow live link address change The ip6 GRE tap device should not be forced to down state to change the mac address and should allow live address change for tap device similar to ipv4 gre. Signed-off-by: Shweta Choudaha <schoudah@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 22:35:44 -07:00
Shweta Choudaha	0a46baaf63	ip6gre: Allow live link address change The ip6 GRE tap device should not be forced to down state to change the mac address and should allow live address change for tap device similar to ipv4 gre. Signed-off-by: Shweta Choudaha <schoudah@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 22:02:17 -07:00
David Ahern	96c63fa739	net: Add l3mdev rule Currently, VRFs require 1 oif and 1 iif rule per address family per VRF. As the number of VRF devices increases it brings scalability issues with the increasing rule list. All of the VRF rules have the same format with the exception of the specific table id to direct the lookup. Since the table id is available from the oif or iif in the loopup, the VRF rules can be consolidated to a single rule that pulls the table from the VRF device. This patch introduces a new rule attribute l3mdev. The l3mdev rule means the table id used for the lookup is pulled from the L3 master device (e.g., VRF) rather than being statically defined. With the l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev rule per address family (IPv4 and IPv6). If an admin wishes to insert higher priority rules for specific VRFs those rules will co-exist with the l3mdev rule. This capability means current VRF scripts will co-exist with this new simpler implementation. Currently, the rules list for both ipv4 and ipv6 look like this: $ ip ru ls 1000: from all oif vrf1 lookup 1001 1000: from all iif vrf1 lookup 1001 1000: from all oif vrf2 lookup 1002 1000: from all iif vrf2 lookup 1002 1000: from all oif vrf3 lookup 1003 1000: from all iif vrf3 lookup 1003 1000: from all oif vrf4 lookup 1004 1000: from all iif vrf4 lookup 1004 1000: from all oif vrf5 lookup 1005 1000: from all iif vrf5 lookup 1005 1000: from all oif vrf6 lookup 1006 1000: from all iif vrf6 lookup 1006 1000: from all oif vrf7 lookup 1007 1000: from all iif vrf7 lookup 1007 1000: from all oif vrf8 lookup 1008 1000: from all iif vrf8 lookup 1008 ... 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default With the l3mdev rule the list is just the following regardless of the number of VRFs: $ ip ru ls 1000: from all lookup [l3mdev table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default (Note: the above pretty print of the rule is based on an iproute2 prototype. Actual verbage may change) Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 11:36:02 -07:00
Jakub Sitnicki	00bc0ef588	ipv6: Skip XFRM lookup if dst_entry in socket cache is valid At present we perform an xfrm_lookup() for each UDPv6 message we send. The lookup involves querying the flow cache (flow_cache_lookup) and, in case of a cache miss, creating an XFRM bundle. If we miss the flow cache, we can end up creating a new bundle and deriving the path MTU (xfrm_init_pmtu) from on an already transformed dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down to xfrm_lookup(). This can happen only if we're caching the dst_entry in the socket, that is when we're using a connected UDP socket. To put it another way, the path MTU shrinks each time we miss the flow cache, which later on leads to incorrectly fragmented payload. It can be observed with ESPv6 in transport mode: 1) Set up a transformation and lower the MTU to trigger fragmentation # ip xfrm policy add dir out src ::1 dst ::1 \ tmpl src ::1 dst ::1 proto esp spi 1 # ip xfrm state add src ::1 dst ::1 \ proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b # ip link set dev lo mtu 1500 2) Monitor the packet flow and set up an UDP sink # tcpdump -ni lo -ttt & # socat udp6-listen:12345,fork /dev/null & 3) Send a datagram that needs fragmentation with a connected socket # perl -e 'print "@" x 1470 \| socat - udp6:[::1]:12345 2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error 00:00:00.000000 IP6 ::1 > ::1: frag (0\|1448) ESP(spi=0x00000001,seq=0x2), length 1448 00:00:00.000014 IP6 ::1 > ::1: frag (1448\|32) 00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272 (^ ICMPv6 Parameter Problem) 00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136 4) Compare it to a non-connected socket # perl -e 'print "@" x 1500' \| socat - udp6-sendto:[::1]:12345 00:00:40.535488 IP6 ::1 > ::1: frag (0\|1448) ESP(spi=0x00000001,seq=0x6), length 1448 00:00:00.000010 IP6 ::1 > ::1: frag (1448\|64) What happens in step (3) is: 1) when connecting the socket in __ip6_datagram_connect(), we perform an XFRM lookup, miss the flow cache, create an XFRM bundle, and cache the destination, 2) afterwards, when sending the datagram, we perform an XFRM lookup, again, miss the flow cache (due to mismatch of flowi6_iif and flowi6_oif, which is an issue of its own), and recreate an XFRM bundle based on the cached (and already transformed) destination. To prevent the recreation of an XFRM bundle, avoid an XFRM lookup altogether whenever we already have a destination entry cached in the socket. This prevents the path MTU shrinkage and brings us on par with UDPv4. The fix also benefits connected PINGv6 sockets, another user of ip6_sk_dst_lookup_flow(), who also suffer messages being transformed twice. Joint work with Hannes Frederic Sowa. Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 11:16:06 -07:00
Tom Herbert	707a2ca487	ila: Perform only one translation in forwarding path When setting up ILA in a router we noticed that the the encapsulation is invoked twice: once in the route input path and again upon route output. To resolve this we add a flag set_csum_neutral for the ila_update_ipv6_locator. If this flag is set and the checksum neutral bit is also set we assume that checksum-neutral translation has already been performed and take no further action. The flag is set only in ila_output path. The flag is not set for ila_input and ila_xlat. Tested: Used 3 netns to set to emulate a router and two hosts. The router translates SIR addresses between the two destinations in other two netns. Verified ping and netperf are functional. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 00:40:34 -07:00
David Ahern	b4869aa2f8	net: vrf: ipv6 support for local traffic to local addresses Add support for locally originated traffic to VRF-local IPv6 addresses. Similar to IPv4 a local dst is set on the skb and the packet is reinserted with a call to netif_rx. With this patch, ping, tcp and udp packets to a local IPv6 address are successfully routed: $ ip addr show dev eth1 4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1 valid_lft forever preferred_lft forever inet6 2100:1::1/120 scope global valid_lft forever preferred_lft forever inet6 fe80::e0:f9ff:fe1c:b974/64 scope link valid_lft forever preferred_lft forever $ ping6 -c1 -I red 2100:1::1 ping6: Warning: source address might be selected on device other than red. PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms ip6_input is exported so the VRF driver can use it for the dst input function. The dst_alloc function for IPv4 defaults to setting the input and output functions; IPv6's does not. VRF does not need to duplicate the Rx path so just export the ipv6 input function. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-08 00:25:38 -07:00
Yuchung Cheng	ce3cf4ec03	tcp: record TLP and ER timer stats in v6 stats The v6 tcp stats scan do not provide TLP and ER timer information correctly like the v4 version . This patch fixes that. Fixes: `6ba8a3b19e` ("tcp: Tail loss probe (TLP)") Fixes: `eed530b6c6` ("tcp: early retransmit") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-07 17:12:22 -07:00
Marcelo Ricardo Leitner	ae7ef81ef0	skbuff: introduce skb_gso_validate_mtu skb_gso_network_seglen is not enough for checking fragment sizes if skb is using GSO_BY_FRAGS as we have to check frag per frag. This patch introduces skb_gso_validate_mtu, based on the former, which will wrap the use case inside it as all calls to skb_gso_network_seglen were to validate if it fits on a given TMU, and improve the check. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-03 19:37:21 -04:00
Eric Dumazet	ce25d66ad5	Possible problem with `e6afc8ac` ("udp: remove headers from UDP packets before queueing") Paul Moore tracked a regression caused by a recent commit, which mistakenly assumed that sk_filter() could be avoided if socket had no current BPF filter. The intent was to avoid udp_lib_checksum_complete() overhead. But sk_filter() also checks skb_pfmemalloc() and security_sock_rcv_skb(), so better call it. Fixes: `e6afc8ace6` ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Tested-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: samanthakumar <samanthakumar@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-02 18:29:49 -04:00
David S. Miller	fc14963f24	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix incorrect timestamp in nfnetlink_queue introduced when addressing y2038 safe timestamp, from Florian Westphal. 2) Get rid of leftover conntrack definition from the previous merge window, oneliner from Florian. 3) Make nf_queue handler pernet to resolve race on dereferencing the hook state structure with netns removal, from Eric Biederman. 4) Ensure clean exit on unregistered helper ports, from Taehee Yoo. 5) Restore FLOWI_FLAG_KNOWN_NH in nf_dup_ipv6. This got lost while generalizing xt_TEE to add packet duplication support in nf_tables, from Paolo Abeni. 6) Insufficient netlink NFTA_SET_TABLE attribute check in nf_tables_getset(), from Phil Turnbull. 7) Reject helper registration on duplicated ports via modparams. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-01 17:54:19 -07:00
Arnd Bergmann	95e4daa820	fou: fix IPv6 Kconfig options The Kconfig options I added to work around broken compilation ended up screwing up things more, as I used the wrong symbol to control compilation of the file, resulting in IPv6 fou support to never be built into the kernel. Changing CONFIG_NET_FOU_IPV6_TUNNELS to CONFIG_IPV6_FOU fixes that problem, I had renamed the symbol in one location but not the other, and as the file is never being used by other kernel code, this did not lead to a build failure that I would have caught. After that fix, another issue with the same patch becomes obvious, as we 'select INET6_TUNNEL', which is related to IPV6_TUNNEL, but not the same, and this can still cause the original build failure when IPV6_TUNNEL is not built-in but IPV6_FOU is. The fix is equally trivial, we just need to select the right symbol. I have successfully build 350 randconfig kernels with this patch and verified that the driver is now being built. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Valentin Rothberg <valentinrothberg@gmail.com> Fixes: `fabb13db44` ("fou: add Kconfig options for IPv6 support") Signed-off-by: David S. Miller <davem@davemloft.net>	2016-05-31 14:07:49 -07:00

1 2 3 4 5 ...

4923 commits