remarkable-linux

redonkable

Author	SHA1	Message	Date
Florian Westphal	2420b79f8c	netfilter: debug: check for sorted array Make sure our grow/shrink routine places them in the correct order. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-28 17:44:01 +02:00
Aaron Conole	960632ece6	netfilter: convert hook list to an array This converts the storage and layout of netfilter hook entries from a linked list to an array. After this commit, hook entries will be stored adjacent in memory. The next pointer is no longer required. The ops pointers are stored at the end of the array as they are only used in the register/unregister path and in the legacy br_netfilter code. nf_unregister_net_hooks() is slower than needed as it just calls nf_unregister_net_hook in a loop (i.e. at least n synchronize_net() calls), this will be addressed in followup patch. Test setup: - ixgbe 10gbit - netperf UDP_STREAM, 64 byte packets - 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter): empty mangle and raw prerouting, mangle and filter input hooks: 353.9 this patch: 364.2 Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-28 17:44:00 +02:00
Florian Westphal	5fd02ebe65	netfilter: fix a few (harmless) sparse warnings net/netfilter/nft_payload.c:187:18: warning: incorrect type in return expression (expected bool got restricted __sum16 [usertype] check) net/netfilter/nft_exthdr.c:222:14: warning: cast to restricted __be32 net/netfilter/nft_rt.c:49:23: warning: incorrect type in assignment (different base types expected unsigned int got restricted __be32) net/netfilter/nft_rt.c:70:25: warning: symbol 'nft_rt_policy' was not declared. Should it be static? Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-28 17:42:56 +02:00
David S. Miller	af57d2b720	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix use after free of struct proc_dir_entry in ipt_CLUSTERIP, patch from Sabrina Dubroca. 2) Fix spurious EINVAL errors from iptables over nft compatibility layer. 3) Reload pointer to ip header only if there is non-terminal verdict, ie. XT_CONTINUE, otherwise invalid memory access may happen, patch from Taehee Yoo. 4) Fix interaction between SYNPROXY and NAT: SYNPROXY already adds the sequence adjustment, but nf_nat_setup() assumes it has not been added. Patch from Xin Long. 5) Fix burst arithmetic in nft_limit as Joe Stringer mentioned during NFWS in Faro. Patch from Andy Zhou. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-24 11:49:19 -07:00
Florian Westphal	b3480fe059	netfilter: conntrack: make protocol tracker pointers const Doesn't change generated code, but will make it easier to eventually make the actual trackers themselves const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:33 +02:00
Florian Westphal	ea48cc83cf	netfilter: conntrack: print_conntrack only needed if CONFIG_NF_CONNTRACK_PROCFS Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:33 +02:00
Florian Westphal	91950833dd	netfilter: conntrack: place print_tuple in procfs part CONFIG_NF_CONNTRACK_PROCFS is deprecated, no need to use a function pointer in the trackers for this. Place the printf formatting in the one place that uses it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:32 +02:00
Florian Westphal	09ec82f5af	netfilter: conntrack: remove protocol name from l4proto struct no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:32 +02:00
Florian Westphal	a3134d537f	netfilter: conntrack: remove protocol name from l3proto struct no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:32 +02:00
Florian Westphal	0d03510038	netfilter: conntrack: compute l3proto nla size at compile time avoids a pointer and allows struct to be const later on. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 18:52:32 +02:00
andy zhou	c26844eda9	netfilter: nf_tables: Fix nft limit burst handling Current implementation treats the burst configuration the same as rate configuration. This can cause the per packet cost to be lower than configured. In effect, this bug causes the token bucket to be refilled at a higher rate than what the user has specified. This patch changes the implementation so that the token bucket size is controlled by "rate + burst", while keeping the token bucket refill rate the same as the user specified. Fixes: `96518518cc` ("netfilter: add nftables") Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 16:23:17 +02:00
Xin Long	ab6dd1beac	netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info Commit `4440a2ab3b` ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions") wanted to drop the packet when adding the seqadj ext fails for lack of memory, by checking whether nfct_seqadj_ext_add returns NULL. But nfct_seqadj_ext_add can also return NULL when the seqadj ext already exists in a nf_conn. This causes the userspace protocol to stop working when both dnat and snat are configured. Li Shuang found this issue in the following case: Topo: ftp client router ftp server 10.167.131.2 <-> 10.167.131.254 10.167.141.254 <-> 10.167.141.1 Rules: # iptables -t nat -A PREROUTING -i eth1 -p tcp -m tcp --dport 21 -j \ DNAT --to-destination 10.167.141.1 # iptables -t nat -A POSTROUTING -o eth2 -p tcp -m tcp --dport 21 -j \ SNAT --to-source 10.167.141.254 On the router, when both dnat and snat are added, nf_nat_setup_info is called twice. The packet can be dropped on the 2nd call, for DNAT, because the seqadj ext was already added on the 1st call, for SNAT. This patch fixes it by checking for seqadj ext existence before adding it, so that the packet will not be dropped if the seqadj ext already exists. Note that, as Florian mentioned, in the long term we should review ext_add() behaviour; it would be better to return a pointer to the existing ext instead. Fixes: `4440a2ab3b` ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions") Reported-by: Li Shuang <shuali@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-24 16:09:03 +02:00
Florian Westphal	6b5dc98e8f	netfilter: rt: add support to fetch path mss to be used in combination with tcp option set support to mimic iptables TCPMSS --clamp-mss-to-pmtu. v2: Eric Dumazet points out dst must be initialized. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:15:10 +02:00
Florian Westphal	99d1712bc4	netfilter: exthdr: tcp option set support This allows setting 2 and 4 byte quantities in the tcp option space. Main purpose is to allow native replacement for xt_TCPMSS to work around pmtu blackholes. Writes to kind and len are not allowed at the moment, it does not seem useful to do this as it causes corruption of the tcp option space. We can always lift this restriction later if a use-case appears. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:15:10 +02:00
Florian Westphal	5e7d695a48	netfilter: exthdr: split netlink dump function so eval and upcoming eval_set versions can reuse a common helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:15:10 +02:00
Florian Westphal	a18177008b	netfilter: exthdr: factor out tcp option access Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:15:10 +02:00
Geliang Tang	46b20c38f3	netfilter: use audit_log() Use audit_log() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:09:31 +02:00
Taehee Yoo	166327d79d	netfilter: remove prototype of netfilter_queue_init The netfilter_queue_init() function has been removed, so we can remove its prototype as well. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:08:25 +02:00
Taehee Yoo	a2acc54340	netfilter: connlimit: merge root4 and root6. The root4 variable is used only when the connlimit extension module has been stored by the iptables command, and the root6 variable is used only when the connlimit extension module has been stored by the ip6tables command. So the root4 and root6 variables are never used at the same time. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-19 13:07:53 +02:00
David Ahern	4297a0ef08	net: ipv6: add second dif to inet6 socket lookups Add a second device index, sdif, to inet6 socket lookups. sdif is the index for ingress devices enslaved to an l3mdev. It allows the lookups to consider the enslaved device as well as the L3 domain when searching for a socket. TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv tcp_v4_sdif is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-07 11:39:22 -07:00
David Ahern	3fa6f616a7	net: ipv4: add second dif to inet socket lookups Add a second device index, sdif, to inet socket lookups. sdif is the index for ingress devices enslaved to an l3mdev. It allows the lookups to consider the enslaved device as well as the L3 domain when searching for a socket. TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the ingress index is obtained from IPCB using inet_sdif and after the cb move in tcp_v4_rcv the tcp_v4_sdif helper is used. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-07 11:39:21 -07:00
Julia Lawall	549d2d41c1	netfilter: constify nf_loginfo structures The nf_loginfo structures are only passed as the seventh argument to nf_log_trace, which is declared as const or stored in a local const variable. Thus the nf_loginfo structures themselves can be const. Done with the help of Coccinelle. // <smpl> @r disable optional_qualifier@ identifier i; position p; @@ static struct nf_loginfo i@p = { ... }; @ok1@ identifier r.i; expression list[6] es; position p; @@ nf_log_trace(es,&i@p,...) @ok2@ identifier r.i; const struct nf_loginfo *e; position p; @@ e = &i@p @bad@ position p != {r.p,ok1.p,ok2.p}; identifier r.i; struct nf_loginfo e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct nf_loginfo i = { ... }; // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-02 14:25:59 +02:00
Julia Lawall	2a04aabf5c	netfilter: constify nf_conntrack_l3/4proto parameters When a nf_conntrack_l3/4proto parameter is not on the left hand side of an assignment, its address is not taken, and it is not passed to a function that may modify its fields, then it can be declared as const. This change is useful from a documentation point of view, and can possibly facilitate making some nf_conntrack_l3/4proto structures const subsequently. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-08-02 14:25:57 +02:00
Florian Westphal	4d3a57f23d	netfilter: conntrack: do not enable connection tracking unless needed Discussion during NFWS 2017 in Faro has shown that the current conntrack behaviour is unreasonable. Even if conntrack module is loaded on behalf of a single net namespace, it's turned on for all namespaces, which is expensive. Commit `481fa37347` ("netfilter: conntrack: add nf_conntrack_default_on sysctl") attempted to provide an alternative to the 'default on' behaviour by adding a sysctl to change it. However, as Eric points out, the sysctl only becomes available once the module is loaded, and then it's too late. So we either have to move the sysctl to the core, or, alternatively, change conntrack to become active only once the rule set requires this. This does the latter, conntrack is only enabled when a rule needs it. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:42:00 +02:00
Florian Westphal	9b7e26aee7	netfilter: nft_set_rbtree: use seqcount to avoid lock in most cases switch to lockless lookup. write side now also increments sequence counter. On lookup, sample counter value and only take the lock if we did not find a match and the counter has changed. This avoids need to write to private area in normal (lookup) cases. In case we detect a writer (seqretry is true) we fall back to taking the readlock. The readlock is also used during dumps to ensure we get a consistent tree walk. Similar technique (rbtree+seqlock) was used by David Howells in rxrpc. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:41:59 +02:00
Phil Sutter	6150957521	netfilter: nf_tables: Allow object names of up to 255 chars Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:41:59 +02:00
Phil Sutter	387454901b	netfilter: nf_tables: Allow set names of up to 255 chars Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:41:58 +02:00
Phil Sutter	b7263e071a	netfilter: nf_tables: Allow chain name of up to 255 chars Same conversion as for table names, use NFT_NAME_MAXLEN as upper boundary as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:41:57 +02:00
Phil Sutter	e46abbcc05	netfilter: nf_tables: Allow table names of up to 255 chars Allocate all table names dynamically to allow for arbitrary lengths but introduce NFT_NAME_MAXLEN as an upper sanity boundary. Its value was chosen to allow using a domain name as per RFC 1035. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 20:41:57 +02:00
Phil Sutter	6e692678d7	netfilter: nf_tables: No need to check chain existence when tracing nft_trace_notify() is called only from __nft_trace_packet(), which assigns its parameter 'chain' to info->chain. __nft_trace_packet() in turn later dereferences 'chain' unconditionally, which indicates that it's never NULL. Same does nft_do_chain(), the only user of the tracing infrastructure. Hence it is safe to assume the check removed here is not needed. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:14:05 +02:00
Florian Westphal	591bb2789b	netfilter: nf_hook_ops structs can be const We no longer place these on a list so they can be const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:10:44 +02:00
Florian Westphal	5da773a3e8	netfilter: nfnetlink_queue: don't queue dying conntracks to userspace When skb is queued to userspace it leaves softirq/rcu protection. skb->nfct (via conntrack extensions such as helper) could then reference modules that no longer exist if the conntrack was not yet confirmed. nf_ct_iterate_destroy() will set the DYING bit for unconfirmed conntracks, we therefore solve this race as follows: 1. take the queue spinlock. 2. check if the conntrack is unconfirmed and has dying bit set. In this case, we must discard skb while we're still inside rcu read-side section. 3. If nf_ct_iterate_destroy() is called right after the packet is queued to userspace, it will be removed from the queue via nf_ct_iterate_destroy -> nf_queue_nf_hook_drop. When userspace sends the verdict (nfnetlink takes rcu read lock), there are two cases to consider: 1. nf_ct_iterate_destroy() was called while packet was out. In this case, skb will have been removed from the queue already and no reinject takes place as we won't find a matching entry for the packet id. 2. nf_ct_iterate_destroy() gets called right after verdict callback found and removed the skb from queue list. In this case, skb->nfct is marked as dying but it is still valid. The skb will be dropped either in nf_conntrack_confirm (we don't insert DYING conntracks into hash table) or when we try to queue the skb again, but either events don't occur before the rcu read lock is dropped. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:09:39 +02:00
Florian Westphal	e2a750070a	netfilter: conntrack: destroy functions need to free queued packets queued skbs might be using conntrack extensions that are being removed, such as timeout. This happens for skbs that have a skb->nfct in unconfirmed state (i.e., not in hash table yet). This is destructive, but there are only two use cases: - module removal (rare) - netns cleanup (most likely no conntracks exist, and if they do, they are removed anyway later on). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:09:39 +02:00
Florian Westphal	84657984c2	netfilter: add and use nf_ct_unconfirmed_destroy This also removes __nf_ct_unconfirmed_destroy() call from nf_ct_iterate_cleanup_net, so that function can be used only when missing conntracks from unconfirmed list isn't a problem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:09:39 +02:00
Florian Westphal	ac7b848390	netfilter: expect: add and use nf_ct_expect_iterate helpers We have several spots that open-code an expect walk, add a helper that is similar to nf_ct_iterate_destroy/nf_ct_iterate_cleanup. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:09:38 +02:00
subashab@codeaurora.org	a232cd0e0c	netfilter: conntrack: Change to deferrable work queue Delayed workqueue causes wakeups to idle CPUs. This was causing a power impact for devices. Use deferrable work queue instead so that gc_worker runs when CPU is active only. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:03:50 +02:00
Pablo M. Bermudo Garay	6392c22603	netfilter: nf_tables: add fib expression to the netdev family Add fib expression support for netdev family. Like inet family, netdev delegates the actual decision to the corresponding backend, either ipv4 or ipv6. This allows to perform very early reverse path filtering, among other things. You can find more information about fib expression in the `f6d0cbcf09` ("<netfilter: nf_tables: add fib expression>") commit message. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-31 19:01:40 +02:00
stephen hemminger	3754b87a4e	netfilter: remove unused variable warning: ‘recent_old_fops’ defined but not used Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-25 12:31:37 -07:00
Manfred Spraul	3ef0c7a730	net/netfilter/nf_conntrack_core: Fix net_conntrack_lock() As we want to remove spin_unlock_wait() and replace it with explicit spin_lock()/spin_unlock() calls, we can use this to simplify the locking. In addition: - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering. - The new code avoids the backwards loop. Only slightly tested, I did not manage to trigger calls to nf_conntrack_all_lock(). V2: With improved comments, to clearly show how the barriers pair. Fixes: `b16c29191d` ("netfilter: nf_conntrack: use safer way to lock all buckets") Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2017-07-25 10:08:58 -07:00
Phil Sutter	784b4e612d	netfilter: nf_tables: Attach process info to NFT_MSG_NEWGEN notifications This is helpful for 'nft monitor' to track which process caused a given change to the ruleset. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-24 13:25:07 +02:00
Taehee Yoo	0b35f6031a	netfilter: Remove duplicated rcu_read_lock. This patch removes duplicate rcu_read_lock(). 1. IPVS part: According to Julian Anastasov's mention, contexts of ipvs are described at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary: - packet RX/TX: does not need locks because packets come from hooks. - sync msg RX: backup server uses RCU locks while registering new connections. - ip_vs_ctl.c: configuration get/set, RCU locks needed. - xt_ipvs.c: It is a netfilter match, running from hook context. As result, rcu_read_lock and rcu_read_unlock can be removed from: - ip_vs_core.c: all - ip_vs_ctl.c: - only from ip_vs_has_real_service - ip_vs_ftp.c: all - ip_vs_proto_sctp.c: all - ip_vs_proto_tcp.c: all - ip_vs_proto_udp.c: all - ip_vs_xmit.c: all (contains only packet processing) 2. Netfilter part: There are three types of functions that are guaranteed the rcu_read_lock(). First, as result, functions are only called by nf_hook(): - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp(). - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6(). - match_lookup_rt6(), check_hlist(), hashlimit_mt_common(). - xt_osf_match_packet(). Second, functions that caller already held the rcu_read_lock(). - destroy_conntrack(), ctnetlink_conntrack_event(). - ctnl_timeout_find_get(), nfqnl_nf_hook_drop(). Third, functions that are mixed with type1 and type2. These functions are called by nf_hook() also these are called by ordinary functions that already held the rcu_read_lock(): - __ctnetlink_glue_build(), ctnetlink_expect_event(). - ctnetlink_proto_size(). Applied files are below: - nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c. - nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c. - nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c. - xt_connlimit.c, xt_hashlimit.c, xt_osf.c Detailed calltrace can be found at: http://marc.info/?l=netfilter-devel&m=149667610710350&w=2 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-24 13:24:46 +02:00
Pablo Neira Ayuso	9f08ea8481	netfilter: nf_tables: keep chain counters away from hot path These chain counters are only used by the iptables-compat tool, which allows users to use the x_tables extensions from the existing nf_tables framework. This patch makes nf_tables faster by ~5% for the general usecase, ie. native nft users, where no chain counters are used at all. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-24 12:23:16 +02:00
Florian Westphal	56a97e701c	netfilter: expect: add to hash table after expect init assuming we have lockless readers we should make sure they can only see expectations that have already been initialized. hlist_add_head_rcu acts as memory barrier, move it after timer setup. Theoretically we could crash due to a del_timer() on other cpu seeing garbage data. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-24 12:20:10 +02:00
Linus Torvalds	96080f6977	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) BPF verifier signed/unsigned value tracking fix, from Daniel Borkmann, Edward Cree, and Josef Bacik. 2) Fix memory allocation length when setting up calls to ->ndo_set_mac_address, from Cong Wang. 3) Add a new cxgb4 device ID, from Ganesh Goudar. 4) Fix FIB refcount handling, we have to set its initial value before the configure callback (which can bump it). From David Ahern. 5) Fix double-free in qcom/emac driver, from Timur Tabi. 6) A bunch of gcc-7 string format overflow warning fixes from Arnd Bergmann. 7) Fix link level headroom tests in ip_do_fragment(), from Vasily Averin. 8) Fix chunk walking in SCTP when iterating over error and parameter headers. From Alexander Potapenko. 9) TCP BBR congestion control fixes from Neal Cardwell. 10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger. 11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong Wang. 12) xmit_recursion in ppp driver needs to be per-device not per-cpu, from Gao Feng. 13) Cannot release skb->dst in UDP if IP options processing needs it. From Paolo Abeni. 14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander Levin and myself. 15) Revert some rtnetlink notification changes that are causing regressions, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits) net: bonding: Fix transmit load balancing in balance-alb mode rds: Make sure updates to cp_send_gen can be observed net: ethernet: ti: cpsw: Push the request_irq function to the end of probe ipv4: initialize fib_trie prior to register_netdev_notifier call. rtnetlink: allocate more memory for dev_set_mac_address() net: dsa: b53: Add missing ARL entries for BCM53125 bpf: more tests for mixed signed and unsigned bounds checks bpf: add test for mixed signed and unsigned bounds checks bpf: fix up test cases with mixed signed/unsigned bounds bpf: allow to specify log level and reduce it for test_verifier bpf: fix mixed signed/unsigned derived min/max value bounds ipv6: avoid overflow of offset in ip6_find_1stfragopt net: tehuti: don't process data if it has not been copied from userspace Revert "rtnetlink: Do not generate notifications for CHANGEADDR event" net: dsa: mv88e6xxx: Enable CMODE config support for 6390X dt-binding: ptp: Add SoC compatibility strings for dte ptp clock NET: dwmac: Make dwmac reset unconditional net: Zero terminate ifr_name in dev_ifname(). wireless: wext: terminate ifr name coming from userspace netfilter: fix netfilter_net_init() return ...	2017-07-20 16:33:39 -07:00
Pablo Neira Ayuso	f7fb77fc12	netfilter: nft_compat: check extension hook mask only if set If the x_tables extension comes with no hook mask, skip this validation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-19 11:53:30 +02:00
Dan Carpenter	073dd5ad34	netfilter: fix netfilter_net_init() return We accidentally return an uninitialized variable. Fixes: `cf56c2f892` ("netfilter: remove old pre-netns era hook api") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-18 14:50:28 -07:00
Florian Westphal	36ac344e16	netfilter: expect: fix crash when putting uninited expectation We crash in __nf_ct_expect_check, it calls nf_ct_remove_expect on the uninitialised expectation instead of existing one, so del_timer chokes on random memory address. Fixes: `ec0e3f0111` ("netfilter: nf_ct_expect: Add nf_ct_remove_expect()") Reported-by: Sergey Kvachonok <ravenexp@gmail.com> Tested-by: Sergey Kvachonok <ravenexp@gmail.com> Cc: Gao Feng <fgao@ikuai8.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-17 17:03:12 +02:00
Florian Westphal	97772bcd56	netfilter: nat: fix src map lookup When doing initial conversion to rhashtable I replaced the bucket walk with a single rhashtable_lookup_fast(). When moving to rhlist I failed to properly walk the list of identical tuples, but that is what is needed for this to work correctly. The table contains the original tuples, so the reply tuples are all distinct. We currently decide that mapping is (not) in range only based on the first entry, but in case it's not we need to try the reply tuple of the next entry until we either find an in-range mapping or we checked all the entries. This bug makes nat core attempt collision resolution while it might be able to use the mapping as-is. Fixes: `870190a9ec` ("netfilter: nat: convert nat bysrc hash to rhashtable") Reported-by: Jaco Kroon <jaco@uls.co.za> Tested-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-17 17:02:19 +02:00
Florian Westphal	cf56c2f892	netfilter: remove old pre-netns era hook api no more users in the tree, remove this. The old api is racy wrt. module removal, all users have been converted to the netns-aware api. The old api pretended we still have global hooks but that has not been true for a long time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-17 17:01:10 +02:00
Mateusz Jurczyk	f55ce7b024	netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv Verify that the length of the socket buffer is sufficient to cover the nlmsghdr structure before accessing the nlh->nlmsg_len field for further input sanitization. If the client only supplies 1-3 bytes of data in sk_buff, then nlh->nlmsg_len remains partially uninitialized and contains leftover memory from the corresponding kernel allocation. Operating on such data may result in indeterminate evaluation of the nlmsg_len < NLMSG_HDRLEN expression. The bug was discovered by a runtime instrumentation designed to detect use of uninitialized memory in the kernel. The patch prevents this and other similar tools (e.g. KMSAN) from flagging this behavior in the future. Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-07-17 13:27:46 +02:00
Michal Hocko	eacd86ca3b	net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info() xt_alloc_table_info() basically opencodes kvmalloc() so use the library function instead. Link: http://lkml.kernel.org/r/20170531155145.17111-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-12 16:26:02 -07:00
David S. Miller	c644bd79c0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains two Netfilter fixes for your net tree, they are: 1) Fix memleak from netns release path of conntrack protocol trackers, patch from Liping Zhang. 2) Uninitialized flags field in ebt_log, that results in unpredictable logging format in ebtables, also from Liping. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-06 14:02:22 +01:00
Xin Long	4ae70c0845	sctp: remove the typedef sctp_inithdr_t This patch is to remove the typedef sctp_inithdr_t, and replace with struct sctp_inithdr in the places where it's using this typedef. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-01 09:08:42 -07:00
Xin Long	922dbc5be2	sctp: remove the typedef sctp_chunkhdr_t This patch is to remove the typedef sctp_chunkhdr_t, and replace with struct sctp_chunkhdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type), especially in sctp_new. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-01 09:08:41 -07:00
Xin Long	ae146d9b76	sctp: remove the typedef sctp_sctphdr_t This patch is to remove the typedef sctp_sctphdr_t, and replace with struct sctphdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-01 09:08:41 -07:00
Reshetova, Elena	41c6d650f6	net: convert sock.sk_refcnt from atomic_t to refcount_t The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This helps avoid accidental refcounter overflows that might lead to use-after-free situations. This patch uses refcount_inc_not_zero() instead of atomic_inc_not_zero_hint() due to the absence of a _hint() version of the refcount API. If the _hint() version must be used, we might need to revisit the API. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-07-01 07:39:08 -07:00
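For readers unfamiliar with the "increment unless zero" pattern that the sk_refcnt commit above relies on, here is a minimal userspace sketch using C11 atomics. The name `ref_inc_not_zero()` is hypothetical; the kernel's `refcount_inc_not_zero()` additionally saturates on overflow, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the object is still live (count != 0).
 * Returns false when the count has already dropped to zero, meaning
 * the object is being freed and must not be touched. */
static bool ref_inc_not_zero(atomic_uint *ref)
{
    unsigned int old;

    assert(ref != NULL);

    old = atomic_load(ref);
    while (old != 0) {
        /* On failure, old is reloaded with the current value and the
         * loop re-checks it against zero before retrying. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup path would call this on an object found in a shared table and drop the object if it returns false, rather than resurrecting a count that has already reached zero.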
David S. Miller	52a623bd61	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. This batch contains connection tracking updates for the cleanup iteration path, patches from Florian Westphal: X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set dying bit to let the CPU release them. X) Add nf_ct_iterate_destroy() to be used on module removal, to kill conntracks from all namespaces. X) Restart iteration on hashtable resizing, since both may occur at the same time. X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT mapping on module removal. X) Use nf_ct_iterate_destroy() to remove conntrack entries on helper module removal, from Liping Zhang. X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension if user requests this, also from Liping. X) Add net_ns_barrier() and use it from FTP helper, to make sure no concurrent namespace removal happens at the same time while the helper module is being removed. X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce module size. Same thing in nf_tables. Updates for the nf_tables infrastructure: X) Prepare usage of the extended ACK reporting infrastructure for nf_tables. X) Remove unnecessary forward declaration in nf_tables hash set. X) Skip set size estimation if number of elements is not specified. X) Changes to accommodate a (faster) unresizable hash set implementation, for anonymous sets and dynamic size fixed sets with no timeouts. X) Faster lookup function for unresizable hash table for 2 and 4 bytes key. And, finally, a bunch of assorted small updates and cleanups: X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe to device events and look up the index from the packet path, this is fixing an issue that is present since the very beginning, patch from Xin Long. 
X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal. X) Use ebt_invalid_target() whenever possible in the ebtables tree, from Gao Feng. X) Calm down compilation warning in nf_dup infrastructure, patch from stephen hemminger. X) Statify functions in nftables rt expression, also from stephen. X) Update Makefile to use canonical method to specify nf_tables-objs. From Jike Song. X) Use nf_conntrack_helpers_register() in amanda and H323. X) Space cleanup for ctnetlink, from linzhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-30 06:27:09 -07:00
Liping Zhang	deaa0a976b	netfilter: nf_ct_dccp/sctp: fix memory leak after netns cleanup After running the following commands for a while, kmemleak reported that "1879 new suspected memory leaks" happened: # while : ; do ip netns add test ip netns delete test done unreferenced object 0xffff88006342fa38 (size 1024): comm "ip", pid 15477, jiffies 4295982857 (age 957.836s) hex dump (first 32 bytes): b8 b0 4d a0 ff ff ff ff c0 34 c3 59 00 88 ff ff ..M......4.Y.... 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300 [<ffffffff812302d0>] kmemdup+0x20/0x50 [<ffffffffa04d598a>] dccp_init_net+0x8a/0x160 [nf_conntrack] [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90 ... unreferenced object 0xffff88006342da58 (size 1024): comm "ip", pid 15477, jiffies 4295982857 (age 957.836s) hex dump (first 32 bytes): 10 b3 4d a0 ff ff ff ff 04 35 c3 59 00 88 ff ff ..M......5.Y.... 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300 [<ffffffff812302d0>] kmemdup+0x20/0x50 [<ffffffffa04d6a9d>] sctp_init_net+0x5d/0x130 [nf_conntrack] [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90 ... This is because we forgot to implement the get_net_proto for sctp and dccp, so we won't invoke the nf_ct_unregister_sysctl to free the ctl_table when do netns cleanup. Also note, we will fail to register the sysctl for dccp/sctp either due to the lack of get_net_proto. 
Fixes: `c51d39010a` ("netfilter: conntrack: built-in support for DCCP") Fixes: `a85406afeb` ("netfilter: conntrack: built-in support for SCTP") Cc: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-29 18:47:01 +02:00
Pablo Neira Ayuso	04ba724b65	netfilter: nfnetlink: extended ACK reporting Pass down struct netlink_ext_ack as parameter to all of our nfnetlink subsystem callbacks, so we can work on follow up patches to provide finer grain error reporting using the new infrastructure that `2d4bc93368` ("netlink: extended ACK reporting") provides. No functional change, just pass down this new object to callbacks. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:38:24 +02:00
Florian Westphal	d8297d4f3e	netfilter: nf_tables: reduce chain type table size text data bss dec hex filename old: 151590 2240 1152 154982 25d66 net/netfilter/nf_tables_api.o new: 151666 2240 416 154322 25ad2 net/netfilter/nf_tables_api.o Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:20:59 +02:00
Florian Westphal	b7b5fda468	netfilter: conntrack: use NFPROTO_MAX to size array We don't support anything larger than NFPROTO_MAX, so we can shrink this a bit: text data dec hex filename old: 8259 1096 9355 248b net/netfilter/nf_conntrack_proto.o new: 8259 624 8883 22b3 net/netfilter/nf_conntrack_proto.o Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:20:49 +02:00
Liping Zhang	d53e3fc390	netfilter: use nf_conntrack_helpers_register when possible amanda_helper, nf_conntrack_helper_ras and nf_conntrack_helper_q931 are all arrays, so we can use nf_conntrack_helpers_register to register the ct helper, this will help us to eliminate some "goto errX" statements. Also introduce h323_helper_init/exit helper function to register the ct helpers, this is prepared for the followup patch, which will add net namespace support for ct helper. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:13:21 +02:00
Jike Song	2becbbc547	netfilter, kbuild: use canonical method to specify objs. Should use ":=" instead of "+=". Signed-off-by: Jike Song <jike.song@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:09:20 +02:00
Florian Westphal	7866cc57b5	netns: add and use net_ns_barrier Quoting Joe Stringer: If a user loads nf_conntrack_ftp, sends FTP traffic through a network namespace, destroys that namespace then unloads the FTP helper module, then the kernel will crash. Events that lead to the crash: 1. conntrack is created with ftp helper in netns x 2. This netns is destroyed 3. netns destruction is scheduled 4. netns destruction wq starts, removes netns from global list 5. ftp helper is unloaded, which resets all helpers of the conntracks via for_each_net() but because netns is already gone from list the for_each_net() loop doesn't include it, therefore all of these conntracks are unaffected. 6. helper module unload finishes 7. netns wq invokes destructor for rmmod'ed helper CC: "Eric W. Biederman" <ebiederm@xmission.com> Reported-by: Joe Stringer <joe@ovn.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:09:19 +02:00
Florian Westphal	2c41f33c1b	netfilter: move table iteration out of netns exit paths We only need to iterate & remove in case of module removal; for netns destruction all conntracks will be removed anyway. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-06-19 19:09:19 +02:00
Johannes Berg	4df864c1d9	networking: make skb_put & friends return void pointers It seems like a historic accident that these return unsigned char *, and in many places that means casts are required, more often than not. Make these functions (skb_put, __skb_put and pskb_put) return void * and remove all the casts across the tree, adding a (u8 *) cast only where the unsigned char pointer was used directly, all done with the following spatch: @@ expression SKB, LEN; typedef u8; identifier fn = { skb_put, __skb_put }; @@ - *(fn(SKB, LEN)) + *(u8 *)fn(SKB, LEN) @@ expression E, SKB, LEN; identifier fn = { skb_put, __skb_put }; type T; @@ - E = ((T *)(fn(SKB, LEN))) + E = fn(SKB, LEN) which actually doesn't cover pskb_put since there are only three users overall. A handful of stragglers were converted manually, notably a macro in drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many instances in net/bluetooth/hci_sock.c. In the former file, I also had to fix one whitespace problem spatch introduced. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-16 11:48:39 -04:00
David S. Miller	216fe8f021	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-06-06 22:20:08 -04:00
Liping Zhang	34158151d2	netfilter: cttimeout: use nf_ct_iterate_cleanup_net to unlink timeout objs Similar to nf_conntrack_helper, we can use nf_ct_iterate_cleanup_net to remove this copy & paste code. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:24 +02:00
Liping Zhang	ff1acc4964	netfilter: nf_ct_helper: use nf_ct_iterate_destroy to unlink helper objs When we unlink the helper objects, we will iterate the nf_conntrack_hash, iterate the unconfirmed list, handle the hash resize situation, etc. Actually this logic is same as the nf_ct_iterate_destroy, so we can use it to remove these copy & paste code. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:23 +02:00
Pablo Neira Ayuso	446a8268b7	netfilter: nft_set_hash: add lookup variant for fixed size hashtable This patch provides a faster variant of the lookup function for 2 and 4 byte keys. Optimizing the one byte case is not worthwhile, as the set backend selection will always select the bitmap set type for such a case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:22 +02:00
Pablo Neira Ayuso	6c03ae210c	netfilter: nft_set_hash: add non-resizable hashtable implementation This patch adds a simple non-resizable hashtable implementation. If the user specifies the set size, then this new faster hashtable flavour is selected. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:21 +02:00
Pablo Neira Ayuso	1ff75a3e9a	netfilter: nf_tables: allow large allocations for new sets The new fixed size hashtable backend implementation may result in a large array of buckets that would spew splats from mm. Update this code to fall back on vmalloc in case the memory allocation order is too costly. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:20 +02:00
Pablo Neira Ayuso	2111515abc	netfilter: nft_set_hash: add nft_hash_buckets() Add nft_hash_buckets() helper function to calculate the number of hashtable buckets based on the elements. This function can be reused from the follow up patch to add non-resizable hashtables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:19 +02:00
Pablo Neira Ayuso	347b408d59	netfilter: nf_tables: pass set description to ->privsize The new non-resizable hashtable variant needs this to calculate the size of the bucket array. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:18 +02:00
Pablo Neira Ayuso	2b664957c2	netfilter: nf_tables: select set backend flavour depending on description This patch adds the infrastructure to support several implementations of the same set type. This selection will be based on the set description and the features available for this set. This allow us to select set backend implementation that will result in better performance numbers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:17 +02:00
Pablo Neira Ayuso	5fc6ced958	netfilter: nft_set_hash: use nft_rhash prefix for resizable set backend This patch prepares the introduction of a non-resizable hashtable implementation that is significantly faster. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:16 +02:00
Pablo Neira Ayuso	080ed636a5	netfilter: nf_tables: no size estimation if number of set elements is unknown This size estimation is ignored by the existing set backend selection logic, since this estimation structure is stack allocated, set this to ~0 to make it easier to catch bugs in future changes. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:15 +02:00
Pablo Neira Ayuso	187388bc3d	netfilter: nft_set_hash: unnecessary forward declaration Replace struct rhashtable_params forward declaration by the structure definition itself. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:14 +02:00
Florian Westphal	8f23f35f1e	netfilter: nat: destroy nat mappings on module exit path only We don't need pernetns cleanup anymore. If the netns is being destroyed, conntrack netns exit will kill all entries in this namespace, and neither conntrack hash table nor bysource hash are per namespace. For the rmmod case, we have to make sure we remove all entries from the nat bysource table, so call the new nf_ct_iterate_destroy in module exit path. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:13 +02:00
Florian Westphal	0d02d5646e	netfilter: conntrack: restart iteration on resize We could miss some conntracks when a resize occurs in parallel. Avoid this by sampling generation seqcnt and doing a restart if needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:11 +02:00
Florian Westphal	2843fb6998	netfilter: conntrack: add nf_ct_iterate_destroy sledgehammer to be used on module unload (to remove affected conntracks from all namespaces). It will also flag all unconfirmed conntracks as dying, i.e. they will not be committed to main table. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:10 +02:00
Florian Westphal	b0feacaad1	netfilter: conntrack: don't call iter for non-confirmed conntracks nf_ct_iterate_cleanup_net currently calls the iter() callback also for conntracks on the unconfirmed list, but this is unsafe. Accesses to nf_conn are fine, but some users access the extension area in the iter() callback, and that only works reliably for confirmed conntracks (ct->ext can be reallocated at any time for an unconfirmed conntrack). The second issue is that there is a short window where a conntrack entry is neither on the list nor in the table: To confirm an entry, it is first removed from the unconfirmed list, then inserted into the table. Fix this by iterating the unconfirmed list first and marking all entries as dying, then wait for rcu grace period. This makes sure all entries that were about to be confirmed either are in the main table, or will be dropped soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:09 +02:00
Florian Westphal	9fd6452d67	netfilter: conntrack: rename nf_ct_iterate_cleanup There are several places where we needlessly call nf_ct_iterate_cleanup, we should instead iterate the full table at module unload time. This is a leftover from back when the conntrack table got duplicated per net namespace. So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net. A later patch will then add a non-net variant. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:46:08 +02:00
stephen hemminger	cad4394453	netfilter: nft_rt: make local functions static Resolves warnings: net/netfilter/nft_rt.c:26:6: warning: no previous prototype for ‘nft_rt_get_eval’ [-Wmissing-prototypes] net/netfilter/nft_rt.c:75:5: warning: no previous prototype for ‘nft_rt_get_init’ [-Wmissing-prototypes] net/netfilter/nft_rt.c:106:5: warning: no previous prototype for ‘nft_rt_get_dump’ [-Wmissing-prototypes] Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 12:45:59 +02:00
stephen hemminger	a32770b1e7	netfilter: dup: resolve warnings about missing prototypes Missing include file causes: net/netfilter/nf_dup_netdev.c:26:6: warning: no previous prototype for ‘nf_fwd_netdev_egress’ [-Wmissing-prototypes] net/netfilter/nf_dup_netdev.c:40:6: warning: no previous prototype for ‘nf_dup_netdev_egress’ [-Wmissing-prototypes] Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 11:32:36 +02:00
linzhang	04b80ceadc	netfilter: ctnetlink: delete extra spaces This patch cleans up extra spaces. Signed-off-by: linzhang <xiaolou4617@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-29 11:32:29 +02:00
Liping Zhang	fefa92679d	netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize If nf_conntrack_htable_size was adjusted by the user during the ct dump operation, we may invoke nf_ct_put twice for the same ct, i.e. the "last" ct. This will cause the ct to be freed while still linked in hash buckets. It's very easy to reproduce the problem by the following commands: # while : ; do echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets done # while : ; do conntrack -L done # iperf -s 127.0.0.1 & # iperf -c 127.0.0.1 -P 60 -t 36000 After a while, the system will hang like this: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bash:20184] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [iperf:20382] ... So at last if we find cb->args[1] is equal to "last", this means hash resize happened, then we can set cb->args[1] to 0 to fix the above issue. Fixes: `d205dc4079` ("[NETFILTER]: ctnetlink: fix deadlock in table dumping") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-24 11:26:01 +02:00
Liping Zhang	124dffea9e	netfilter: nat: use atomic bit op to clear the _SRC_NAT_DONE_BIT We need to clear the IPS_SRC_NAT_DONE_BIT to indicate that the ct has been removed from nat_bysource table. But unfortunately, we use the non-atomic bit operation: "ct->status &= ~IPS_NAT_DONE_MASK". So there's a race condition that we may clear the _DYING_BIT set by another CPU unexpectedly. Since we don't care about the IPS_DST_NAT_DONE_BIT, so just using clear_bit to clear the IPS_SRC_NAT_DONE_BIT is enough. Also note, this is the last user which use the non-atomic bit operation to update the confirmed ct->status. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-23 22:54:51 +02:00
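The race described in the nat commit above comes from a plain read-modify-write (`status &= ~mask`) overwriting a bit that another CPU set in between. A minimal userspace sketch of the atomic alternative, using C11 atomics in place of the kernel's `set_bit()`/`clear_bit()`, is shown here. The helper names and the bit positions in the enum are illustrative assumptions, not copied from the kernel headers.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit positions; treat the exact numbers as assumptions. */
enum { SRC_NAT_DONE_BIT = 7, DST_NAT_DONE_BIT = 8, DYING_BIT = 9 };

/* Atomically set one bit; other bits are never rewritten. */
static void flag_set(atomic_ulong *status, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(status, 1UL << bit);
}

/* Atomically clear one bit. A concurrent flag_set() on a different bit
 * cannot be lost, unlike with a non-atomic "status &= ~mask". */
static void flag_clear(atomic_ulong *status, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_and(status, ~(1UL << bit));
}

/* Read the current state of one bit. */
static bool flag_test(atomic_ulong *status, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    return (atomic_load(status) >> bit) & 1UL;
}
```

Clearing only the single bit that the caller owns also matches the commit's reasoning that the DST_NAT_DONE bit does not need to be touched at all.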
Pablo Neira Ayuso	d2df92e98a	netfilter: nft_set_rbtree: handle element re-addition after deletion The existing code selects no next branch to be inspected when re-inserting an inactive element into the rb-tree, looping endlessly. This patch restricts the check for active elements to the EEXIST case only. Fixes: `e701001e7c` ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates") Reported-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Tested-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-23 22:54:14 +02:00
Davide Caratti	f3c0eb05e2	netfilter: conntrack: fix false CRC32c mismatch using paged skb sctp_compute_cksum() implementation assumes that at least the SCTP header is in the linear part of skb: modify conntrack error callback to avoid false CRC32c mismatch, if the transport header is partially/entirely paged. Fixes: `cf6e007eef` ("netfilter: conntrack: validate SCTP crc32c in PREROUTING") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-23 22:54:14 +02:00
David S. Miller	218b6a5b23	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-05-22 23:32:48 -04:00
David S. Miller	23416e2304	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) When using IPVS in direct-routing mode, normal traffic from the LVS host to a back-end server is sometimes incorrectly NATed on the way back into the LVS host. Patch to fix this from Julian Anastasov. 2) Calm down clang compilation warning in ctnetlink due to type mismatch, from Matthias Kaehlcke. 3) Do not re-setup NAT for conntracks that are already confirmed, this is fixing a problem that was introduced in the previous nf-next batch. Patch from Liping Zhang. 4) Do not allow conntrack helper removal from userspace cthelper infrastructure if already in use. This comes with an initial patch to introduce nf_conntrack_helper_put() that is required by this fix. From Liping Zhang. 5) Zero the pad when copying data to userspace, otherwise iptables fails to remove rules. This is a follow up on the patchset that sorts out the internal match/target structure pointer leak to userspace. Patch from the same author, Willem de Bruijn. This also comes with a fix for a build failure when CONFIG_COMPAT is not on, coming in the last patch of this series. 6) SYNPROXY crashes with conntrack entries that are created via ctnetlink, more specifically via conntrackd state sync. Patch from Eric Leblond. 7) RCU safe iteration on set element dumping in nf_tables, from Liping Zhang. 8) Missing sanitization of immediate data for the bitwise and cmp expressions in nf_tables. 9) Refcounting logic for chain and objects from set elements does not integrate into the nf_tables 2-phase commit protocol. 10) Missing sanitization of target verdict in ebtables arpreply target, from Gao Feng. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-21 13:00:02 -04:00
Willem de Bruijn	751a9c7638	netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT The patch in the Fixes references COMPAT_XT_ALIGN in the definition of XT_DATA_TO_USER, outside an #ifdef CONFIG_COMPAT block. Split XT_DATA_TO_USER into separate compat and non compat variants and define the first inside an CONFIG_COMPAT block. This simplifies both variants by removing branches inside the macro. Fixes: `324318f024` ("netfilter: xtables: zero padding in data_to_user") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-18 13:10:03 +02:00
Eric Dumazet	9a568de481	tcp: switch TCP TS option (RFC 7323) to 1ms clock TCP Timestamps option is defined in RFC 7323. Traditionally on Linux, it has been tied to the internal 'jiffies' variable, because it had been a cheap and good enough generator. For TCP flows on the Internet, 1 ms resolution would be much better than 4ms or 10ms (HZ=250 or HZ=100 respectively). For TCP flows in the DC, Google has used usec resolution for more than two years with great success [1]. Receive size autotuning (DRS) is indeed more precise and converges faster to optimal window size. This patch converts tp->tcp_mstamp to a plain u64 value storing a 1 usec TCP clock. This choice will allow us to upstream the 1 usec TS option as discussed in IETF 97. [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-17 16:06:01 -04:00
Pablo Neira Ayuso	591054469b	netfilter: nf_tables: revisit chain/object refcounting from elements Andreas reports that the following incremental update using our commit protocol doesn't work. # nft -f incremental-update.nft delete element ip filter client_to_any { 10.180.86.22 : goto CIn_1 } delete chain ip filter CIn_1 ... Error: Could not process rule: Device or resource busy The existing code is not well-integrated into the commit phase protocol, since element deletions do not result in refcount decrement from the preparation phase. This results in bogus EBUSY errors like the one above. Two new functions come with this patch: * nft_set_elem_activate() function is used from the abort path, to restore the set element refcounting on objects that occurred from the preparation phase. * nft_set_elem_deactivate() that is called from nft_del_setelem() to decrement set element refcounting on objects from the preparation phase in the commit protocol. The nft_data_uninit() has been renamed to nft_data_release() since this function does not uninitialize any data store in the data register, instead just releases the references to objects. Moreover, a new function nft_data_hold() has been introduced to be used from nft_set_elem_activate(). Reported-by: Andreas Schultz <aschultz@tpip.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:51:41 +02:00
Pablo Neira Ayuso	71df14b0ce	netfilter: nf_tables: missing sanitization in data from userspace Do not assume userspace always sends us NFT_DATA_VALUE for bitwise and cmp expressions. Although NFT_DATA_VERDICT does not make any sense, it is still possible to handcraft a netlink message using this incorrect data type. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:51:40 +02:00
Liping Zhang	fa803605ee	netfilter: nf_tables: can't assume lock is acquired when dumping set elems When dumping the elements related to a specified set, we may invoke the nf_tables_dump_set with the NFNL_SUBSYS_NFTABLES lock not acquired. So we should use the proper rcu operation to avoid race condition, just like other nft dump operations. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:51:39 +02:00
Eric Leblond	87e94dbc21	netfilter: synproxy: fix conntrackd interaction This patch fixes the creation of connection tracking entry from netlink when synproxy is used. It was missing the addition of the synproxy extension. This was causing kernel crashes when a conntrack entry created by conntrackd was used after the switch of traffic from active node to the passive node. Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:51:39 +02:00
Willem de Bruijn	324318f024	netfilter: xtables: zero padding in data_to_user When looking up an iptables rule, the iptables binary compares the aligned match and target data (XT_ALIGN). In some cases this can exceed the actual data size to include padding bytes. Before commit `f77bc5b23f` ("iptables: use match, target and data copy_to_user helpers") the malloc()ed bytes were overwritten by the kernel with kzalloced contents, zeroing the padding and making the comparison succeed. After this patch, the kernel copies and clears only data, leaving the padding bytes undefined. Extend the clear operation from data size to aligned data size to include the padding bytes, if any. Padding bytes can be observed in both match and target, and the bug triggered, by issuing a rule with match icmp and target ACCEPT: iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT Fixes: `f77bc5b23f` ("iptables: use match, target and data copy_to_user helpers") Reported-by: Paul Moore <pmoore@redhat.com> Reported-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:51:38 +02:00
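The fix described in the data_to_user commit above amounts to "copy the payload, then clear up to the aligned size". A minimal userspace sketch of that idea follows, with `memcpy`/`memset` standing in for the kernel's user-copy and user-clear primitives. The helper names `align_up()` and `copy_with_zeroed_padding()` are hypothetical, and the caller is assumed to provide a destination large enough for the aligned size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round size up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Copy size bytes of payload, then zero the padding that follows it.
 * dst must have room for align_up(size, align) bytes.
 * Returns the total number of bytes written. */
static size_t copy_with_zeroed_padding(unsigned char *dst, const void *src,
                                       size_t size, size_t align)
{
    size_t aligned = align_up(size, align);

    memcpy(dst, src, size);                /* the actual data */
    memset(dst + size, 0, aligned - size); /* padding, if any */
    return aligned;
}
```

With the padding cleared, a byte-wise comparison over the aligned length, such as the one the iptables binary performs when looking up a rule, no longer depends on whatever the destination buffer held before the copy.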
Pablo Neira Ayuso	ff1e4300cf	Merge tag 'ipvs-fixes-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.12 please consider this fix to IPVS for v4.12. * It is a fix from Julian Anastasov to SNAT packet replies only for NATed connections. My understanding is that this fix is appropriate for 4.9.25, 4.10.13, 4.11 as well as the nf tree. Julian has separately posted backports for other -stable kernels; please see: * [PATCH 3.2.88,3.4.113 -stable 1/3] ipvs: SNAT packet replies only for NATed connections * [PATCH 3.10.105,3.12.73,3.16.43,4.1.39 -stable 2/3] ipvs: SNAT packet replies only for NATed connections * [PATCH 4.4.65 -stable 3/3] ipvs: SNAT packet replies only for NATed connections ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:50:12 +02:00
Liping Zhang	9338d7b441	netfilter: nfnl_cthelper: reject del request if helper obj is in use We can still delete the ct helper even if it is in use, this will cause a use-after-free error. In more detail, I mean: # nfct helper add ssdp inet udp # iptables -t raw -A OUTPUT -p udp -j CT --helper ssdp # nfct helper delete ssdp //--> oops, succeed! BUG: unable to handle kernel paging request at 000026ca IP: 0x26ca [...] Call Trace: ? ipv4_helper+0x62/0x80 [nf_conntrack_ipv4] nf_hook_slow+0x21/0xb0 ip_output+0xe9/0x100 ? ip_fragment.constprop.54+0xc0/0xc0 ip_local_out+0x33/0x40 ip_send_skb+0x16/0x80 udp_send_skb+0x84/0x240 udp_sendmsg+0x35d/0xa50 So add reference count to fix this issue, if ct helper is used by others, reject the delete request. Apply this patch: # nfct helper delete ssdp nfct v1.4.3: netlink error: Device or resource busy Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:29 +02:00
Liping Zhang	d91fc59cd7	netfilter: introduce nf_conntrack_helper_put helper function And convert module_put invocation to nf_conntrack_helper_put, this is prepared for the followup patch, which will add a refcnt for cthelper, so we can reject the deleting request when cthelper is in use. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:29 +02:00
Liping Zhang	d110a3942a	netfilter: don't setup nat info for confirmed ct We cannot setup nat info if the ct has been confirmed already, else, different cpu may race to handle the same ct. In extreme situation, we may hit the "BUG_ON(nf_nat_initialized(ct, maniptype))" in the nf_nat_setup_info. Also running the following commands will easily hit NF_CT_ASSERT in nf_conntrack_alter_reply: # nft flush ruleset # ping -c 2 -W 1 1.1.1.111 & # nft add table t # nft add chain t c {type nat hook postrouting priority 0 \;} # nft add rule t c snat to 4.5.6.7 WARNING: CPU: 1 PID: 10065 at net/netfilter/nf_conntrack_core.c:1472 nf_conntrack_alter_reply+0x9a/0x1a0 [nf_conntrack] [...] Call Trace: nf_nat_setup_info+0xad/0x840 [nf_nat] ? deactivate_slab+0x65d/0x6c0 nft_nat_eval+0xcd/0x100 [nft_nat] nft_do_chain+0xff/0x5d0 [nf_tables] ? mark_held_locks+0x6f/0xa0 ? __local_bh_enable_ip+0x70/0xa0 ? trace_hardirqs_on_caller+0x11f/0x190 ? ipt_do_table+0x310/0x610 ? trace_hardirqs_on+0xd/0x10 ? __local_bh_enable_ip+0x70/0xa0 ? ipt_do_table+0x32b/0x610 ? __lock_acquire+0x2ac/0x1580 ? ipt_do_table+0x32b/0x610 nft_nat_do_chain+0x65/0x80 [nft_chain_nat_ipv4] nf_nat_ipv4_fn+0x1ae/0x240 [nf_nat_ipv4] nf_nat_ipv4_out+0x4a/0xf0 [nf_nat_ipv4] nft_nat_ipv4_out+0x15/0x20 [nft_chain_nat_ipv4] nf_hook_slow+0x2c/0xf0 ip_output+0x154/0x270 So for the confirmed ct, just ignore it and return NF_ACCEPT. Fixes: `9a08ecfe74` ("netfilter: don't attach a nat extension by default") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:28 +02:00
Matthias Kaehlcke	a2b7cbdd25	netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch Not all parameters passed to ctnetlink_parse_tuple() and ctnetlink_exp_dump_tuple() match the enum type in the signatures of these functions. Since this is intended, change the argument type to be an unsigned integer value. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:10:27 +02:00
Linus Torvalds	de4d195308	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes are: - Debloat RCU headers - Parallelize SRCU callback handling (plus overlapping patches) - Improve the performance of Tree SRCU on a CPU-hotplug stress test - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Open-code the rcu_cblist_n_lazy_cbs() function rcu: Open-code the rcu_cblist_n_cbs() function rcu: Open-code the rcu_cblist_empty() function rcu: Separately compile large rcu_segcblist functions srcu: Debloat the <linux/rcu_segcblist.h> header srcu: Adjust default auto-expediting holdoff srcu: Specify auto-expedite holdoff time srcu: Expedite first synchronize_srcu() when idle srcu: Expedited grace periods with reduced memory contention srcu: Make rcutorture writer stalls print SRCU GP state srcu: Exact tracking of srcu_data structures containing callbacks srcu: Make SRCU be built by default srcu: Fix Kconfig botch when SRCU not selected rcu: Make non-preemptive schedule be Tasks RCU quiescent state srcu: Expedite srcu_schedule_cbs_snp() callback invocation srcu: Parallelize callback handling kvm: Move srcu_struct fields to end of struct kvm rcu: Fix typo in PER_RCU_NODE_PERIOD header comment rcu: Use true/false in assignment to bool rcu: Use bool value directly ...	2017-05-10 10:30:46 -07:00
Michal Hocko	19809c2da2	mm, vmalloc: use __GFP_HIGHMEM implicitly __vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Julian Anastasov	3c5ab3f395	ipvs: SNAT packet replies only for NATed connections We do not check if packet from real server is for NAT connection before performing SNAT. This causes problems for setups that use DR/TUN and allow local clients to access the real server directly, for example: - local client in director creates IPVS-DR/TUN connection CIP->VIP and the request packets are routed to RIP. Talks are finished but IPVS connection is not expired yet. - second local client creates non-IPVS connection CIP->RIP with same reply tuple RIP->CIP and when replies are received on LOCAL_IN we wrongly assign them for the first client connection because RIP->CIP matches the reply direction. As result, IPVS SNATs replies for non-IPVS connections. The problem is more visible to local UDP clients but in rare cases it can happen also for TCP or remote clients when the real server sends the reply traffic via the director. So, better to be more precise for the reply traffic. As replies are not expected for DR/TUN connections, better to not touch them. Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk> Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-05-08 11:38:35 +02:00
Linus Torvalds	4ac4d58488	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) The wireless rate info fix from Johannes Berg. 2) When a RAW socket is in hdrincl mode, we need to make sure that the user provided at least a minimally sized ipv4/ipv6 header. Fix from Alexander Potapenko. 3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using nla_put_string() so that it is NULL terminated. 4) Fix a bug in TCP fastopen handling, wherein child sockets erroneously inherit the fastopen_req from the parent, and later can end up derefencing freed memory or doing a double free. From Eric Dumazet. 5) Don't clear out netdev stats at close time in tg3 driver, from YueHaibing. 6) Fix refcount leak in xt_CT, from Gao Feng. 7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang. 8) Fix deadlock due to taking the expectation lock twice, also from Liping Zhang. 9) Make xt_socket work again with ipv6, from Peter Tirsek. 10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo Abeni. 11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry layout. From Jesper Dangaard Brouer. 12) Fix ethtool reported device name in aquantia driver, from Pavel Belous. 13) Fix build failures due to the compile time size test not working in netfilter conntrack. From Geert Uytterhoeven. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) cfg80211: make RATE_INFO_BW_20 the default ipv6: initialize route null entry in addrconf_init() qede: Fix possible misconfiguration of advertised autoneg value. qed: Fix overriding of supported autoneg value. qed*: Fix possible overflow for status block id field. 
rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string netvsc: make sure napi enabled before vmbus_open aquantia: Fix driver name reported by ethtool ipv4, ipv6: ensure raw socket message is big enough to hold an IP header net/sched: remove redundant null check on head tcp: do not inherit fastopen_req from parent forcedeth: remove unnecessary carrier status check ibmvnic: Move queue restarting in ibmvnic_tx_complete ibmvnic: Record SKB RX queue during poll ibmvnic: Continue skb processing after skb completion error ibmvnic: Check for driver reset first in ibmvnic_xmit ibmvnic: Wait for any pending scrqs entries at driver close ibmvnic: Clean up tx pools when closing ibmvnic: Whitespace correction in release_rx_pools ibmvnic: Delete napi's when releasing driver resources ...	2017-05-04 12:26:43 -07:00
Linus Torvalds	46f0537b1e	Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit Pull audit updates from Paul Moore: "Fourteen audit patches for v4.12 that span the full range of fixes, new features, and internal cleanups. We have a patches to move to 64-bit timestamps, convert refcounts from atomic_t to refcount_t, track PIDs using the pid struct instead of pid_t, convert our own private audit buffer cache to a standard kmem_cache, log kernel module names when they are unloaded, and normalize the NETFILTER_PKT to make the userspace folks happier. From a fixes perspective, the most important is likely the auditd connection tracking RCU fix; it was a rather brain dead bug that I'll take the blame for, but thankfully it didn't seem to affect many people (only one report). I think the patch subject lines and commit descriptions do a pretty good job of explaining the details and why the changes are important so I'll point you there instead of duplicating it here; as usual, if you have any questions you know where to find us. 
We also manage to take out more code than we put in this time, that always makes me happy :)" * 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit: audit: fix the RCU locking for the auditd_connection structure audit: use kmem_cache to manage the audit_buffer cache audit: Use timespec64 to represent audit timestamps audit: store the auditd PID as a pid struct instead of pid_t audit: kernel generated netlink traffic should have a portid of 0 audit: combine audit_receive() and audit_receive_skb() audit: convert audit_watch.count from atomic_t to refcount_t audit: convert audit_tree.count from atomic_t to refcount_t audit: normalize NETFILTER_PKT netfilter: use consistent ipv4 network offset in xt_AUDIT audit: log module name on delete_module audit: remove unnecessary semicolon in audit_watch_handle_event() audit: remove unnecessary semicolon in audit_mark_handle_event() audit: remove unnecessary semicolon in audit_field_valid()	2017-05-03 09:21:59 -07:00
David S. Miller	4d89ac2dd5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS/OVS fixes for net The following patchset contains a rather large batch of Netfilter, IPVS and OVS fixes for your net tree. This includes fixes for ctnetlink, the userspace conntrack helper infrastructure, conntrack OVS support, ebtables DNAT target, several leaks in error path among other. More specifically, they are: 1) Fix reference count leak in the CT target error path, from Gao Feng. 2) Remove conntrack entry clashing with a matching expectation, patch from Jarno Rajahalme. 3) Fix bogus EEXIST when registering two different userspace helpers, from Liping Zhang. 4) Don't leak dummy elements in the new bitmap set type in nf_tables, from Liping Zhang. 5) Get rid of module autoload from conntrack update path in ctnetlink, we don't need autoload at this late stage and it is happening with rcu read lock held which is not good. From Liping Zhang. 6) Fix deadlock due to double-acquire of the expect_lock from conntrack update path, this fixes a bug that was introduced when the central spinlock got removed. Again from Liping Zhang. 7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock protection that was selected when the central spinlock was removed was not really protecting anything at all. 8) Protect sequence adjustment under ct->lock. 9) Missing socket match with IPv6, from Peter Tirsek. 10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from Linus Luessing. 11) Don't give up on evaluating the expression on new entries added via dynset expression in nf_tables, from Liping Zhang. 12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals with non-linear skbuffs. 13) Don't allow IPv6 service in IPVS if no IPv6 support is available, from Paolo Abeni. 14) Missing mutex release in error path of xt_find_table_lock(), from Dan Carpenter. 15) Update maintainers files, Netfilter section. 
Add Florian to the file, refer to nftables.org and change project status from Supported to Maintained. 16) Bail out on mismatching extensions in element updates in nf_tables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-03 10:11:26 -04:00
Geert Uytterhoeven	ab71632c45	netfilter: conntrack: Force inlining of build check to prevent build failure If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the build will fail with: net/built-in.o: In function `nf_conntrack_init_start': (.text+0x9baf6): undefined reference to `__compiletime_assert_1893' or ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined! Fix this by forcing inlining of total_extension_size(). Fixes: `b3a5db109e` ("netfilter: conntrack: use u8 for extension sizes again") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-03 09:51:26 -04:00
Pablo Neira Ayuso	9744a6fcef	netfilter: nf_tables: check if same extensions are set when adding elements If no NLM_F_EXCL is set and the element already exists in the set, make sure that both elements have the same extensions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-03 10:58:00 +02:00
Richard Guy Briggs	2173c519d5	audit: normalize NETFILTER_PKT Eliminate flipping in and out of message fields, dropping fields in the process. Sample raw message format IPv4 UDP: type=NETFILTER_PKT msg=audit(1487874761.386:228): mark=0xae8a2732 saddr=127.0.0.1 daddr=127.0.0.1 proto=17^] Sample raw message format IPv6 ICMP6: type=NETFILTER_PKT msg=audit(1487874761.381:227): mark=0x223894b7 saddr=::1 daddr=::1 proto=58^] Issue: https://github.com/linux-audit/audit-kernel/issues/11 Test case: https://github.com/linux-audit/audit-testsuite/issues/43 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2017-05-02 10:16:04 -04:00
Richard Guy Briggs	0cb88b6ff0	netfilter: use consistent ipv4 network offset in xt_AUDIT Even though the skb->data pointer has been moved from the link layer header to the network layer header, use the same method to calculate the offset in ipv4 and ipv6 routines. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: munged subject line] Signed-off-by: Paul Moore <paul@paul-moore.com>	2017-05-02 10:16:04 -04:00
David S. Miller	a01aa920b8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree. A large bunch of code cleanups, simplify the conntrack extension codebase, get rid of the fake conntrack object, speed up netns by selective synchronize_net() calls. More specifically, they are: 1) Check for ct->status bit instead of using nfct_nat() from IPVS and Netfilter codebase, patch from Florian Westphal. 2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao. 3) Simplify FTP IPVS helper module registration path, from Arushi Singhal. 4) Introduce nft_is_base_chain() helper function. 5) Enforce expectation limit from userspace conntrack helper, from Gao Feng. 6) Add nf_ct_remove_expect() helper function, from Gao Feng. 7) NAT mangle helper function return boolean, from Gao Feng. 8) ctnetlink_alloc_expect() should only work for conntrack with helpers, from Gao Feng. 9) Add nfnl_msg_type() helper function to nfnetlink to build the netlink message type. 10) Get rid of unnecessary cast on void, from simran singhal. 11) Use seq_puts()/seq_putc() instead of seq_printf() where possible, also from simran singhal. 12) Use list_prev_entry() from nf_tables, from simran signhal. 13) Remove unnecessary & on pointer function in the Netfilter and IPVS code. 14) Remove obsolete comment on set of rules per CPU in ip6_tables, no longer true. From Arushi Singhal. 15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng. 16) Remove unnecessary nested rcu_read_lock() in __nf_nat_decode_session(). Code running from hooks are already guaranteed to run under RCU read side. 17) Remove deadcode in nf_tables_getobj(), from Aaron Conole. 18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(), also from Aaron. 19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole. 
20) Don't propagate NF_DROP error to userspace via ctnetlink in __nf_nat_alloc_null_binding() function, from Gao Feng. 21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks, from Gao Feng. 22) Kill the fake untracked conntrack objects, use ctinfo instead to annotate a conntrack object is untracked, from Florian Westphal. 23) Remove nf_ct_is_untracked(), now obsolete since we have no conntrack template anymore, from Florian. 24) Add event mask support to nft_ct, also from Florian. 25) Move nf_conn_help structure to include/net/netfilter/nf_conntrack_helper.h. 26) Add a fixed 32 bytes scratchpad area for conntrack helpers. Thus, we don't deal with variable conntrack extensions anymore. Make sure userspace conntrack helper doesn't go over that size. Remove variable size ct extension infrastructure now this code got no more clients. From Florian Westphal. 27) Restore offset and length of nf_ct_ext structure to 8 bytes now that wraparound is not possible any longer, also from Florian. 28) Allow to get rid of unassured flows under stress in conntrack, this applies to DCCP, SCTP and TCP protocols, from Florian. 29) Shrink size of nf_conntrack_ecache structure, from Florian. 30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker, from Gao Feng. 31) Register SYNPROXY hooks on demand, from Florian Westphal. 32) Use pernet hook whenever possible, instead of global hook registration, from Florian Westphal. 33) Pass hook structure to ebt_register_table() to consolidate some infrastructure code, from Florian Westphal. 34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the SYNPROXY code, to make sure device stats are not fooled, patch from Gao Feng. 35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we don't need anymore if we just select a fixed size instead of expensive runtime time calculation of this. From Florian. 36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(), from Florian. 
37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from Florian. 38) Attach NAT extension on-demand from masquerade and pptp helper path, from Florian. 39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole. 40) Speed up netns by selective calls of synchronize_net(), from Florian Westphal. 41) Silence stack size warning gcc in 32-bit arch in snmp helper, from Florian. 42) Unconditionally call nf_ct_ext_destroy(), even if we have no extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-01 10:47:53 -04:00
Liping Zhang	8eeef23504	netfilter: nf_ct_ext: invoke destroy even when ext is not attached For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table, then remove it from the nat_bysource_table via nat_extend->destroy. But now, the nat extension is attached on demand, so if the nat extension is not attached, we will not be notified when the ct is destroyed, i.e. we may fail to remove ct from the nat_bysource_table. So just keep it simple, even if the extension is not attached, we will still invoke the related ext->destroy. And this will also preserve the flexibility for the future extension. Fixes: `9a08ecfe74` ("netfilter: don't attach a nat extension by default") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:48:49 +02:00
Pablo Neira Ayuso	d1908ca8dc	Merge tag 'ipvs3-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Third Round of IPVS Updates for v4.12 please consider these enhancements to IPVS for v4.12. If it is too late for v4.12 then please consider them for v4.13. * Remove unused function * Correct comparison of unsigned value ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:46:50 +02:00
Florian Westphal	039b40ee58	netfilter: nf_queue: only call synchronize_net twice if nf_queue is active nf_unregister_net_hook(s) can avoid a second call to synchronize_net, provided there is no nfqueue active in that net namespace (which is the common case). This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally this gets called during netns cleanup so no packets should be queued. For the rare case of base chain being unregistered or module removal while nfqueue is in use the extra hiccup due to the packet drops isn't a big deal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:19:12 +02:00
Florian Westphal	c83fa19603	netfilter: nf_log: don't call synchronize_rcu in nf_log_unset nf_log_unregister() (which is what gets called in the logger backends module exit paths) does a (required, module is removed) synchronize_rcu(). But nf_log_unset() is only called from pernet exit handlers. It doesn't free any memory so there appears to be no need to call synchronize_rcu. v2: Liping Zhang points out that nf_log_unregister() needs to be called after pernet unregister, else rmmod would become unsafe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:19:07 +02:00
Florian Westphal	933bd83ed6	netfilter: batch synchronize_net calls during hook unregister synchronize_net is expensive and slows down netns cleanup a lot. We have two APIs to unregister a hook: nf_unregister_net_hook (which calls synchronize_net()) and nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop). Make nf_unregister_net_hook a wrapper around new helper __nf_unregister_net_hook, which unlinks the hook but does not free it. Then, we can call that helper in nf_unregister_net_hooks and then call synchronize_net() only once. Andrey Konovalov reports this change improves syzkaller fuzzing speed at least twice. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:18:54 +02:00
Pablo Neira Ayuso	1a41dbce0d	Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.11 I would also like it considered for stable. * Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled to avoid oops caused by IPVS accessing IPv6 routing code in such circumstances. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 16:22:11 +02:00
Dan Carpenter	7dde07e9c5	netfilter: x_tables: unlock on error in xt_find_table_lock() According to my static checker we should unlock here before the return. That seems reasonable to me as well. Fixes: `b9e69e1273` ("netfilter: xtables: don't hook tables by default") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 15:49:48 +02:00
Paolo Abeni	1442f6f7c1	ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled When creating a new ipvs service, ipv6 addresses are always accepted if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family is not explicitly checked. This allows the user-space to configure ipvs services even if the system is booted with ipv6.disable=1. On specific configuration, ipvs can try to call ipv6 routing code at setup time, causing the kernel to oops due to fib6_rules_ops being NULL. This change addresses the issue adding a check for the ipv6 module being enabled while validating ipv6 service operations and adding the same validation for dest operations. According to git history, this issue is apparently present since the introduction of ipv6 support, and the oops can be triggered since commit `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Fixes: `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:04:35 +02:00
Aaron Conole	fb90e8dedb	ipvs: change comparison on sync_refresh_period The sync_refresh_period variable is unsigned, so it can never be < 0. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Aaron Conole	65ba101ebc	ipvs: remove unused function ip_vs_set_state_timeout There are no in-tree callers of this function and it isn't exported. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Florian Westphal	9a08ecfe74	netfilter: don't attach a nat extension by default nowadays the NAT extension only stores the interface index (used to purge connections that got masqueraded when interface goes down) and pptp nat information. Previous patches moved nf_ct_nat_ext_add to those places that need it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	2fe7c321ab	netfilter: pptp: attach nat extension when needed make sure nat extension gets added if the master conntrack is subject to NAT. This will be required once the nat core stops adding it by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	22d4536d2c	netfilter: conntrack: handle initial extension alloc via krealloc krealloc(NULL, ..) is same as kmalloc(), so we can avoid special-casing the initial allocation after the prealloc removal (we had to use ->alloc_len as the initial allocation size). This also means we do not zero the preallocated memory anymore; only offsets[]. Existing code makes sure the new (used) extension space gets zeroed out. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	23f671a1b5	netfilter: conntrack: mark extension structs as const Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	54044b1f02	netfilter: conntrack: remove prealloc support It was used by the nat extension, but since commit `7c96643519` ("netfilter: move nat hlist_head to nf_conn") it's only needed for connections that use MASQUERADE target or a nat helper. Also it seems a lot easier to preallocate a fixed size instead. With default settings, conntrack first adds ecache extension (sysctl defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte on x86_64 for initial allocation. Followup patches can constify the extension structs and avoid the initial zeroing of the entire extension area. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	efe4160618	ipvs: convert to use pernet nf_hook api nf_(un)register_hooks has to maintain an internal hook list to add/remove those hooks from net namespaces as they are added/deleted. ipvs already uses pernet_ops, so we can switch to the (more recent) pernet hook api instead. Compile tested only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Liping Zhang	277a292835	netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded Currently, after adding the following nft rules: # nft add set x target1 { type ipv4_addr \; flags timeout \;} # nft add rule x y set add ip daddr timeout 1d @target1 counter the counters will always be zero regardless of whether the elements are added to the dynamic set "target1" or not, because we break the nft expr traversal unconditionally: # nft list ruleset ... set target1 { ... elements = { 8.8.8.8 expires 23h59m53s} } chain output { ... set add ip daddr timeout 1d @target1 counter packets 0 bytes 0 ^ ^ ... } Since we add the elements to the set successfully, we should continue to the next expression. Additionally, if elements are added to "flow table" successfully, we will _always_ continue to the next expr, even if the operation is _OP_ADD. So it's better to keep them consistent. Fixes: `22fe54d5fe` ("netfilter: nf_tables: add support for dynamic set updates") Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-25 11:10:37 +02:00
Peter Tirsek	6bd3d19292	netfilter: xt_socket: Fix broken IPv6 handling Commit `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") used the outdated XT_SOCKET_HAVE_IPV6 macro which was removed earlier in commit `8db4c5be88` ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c"). With that macro never being defined, the xt_socket match emits an "Unknown family 10" warning when used with IPv6: WARNING: CPU: 0 PID: 1377 at net/netfilter/xt_socket.c:160 socket_mt_enable_defrag+0x47/0x50 [xt_socket] Unknown family 10 Modules linked in: xt_socket nf_socket_ipv4 nf_socket_ipv6 nf_defrag_ipv4 [...] CPU: 0 PID: 1377 Comm: ip6tables-resto Not tainted 4.10.10 #1 Hardware name: [...] Call Trace: ? __warn+0xe7/0x100 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? warn_slowpath_fmt+0x39/0x40 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_v2_check+0x12/0x40 [xt_socket] ? xt_check_match+0x6b/0x1a0 [x_tables] ? xt_find_match+0x93/0xd0 [x_tables] ? xt_request_find_match+0x20/0x80 [x_tables] ? translate_table+0x48e/0x870 [ip6_tables] ? translate_table+0x577/0x870 [ip6_tables] ? walk_component+0x3a/0x200 ? kmalloc_order+0x1d/0x50 ? do_ip6t_set_ctl+0x181/0x490 [ip6_tables] ? filename_lookup+0xa5/0x120 ? nf_setsockopt+0x3a/0x60 ? ipv6_setsockopt+0xb0/0xc0 ? sock_common_setsockopt+0x23/0x30 ? SyS_socketcall+0x41d/0x630 ? vfs_read+0xfa/0x120 ? do_fast_syscall_32+0x7a/0x110 ? entry_SYSENTER_32+0x47/0x71 This patch brings the conditional back in line with how the rest of the file handles IPv6. Fixes: `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") Signed-off-by: Peter Tirsek <peter@tirsek.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	64f3967c7a	netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj We should acquire the ct->lock before accessing or modifying the nf_ct_seqadj, as another CPU may modify the nf_ct_seqadj at the same time during its packet processing. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	53b56da83d	netfilter: ctnetlink: make it safer when updating ct->status After converting to use rcu for conntrack hash, one CPU may update the ct->status via ctnetlink, while another CPU may process the packets and update the ct->status. So the non-atomic operation "ct->status \|= status;" via ctnetlink becomes unsafe, and this may clear the IPS_DYING_BIT bit set by another CPU unexpectedly. For example: CPU0 CPU1 ctnetlink_change_status __nf_conntrack_find_get old = ct->status nf_ct_gc_expired - nf_ct_kill - test_and_set_bit(IPS_DYING_BIT new = old \| status; - ct->status = new; <-- oops, _DYING_ is cleared! Now use a series of atomic bit operations to solve the above issue. Also note, the user shouldn't set IPS_TEMPLATE, IPS_SEQ_ADJUST directly, so make these two bits unchangeable too. If we set the IPS_TEMPLATE_BIT, ct will be freed by nf_ct_tmpl_free, but actually it is allocated by nf_conntrack_alloc. If we set the IPS_SEQ_ADJUST_BIT, this may cause a NULL pointer dereference, as nfct_seqadj(ct) may be NULL. Last, add some comments to describe the logic change due to the commit `a963d710f3` ("netfilter: ctnetlink: Fix regression in CTA_STATUS processing"), which I found a little confusing. Fixes: `76507f69c4` ("[NETFILTER]: nf_conntrack: use RCU for conntrack hash") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
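The lost-update race described in commit `53b56da83d` can be sketched in userspace C. This is a minimal illustration, not the kernel patch: it uses C11 `<stdatomic.h>` in place of the kernel's atomic bit operations, and every `demo_`/`DEMO_` name (including the bit values) is hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for conntrack status flags; values are illustrative. */
#define DEMO_DYING   (1UL << 0)
#define DEMO_ASSURED (1UL << 1)

/* Hypothetical stand-in for the status word of a conntrack entry. */
struct demo_ct {
	atomic_ulong status;
};

/*
 * Unsafe pattern: the caller read old_snapshot earlier, then writes back
 * old_snapshot | bits. Any bit another thread set between the read and
 * this store is silently overwritten.
 */
static inline void demo_set_status_racy(struct demo_ct *ct,
					unsigned long old_snapshot,
					unsigned long bits)
{
	atomic_store(&ct->status, old_snapshot | bits);
}

/*
 * Safe pattern: a single atomic read-modify-write, so bits set
 * concurrently by another thread are preserved.
 */
static inline void demo_set_status_atomic(struct demo_ct *ct,
					  unsigned long bits)
{
	atomic_fetch_or(&ct->status, bits);
}
```

Replaying the interleaving from the commit message by hand (snapshot the status, let the "other CPU" set `DEMO_DYING`, then call `demo_set_status_racy`) drops the dying flag, whereas `demo_set_status_atomic` keeps both flags set.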
Liping Zhang	88be4c09d9	netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice Currently, ctnetlink_change_conntrack is always protected by _expect_lock, but this will cause a deadlock when deleting the helper from a conntrack, as the _expect_lock will be acquired again by nf_ct_remove_expectations: CPU0 ---- lock(nf_conntrack_expect_lock); lock(nf_conntrack_expect_lock); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by lt-conntrack_gr/12853: #0: (&table[i].mutex){+.+.+.}, at: [<ffffffffa05e2009>] nfnetlink_rcv_msg+0x399/0x6a9 [nfnetlink] #1: (nf_conntrack_expect_lock){+.....}, at: [<ffffffffa05f2c1f>] ctnetlink_new_conntrack+0x17f/0x408 [nf_conntrack_netlink] Call Trace: dump_stack+0x85/0xc2 __lock_acquire+0x1608/0x1680 ? ctnetlink_parse_tuple_proto+0x10f/0x1c0 [nf_conntrack_netlink] lock_acquire+0x100/0x1f0 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] _raw_spin_lock_bh+0x3f/0x50 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] ctnetlink_change_helper+0xc6/0x190 [nf_conntrack_netlink] ctnetlink_new_conntrack+0x1b2/0x408 [nf_conntrack_netlink] nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink] ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink] ? nfnetlink_bind+0x1a0/0x1a0 [nfnetlink] netlink_rcv_skb+0xa4/0xc0 nfnetlink_rcv+0x87/0x770 [nfnetlink] Since the operations are unrelated to nf_ct_expect, so we can drop the _expect_lock. Also note, after removing the _expect_lock protection, another CPU may invoke nf_conntrack_helper_unregister, so we should use rcu_read_lock to protect __nf_conntrack_helper_find invoked by ctnetlink_change_helper. Fixes: `ca7433df3a` ("netfilter: conntrack: seperate expect locking from nf_conntrack_lock") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	14e5676156	netfilter: ctnetlink: drop the incorrect cthelper module request First, when creating a new ct, we will invoke request_module to try to load the related inkernel cthelper. So there's no need to call the request_module again when updating the ct helpinfo. Second, ctnetlink_change_helper may be called with rcu_read_lock held, i.e. rcu_read_lock -> nfqnl_recv_verdict -> nfqnl_ct_parse -> ctnetlink_glue_parse -> ctnetlink_glue_parse_ct -> ctnetlink_change_helper. But the request_module invocation may sleep, so we can't call it with the rcu_read_lock held. Remove it now. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	54a5f9d9ab	netfilter: nft_set_bitmap: free dummy elements when destroy the set We forget to free dummy elements when deleting the set. So when I was running nft-test.py, I saw many kmemleak warnings: kmemleak: 1344 new suspected memory leaks ... # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800631345c8 (size 32): comm "nft", pid 9075, jiffies 4295743309 (age 1354.815s) hex dump (first 32 bytes): f8 63 13 63 00 88 ff ff 88 79 13 63 00 88 ff ff .c.c.....y.c.... 04 0c 00 00 00 00 00 00 00 00 00 00 08 03 00 00 ................ backtrace: [<ffffffff819059da>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81288174>] __kmalloc+0x164/0x310 [<ffffffffa061269d>] nft_set_elem_init+0x3d/0x1b0 [nf_tables] [<ffffffffa06130da>] nft_add_set_elem+0x45a/0x8c0 [nf_tables] [<ffffffffa0613645>] nf_tables_newsetelem+0x105/0x1d0 [nf_tables] [<ffffffffa05fe6d4>] nfnetlink_rcv+0x414/0x770 [nfnetlink] [<ffffffff817f0ca6>] netlink_unicast+0x1f6/0x310 [<ffffffff817f10c6>] netlink_sendmsg+0x306/0x3b0 ... Fixes: `e920dde516` ("netfilter: nft_set_bitmap: keep a list of dummy elements") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:25 +02:00
Liping Zhang	66e5a6b18b	netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink cthelpers added via nfnetlink may have the same tuple, i.e. except for the l3proto and l4proto, other fields are all zero. So even with the different names, we will also fail to add them: # nfct helper add ssdp inet udp # nfct helper add tftp inet udp nfct v1.4.3: netlink error: File exists So in order to avoid unpredictable behaviour, we should: 1. cthelpers can be selected by nft ct helper obj or xt_CT target, so report error if duplicated { name, l3proto, l4proto } tuple exist. 2. cthelpers can be selected by nf_ct_tuple_src_mask_cmp when nf_ct_auto_assign_helper is enabled, so also report error if duplicated { l3proto, l4proto, src-port } tuple exist. Also note, if the cthelper is added from userspace, then the src-port will always be zero, it's invalid for nf_ct_auto_assign_helper, so there's no need to check the second point listed above. Fixes: `893e093c78` ("netfilter: nf_ct_helper: bail out on duplicated helpers") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:05 +02:00
Gao Feng	470acf55a0	netfilter: xt_CT: fix refcnt leak on error path There are two cases which cause a refcnt leak. 1. When nf_ct_timeout_ext_add failed in xt_ct_set_timeout, it should free the timeout refcnt. Now goto the err_put_timeout error handler instead of going ahead. 2. When the time policy is not found, we should call module_put. Otherwise, the related cthelper module cannot be removed anymore. It is easy to reproduce by typing the following command: # iptables -t raw -A OUTPUT -p tcp -j CT --helper ftp --timeout xxx Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:03:01 +02:00
Ingo Molnar	58d30c36d4	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Parallelize SRCU callback handling (plus overlapping patches). Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-04-23 11:12:44 +02:00
Gao Feng	122868b378	netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14 The window scale may be enlarged from 14 to 15 according to the IETF draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03. Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in the future. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	be7be6e161	netfilter: ipvs: fix incorrect conflict resolution The commit `ab8bc7ed86` ("netfilter: remove nf_ct_is_untracked") changed the line if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) { to if (ct && nfct_nat(ct)) { meanwhile, the commit `41390895e5` ("netfilter: ipvs: don't check for presence of nat extension") from ipvs-next had changed the same line to if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) { When ipvs-next got merged into nf-next, the merge resolution took the first version, dropping the conversion of nfct_nat(). While this doesn't cause a problem at the moment, it will once we stop adding the nat extension by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	01026edef9	nefilter: eache: reduce struct size from 32 to 24 byte Only "cache" needs to use ulong (it's used with set_bit()), missed can use u16. Also add build-time assertion to ensure event bits fit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	c6dd940b1f	netfilter: allow early drop of assured conntracks If insertion of a new conntrack fails because the table is full, the kernel searches the next buckets of the hash slot where the new connection was supposed to be inserted at for an entry that hasn't seen traffic in reply direction (non-assured), if it finds one, that entry is dropped and the new connection entry is allocated. Allow the conntrack gc worker to also remove assured conntracks if resources are low. Do this by querying the l4 tracker, e.g. tcp connections are now dropped if they are no longer established (e.g. in finwait). This could be refined further, e.g. by adding 'soft' established timeout (i.e., a timeout that is only used once we get close to resource exhaustion). Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	b3a5db109e	netfilter: conntrack: use u8 for extension sizes again commit `223b02d923` ("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len") had to increase size of the extension offsets because total size of the extensions had increased to a point where u8 did overflow. 3 years later we've managed to diet extensions a bit and we no longer need u16. Furthermore we can now add a compile-time assertion for this problem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
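The compile-time assertion mentioned in commit `b3a5db109e` can be illustrated with a small standalone sketch. It assumes C11 `static_assert` as a userspace substitute for the kernel's `BUILD_BUG_ON`; the structures, their sizes, and all `demo_`/`DEMO_` names are hypothetical and do not mirror the real conntrack extension layout (alignment padding between extensions is ignored here).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extension payloads; real conntrack extensions differ. */
struct demo_ext_a { uint32_t x[8]; };
struct demo_ext_b { uint64_t y[4]; };

/* Hypothetical header that stores per-extension offsets and total length as u8. */
struct demo_ext_hdr {
	uint8_t offset[2];
	uint8_t len;
};

/* Worst-case size if every extension is present (padding ignored). */
#define DEMO_TOTAL_SIZE (sizeof(struct demo_ext_hdr) + \
			 sizeof(struct demo_ext_a) +   \
			 sizeof(struct demo_ext_b))

/*
 * Fail the build, rather than overflow at run time, if the extensions
 * ever grow past what a u8 offset or length can describe.
 */
static_assert(DEMO_TOTAL_SIZE <= UINT8_MAX,
	      "extension area no longer fits in u8 offsets");

/* Helper so callers can inspect the computed worst-case size. */
static inline unsigned int demo_total_size(void)
{
	return (unsigned int)DEMO_TOTAL_SIZE;
}
```

Growing either payload so that the total exceeds 255 bytes makes the translation unit fail to compile, which is the failure mode the commit prefers over a silent u8 overflow.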
Florian Westphal	faec865db9	netfilter: remove last traces of variable-sized extensions get rid of the (now unused) nf_ct_ext_add_length define and also rename the function to plain nf_ct_ext_add(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	9f0f3ebeda	netfilter: helpers: remove data_len usage for inkernel helpers No need to track this for inkernel helpers anymore as NF_CT_HELPER_BUILD_BUG_ON checks do this now. All inkernel helpers know what kind of structure they stored in helper->data. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	157ffffeb5	netfilter: nfnetlink_cthelper: reject too large userspace allocation requests Userspace should not abuse the kernel to store large amounts of data, reject requests larger than the private area can accommodate. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00

4119 Commits (redonkable)