remarkable-linux

redonkable

Author	SHA1	Message	Date
Liping Zhang	9338d7b441	netfilter: nfnl_cthelper: reject del request if helper obj is in use We can still delete the ct helper even if it is in use, this will cause a use-after-free error. In more detail, I mean: # nfct helper add ssdp inet udp # iptables -t raw -A OUTPUT -p udp -j CT --helper ssdp # nfct helper delete ssdp //--> oops, succeed! BUG: unable to handle kernel paging request at 000026ca IP: 0x26ca [...] Call Trace: ? ipv4_helper+0x62/0x80 [nf_conntrack_ipv4] nf_hook_slow+0x21/0xb0 ip_output+0xe9/0x100 ? ip_fragment.constprop.54+0xc0/0xc0 ip_local_out+0x33/0x40 ip_send_skb+0x16/0x80 udp_send_skb+0x84/0x240 udp_sendmsg+0x35d/0xa50 So add reference count to fix this issue, if ct helper is used by others, reject the delete request. Apply this patch: # nfct helper delete ssdp nfct v1.4.3: netlink error: Device or resource busy Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:29 +02:00
Liping Zhang	d91fc59cd7	netfilter: introduce nf_conntrack_helper_put helper function And convert module_put invocation to nf_conntrack_helper_put, this is prepared for the followup patch, which will add a refcnt for cthelper, so we can reject the deleting request when cthelper is in use. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:29 +02:00
Liping Zhang	d110a3942a	netfilter: don't setup nat info for confirmed ct We cannot setup nat info if the ct has been confirmed already, else, different cpu may race to handle the same ct. In extreme situation, we may hit the "BUG_ON(nf_nat_initialized(ct, maniptype))" in the nf_nat_setup_info. Also running the following commands will easily hit NF_CT_ASSERT in nf_conntrack_alter_reply: # nft flush ruleset # ping -c 2 -W 1 1.1.1.111 & # nft add table t # nft add chain t c {type nat hook postrouting priority 0 \;} # nft add rule t c snat to 4.5.6.7 WARNING: CPU: 1 PID: 10065 at net/netfilter/nf_conntrack_core.c:1472 nf_conntrack_alter_reply+0x9a/0x1a0 [nf_conntrack] [...] Call Trace: nf_nat_setup_info+0xad/0x840 [nf_nat] ? deactivate_slab+0x65d/0x6c0 nft_nat_eval+0xcd/0x100 [nft_nat] nft_do_chain+0xff/0x5d0 [nf_tables] ? mark_held_locks+0x6f/0xa0 ? __local_bh_enable_ip+0x70/0xa0 ? trace_hardirqs_on_caller+0x11f/0x190 ? ipt_do_table+0x310/0x610 ? trace_hardirqs_on+0xd/0x10 ? __local_bh_enable_ip+0x70/0xa0 ? ipt_do_table+0x32b/0x610 ? __lock_acquire+0x2ac/0x1580 ? ipt_do_table+0x32b/0x610 nft_nat_do_chain+0x65/0x80 [nft_chain_nat_ipv4] nf_nat_ipv4_fn+0x1ae/0x240 [nf_nat_ipv4] nf_nat_ipv4_out+0x4a/0xf0 [nf_nat_ipv4] nft_nat_ipv4_out+0x15/0x20 [nft_chain_nat_ipv4] nf_hook_slow+0x2c/0xf0 ip_output+0x154/0x270 So for the confirmed ct, just ignore it and return NF_ACCEPT. Fixes: `9a08ecfe74` ("netfilter: don't attach a nat extension by default") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:42:28 +02:00
Matthias Kaehlcke	a2b7cbdd25	netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch Not all parameters passed to ctnetlink_parse_tuple() and ctnetlink_exp_dump_tuple() match the enum type in the signatures of these functions. Since this is intended change the argument type of to be an unsigned integer value. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-15 12:10:27 +02:00
Linus Torvalds	de4d195308	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes are: - Debloat RCU headers - Parallelize SRCU callback handling (plus overlapping patches) - Improve the performance of Tree SRCU on a CPU-hotplug stress test - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Open-code the rcu_cblist_n_lazy_cbs() function rcu: Open-code the rcu_cblist_n_cbs() function rcu: Open-code the rcu_cblist_empty() function rcu: Separately compile large rcu_segcblist functions srcu: Debloat the <linux/rcu_segcblist.h> header srcu: Adjust default auto-expediting holdoff srcu: Specify auto-expedite holdoff time srcu: Expedite first synchronize_srcu() when idle srcu: Expedited grace periods with reduced memory contention srcu: Make rcutorture writer stalls print SRCU GP state srcu: Exact tracking of srcu_data structures containing callbacks srcu: Make SRCU be built by default srcu: Fix Kconfig botch when SRCU not selected rcu: Make non-preemptive schedule be Tasks RCU quiescent state srcu: Expedite srcu_schedule_cbs_snp() callback invocation srcu: Parallelize callback handling kvm: Move srcu_struct fields to end of struct kvm rcu: Fix typo in PER_RCU_NODE_PERIOD header comment rcu: Use true/false in assignment to bool rcu: Use bool value directly ...	2017-05-10 10:30:46 -07:00
Michal Hocko	19809c2da2	mm, vmalloc: use __GFP_HIGHMEM implicitly __vmalloc* allows users to provide gfp flags for the underlying allocation. This API is quite popular $ git grep "=[[:space:]]__vmalloc\\|return[[:space:]]*__vmalloc" \| wc -l 77 The only problem is that many people are not aware that they really want to give __GFP_HIGHMEM along with other flags because there is really no reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages which are mapped to the kernel vmalloc space. About half of users don't use this flag, though. This signals that we make the API unnecessarily too complex. This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM are simplified and drop the flag. Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Cristopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Julian Anastasov	3c5ab3f395	ipvs: SNAT packet replies only for NATed connections We do not check if packet from real server is for NAT connection before performing SNAT. This causes problems for setups that use DR/TUN and allow local clients to access the real server directly, for example: - local client in director creates IPVS-DR/TUN connection CIP->VIP and the request packets are routed to RIP. Talks are finished but IPVS connection is not expired yet. - second local client creates non-IPVS connection CIP->RIP with same reply tuple RIP->CIP and when replies are received on LOCAL_IN we wrongly assign them for the first client connection because RIP->CIP matches the reply direction. As result, IPVS SNATs replies for non-IPVS connections. The problem is more visible to local UDP clients but in rare cases it can happen also for TCP or remote clients when the real server sends the reply traffic via the director. So, better to be more precise for the reply traffic. As replies are not expected for DR/TUN connections, better to not touch them. Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk> Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-05-08 11:38:35 +02:00
Linus Torvalds	4ac4d58488	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) The wireless rate info fix from Johannes Berg. 2) When a RAW socket is in hdrincl mode, we need to make sure that the user provided at least a minimally sized ipv4/ipv6 header. Fix from Alexander Potapenko. 3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using nla_put_string() so that it is NULL terminated. 4) Fix a bug in TCP fastopen handling, wherein child sockets erroneously inherit the fastopen_req from the parent, and later can end up derefencing freed memory or doing a double free. From Eric Dumazet. 5) Don't clear out netdev stats at close time in tg3 driver, from YueHaibing. 6) Fix refcount leak in xt_CT, from Gao Feng. 7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang. 8) Fix deadlock due to taking the expectation lock twice, also from Liping Zhang. 9) Make xt_socket work again with ipv6, from Peter Tirsek. 10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo Abeni. 11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry layout. From Jesper Dangaard Brouer. 12) Fix ethtool reported device name in aquantia driver, from Pavel Belous. 13) Fix build failures due to the compile time size test not working in netfilter conntrack. From Geert Uytterhoeven. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) cfg80211: make RATE_INFO_BW_20 the default ipv6: initialize route null entry in addrconf_init() qede: Fix possible misconfiguration of advertised autoneg value. qed: Fix overriding of supported autoneg value. qed*: Fix possible overflow for status block id field. rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string netvsc: make sure napi enabled before vmbus_open aquantia: Fix driver name reported by ethtool ipv4, ipv6: ensure raw socket message is big enough to hold an IP header net/sched: remove redundant null check on head tcp: do not inherit fastopen_req from parent forcedeth: remove unnecessary carrier status check ibmvnic: Move queue restarting in ibmvnic_tx_complete ibmvnic: Record SKB RX queue during poll ibmvnic: Continue skb processing after skb completion error ibmvnic: Check for driver reset first in ibmvnic_xmit ibmvnic: Wait for any pending scrqs entries at driver close ibmvnic: Clean up tx pools when closing ibmvnic: Whitespace correction in release_rx_pools ibmvnic: Delete napi's when releasing driver resources ...	2017-05-04 12:26:43 -07:00
Linus Torvalds	46f0537b1e	Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit Pull audit updates from Paul Moore: "Fourteen audit patches for v4.12 that span the full range of fixes, new features, and internal cleanups. We have a patches to move to 64-bit timestamps, convert refcounts from atomic_t to refcount_t, track PIDs using the pid struct instead of pid_t, convert our own private audit buffer cache to a standard kmem_cache, log kernel module names when they are unloaded, and normalize the NETFILTER_PKT to make the userspace folks happier. From a fixes perspective, the most important is likely the auditd connection tracking RCU fix; it was a rather brain dead bug that I'll take the blame for, but thankfully it didn't seem to affect many people (only one report). I think the patch subject lines and commit descriptions do a pretty good job of explaining the details and why the changes are important so I'll point you there instead of duplicating it here; as usual, if you have any questions you know where to find us. We also manage to take out more code than we put in this time, that always makes me happy :)" * 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit: audit: fix the RCU locking for the auditd_connection structure audit: use kmem_cache to manage the audit_buffer cache audit: Use timespec64 to represent audit timestamps audit: store the auditd PID as a pid struct instead of pid_t audit: kernel generated netlink traffic should have a portid of 0 audit: combine audit_receive() and audit_receive_skb() audit: convert audit_watch.count from atomic_t to refcount_t audit: convert audit_tree.count from atomic_t to refcount_t audit: normalize NETFILTER_PKT netfilter: use consistent ipv4 network offset in xt_AUDIT audit: log module name on delete_module audit: remove unnecessary semicolon in audit_watch_handle_event() audit: remove unnecessary semicolon in audit_mark_handle_event() audit: remove unnecessary semicolon in audit_field_valid()	2017-05-03 09:21:59 -07:00
David S. Miller	4d89ac2dd5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS/OVS fixes for net The following patchset contains a rather large batch of Netfilter, IPVS and OVS fixes for your net tree. This includes fixes for ctnetlink, the userspace conntrack helper infrastructure, conntrack OVS support, ebtables DNAT target, several leaks in error path among other. More specifically, they are: 1) Fix reference count leak in the CT target error path, from Gao Feng. 2) Remove conntrack entry clashing with a matching expectation, patch from Jarno Rajahalme. 3) Fix bogus EEXIST when registering two different userspace helpers, from Liping Zhang. 4) Don't leak dummy elements in the new bitmap set type in nf_tables, from Liping Zhang. 5) Get rid of module autoload from conntrack update path in ctnetlink, we don't need autoload at this late stage and it is happening with rcu read lock held which is not good. From Liping Zhang. 6) Fix deadlock due to double-acquire of the expect_lock from conntrack update path, this fixes a bug that was introduced when the central spinlock got removed. Again from Liping Zhang. 7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock protection that was selected when the central spinlock was removed was not really protecting anything at all. 8) Protect sequence adjustment under ct->lock. 9) Missing socket match with IPv6, from Peter Tirsek. 10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from Linus Luessing. 11) Don't give up on evaluating the expression on new entries added via dynset expression in nf_tables, from Liping Zhang. 12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals with non-linear skbuffs. 13) Don't allow IPv6 service in IPVS if no IPv6 support is available, from Paolo Abeni. 14) Missing mutex release in error path of xt_find_table_lock(), from Dan Carpenter. 15) Update maintainers files, Netfilter section. Add Florian to the file, refer to nftables.org and change project status from Supported to Maintained. 16) Bail out on mismatching extensions in element updates in nf_tables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-03 10:11:26 -04:00
Geert Uytterhoeven	ab71632c45	netfilter: conntrack: Force inlining of build check to prevent build failure If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the build will fail with: net/built-in.o: In function `nf_conntrack_init_start': (.text+0x9baf6): undefined reference to `__compiletime_assert_1893' or ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined! Fix this by forcing inlining of total_extension_size(). Fixes: `b3a5db109e` ("netfilter: conntrack: use u8 for extension sizes again") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-03 09:51:26 -04:00
Pablo Neira Ayuso	9744a6fcef	netfilter: nf_tables: check if same extensions are set when adding elements If no NLM_F_EXCL is set and the element already exists in the set, make sure that both elements have the same extensions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-03 10:58:00 +02:00
Richard Guy Briggs	2173c519d5	audit: normalize NETFILTER_PKT Eliminate flipping in and out of message fields, dropping fields in the process. Sample raw message format IPv4 UDP: type=NETFILTER_PKT msg=audit(1487874761.386:228): mark=0xae8a2732 saddr=127.0.0.1 daddr=127.0.0.1 proto=17^] Sample raw message format IPv6 ICMP6: type=NETFILTER_PKT msg=audit(1487874761.381:227): mark=0x223894b7 saddr=::1 daddr=::1 proto=58^] Issue: https://github.com/linux-audit/audit-kernel/issues/11 Test case: https://github.com/linux-audit/audit-testsuite/issues/43 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2017-05-02 10:16:04 -04:00
Richard Guy Briggs	0cb88b6ff0	netfilter: use consistent ipv4 network offset in xt_AUDIT Even though the skb->data pointer has been moved from the link layer header to the network layer header, use the same method to calculate the offset in ipv4 and ipv6 routines. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: munged subject line] Signed-off-by: Paul Moore <paul@paul-moore.com>	2017-05-02 10:16:04 -04:00
David S. Miller	a01aa920b8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree. A large bunch of code cleanups, simplify the conntrack extension codebase, get rid of the fake conntrack object, speed up netns by selective synchronize_net() calls. More specifically, they are: 1) Check for ct->status bit instead of using nfct_nat() from IPVS and Netfilter codebase, patch from Florian Westphal. 2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao. 3) Simplify FTP IPVS helper module registration path, from Arushi Singhal. 4) Introduce nft_is_base_chain() helper function. 5) Enforce expectation limit from userspace conntrack helper, from Gao Feng. 6) Add nf_ct_remove_expect() helper function, from Gao Feng. 7) NAT mangle helper function return boolean, from Gao Feng. 8) ctnetlink_alloc_expect() should only work for conntrack with helpers, from Gao Feng. 9) Add nfnl_msg_type() helper function to nfnetlink to build the netlink message type. 10) Get rid of unnecessary cast on void, from simran singhal. 11) Use seq_puts()/seq_putc() instead of seq_printf() where possible, also from simran singhal. 12) Use list_prev_entry() from nf_tables, from simran signhal. 13) Remove unnecessary & on pointer function in the Netfilter and IPVS code. 14) Remove obsolete comment on set of rules per CPU in ip6_tables, no longer true. From Arushi Singhal. 15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng. 16) Remove unnecessary nested rcu_read_lock() in __nf_nat_decode_session(). Code running from hooks are already guaranteed to run under RCU read side. 17) Remove deadcode in nf_tables_getobj(), from Aaron Conole. 18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(), also from Aaron. 19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole. 20) Don't propagate NF_DROP error to userspace via ctnetlink in __nf_nat_alloc_null_binding() function, from Gao Feng. 21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks, from Gao Feng. 22) Kill the fake untracked conntrack objects, use ctinfo instead to annotate a conntrack object is untracked, from Florian Westphal. 23) Remove nf_ct_is_untracked(), now obsolete since we have no conntrack template anymore, from Florian. 24) Add event mask support to nft_ct, also from Florian. 25) Move nf_conn_help structure to include/net/netfilter/nf_conntrack_helper.h. 26) Add a fixed 32 bytes scratchpad area for conntrack helpers. Thus, we don't deal with variable conntrack extensions anymore. Make sure userspace conntrack helper doesn't go over that size. Remove variable size ct extension infrastructure now this code got no more clients. From Florian Westphal. 27) Restore offset and length of nf_ct_ext structure to 8 bytes now that wraparound is not possible any longer, also from Florian. 28) Allow to get rid of unassured flows under stress in conntrack, this applies to DCCP, SCTP and TCP protocols, from Florian. 29) Shrink size of nf_conntrack_ecache structure, from Florian. 30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker, from Gao Feng. 31) Register SYNPROXY hooks on demand, from Florian Westphal. 32) Use pernet hook whenever possible, instead of global hook registration, from Florian Westphal. 33) Pass hook structure to ebt_register_table() to consolidate some infrastructure code, from Florian Westphal. 34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the SYNPROXY code, to make sure device stats are not fooled, patch from Gao Feng. 35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we don't need anymore if we just select a fixed size instead of expensive runtime time calculation of this. From Florian. 36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(), from Florian. 37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from Florian. 38) Attach NAT extension on-demand from masquerade and pptp helper path, from Florian. 39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole. 40) Speed up netns by selective calls of synchronize_net(), from Florian Westphal. 41) Silence stack size warning gcc in 32-bit arch in snmp helper, from Florian. 42) Inconditionally call nf_ct_ext_destroy(), even if we have no extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-05-01 10:47:53 -04:00
Liping Zhang	8eeef23504	netfilter: nf_ct_ext: invoke destroy even when ext is not attached For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table, then remove it from the nat_bysource_table via nat_extend->destroy. But now, the nat extension is attached on demand, so if the nat extension is not attached, we will not be notified when the ct is destroyed, i.e. we may fail to remove ct from the nat_bysource_table. So just keep it simple, even if the extension is not attached, we will still invoke the related ext->destroy. And this will also preserve the flexibility for the future extension. Fixes: `9a08ecfe74` ("netfilter: don't attach a nat extension by default") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:48:49 +02:00
Pablo Neira Ayuso	d1908ca8dc	Merge tag 'ipvs3-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Third Round of IPVS Updates for v4.12 please consider these enhancements to IPVS for v4.12. If it is too late for v4.12 then please consider them for v4.13. * Remove unused function * Correct comparison of unsigned value ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:46:50 +02:00
Florian Westphal	039b40ee58	netfilter: nf_queue: only call synchronize_net twice if nf_queue is active nf_unregister_net_hook(s) can avoid a second call to synchronize_net, provided there is no nfqueue active in that net namespace (which is the common case). This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally this gets called during netns cleanup so no packets should be queued. For the rare case of base chain being unregistered or module removal while nfqueue is in use the extra hiccup due to the packet drops isn't a big deal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:19:12 +02:00
Florian Westphal	c83fa19603	netfilter: nf_log: don't call synchronize_rcu in nf_log_unset nf_log_unregister() (which is what gets called in the logger backends module exit paths) does a (required, module is removed) synchronize_rcu(). But nf_log_unset() is only called from pernet exit handlers. It doesn't free any memory so there appears to be no need to call synchronize_rcu. v2: Liping Zhang points out that nf_log_unregister() needs to be called after pernet unregister, else rmmod would become unsafe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:19:07 +02:00
Florian Westphal	933bd83ed6	netfilter: batch synchronize_net calls during hook unregister synchronize_net is expensive and slows down netns cleanup a lot. We have two APIs to unregister a hook: nf_unregister_net_hook (which calls synchronize_net()) and nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop) Make nf_unregister_net_hook a wapper around new helper __nf_unregister_net_hook, which unlinks the hook but does not free it. Then, we can call that helper in nf_unregister_net_hooks and then call synchronize_net() only once. Andrey Konovalov reports this change improves syzkaller fuzzing speed at least twice. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-05-01 11:18:54 +02:00
Pablo Neira Ayuso	1a41dbce0d	Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== IPVS Fixes for v4.11 I would also like it considered for stable. * Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled to avoid oops caused by IPVS accesing IPv6 routing code in such circumstances. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 16:22:11 +02:00
Dan Carpenter	7dde07e9c5	netfilter: x_tables: unlock on error in xt_find_table_lock() According to my static checker we should unlock here before the return. That seems reasonable to me as well. Fixes" `b9e69e1273` ("netfilter: xtables: don't hook tables by default") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-28 15:49:48 +02:00
Paolo Abeni	1442f6f7c1	ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled When creating a new ipvs service, ipv6 addresses are always accepted if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family is not explicitly checked. This allows the user-space to configure ipvs services even if the system is booted with ipv6.disable=1. On specific configuration, ipvs can try to call ipv6 routing code at setup time, causing the kernel to oops due to fib6_rules_ops being NULL. This change addresses the issue adding a check for the ipv6 module being enabled while validating ipv6 service operations and adding the same validation for dest operations. According to git history, this issue is apparently present since the introduction of ipv6 support, and the oops can be triggered since commit `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Fixes: `09571c7ae3` ("IPVS: Add function to determine if IPv6 address is local") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:04:35 +02:00
Aaron Conole	fb90e8dedb	ipvs: change comparison on sync_refresh_period The sync_refresh_period variable is unsigned, so it can never be < 0. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Aaron Conole	65ba101ebc	ipvs: remove unused function ip_vs_set_state_timeout There are no in-tree callers of this function and it isn't exported. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-04-28 12:00:10 +02:00
Florian Westphal	9a08ecfe74	netfilter: don't attach a nat extension by default nowadays the NAT extension only stores the interface index (used to purge connections that got masqueraded when interface goes down) and pptp nat information. Previous patches moved nf_ct_nat_ext_add to those places that need it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	2fe7c321ab	netfilter: pptp: attach nat extension when needed make sure nat extension gets added if the master conntrack is subject to NAT. This will be required once the nat core stops adding it by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	22d4536d2c	netfilter: conntrack: handle initial extension alloc via krealloc krealloc(NULL, ..) is same as kmalloc(), so we can avoid special-casing the initial allocation after the prealloc removal (we had to use ->alloc_len as the initial allocation size). This also means we do not zero the preallocated memory anymore; only offsets[]. Existing code makes sure the new (used) extension space gets zeroed out. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	23f671a1b5	netfilter: conntrack: mark extension structs as const Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	54044b1f02	netfilter: conntrack: remove prealloc support It was used by the nat extension, but since commit `7c96643519` ("netfilter: move nat hlist_head to nf_conn") its only needed for connections that use MASQUERADE target or a nat helper. Also it seems a lot easier to preallocate a fixed size instead. With default settings, conntrack first adds ecache extension (sysctl defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte on x86_64 for initial allocation. Followup patches can constify the extension structs and avoid the initial zeroing of the entire extension area. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:22 +02:00
Florian Westphal	efe4160618	ipvs: convert to use pernet nf_hook api nf_(un)register_hooks has to maintain an internal hook list to add/remove those hooks from net namespaces as they are added/deleted. ipvs already uses pernet_ops, so we can switch to the (more recent) pernet hook api instead. Compile tested only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-26 09:30:21 +02:00
Liping Zhang	277a292835	netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded Currently, after adding the following nft rules: # nft add set x target1 { type ipv4_addr \; flags timeout \;} # nft add rule x y set add ip daddr timeout 1d @target1 counter the counters will always be zero despite of the elements are added to the dynamic set "target1" or not, as we will break the nft expr traversal unconditionally: # nft list ruleset ... set target1 { ... elements = { 8.8.8.8 expires 23h59m53s} } chain output { ... set add ip daddr timeout 1d @target1 counter packets 0 bytes 0 ^ ^ ... } Since we add the elements to the set successfully, we should continue to the next expression. Additionally, if elements are added to "flow table" successfully, we will _always_ continue to the next expr, even if the operation is _OP_ADD. So it's better to keep them to be consistent. Fixes: `22fe54d5fe` ("netfilter: nf_tables: add support for dynamic set updates") Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-25 11:10:37 +02:00
Peter Tirsek	6bd3d19292	netfilter: xt_socket: Fix broken IPv6 handling Commit `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") used the outdated XT_SOCKET_HAVE_IPV6 macro which was removed earlier in commit `8db4c5be88` ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c"). With that macro never being defined, the xt_socket match emits an "Unknown family 10" warning when used with IPv6: WARNING: CPU: 0 PID: 1377 at net/netfilter/xt_socket.c:160 socket_mt_enable_defrag+0x47/0x50 [xt_socket] Unknown family 10 Modules linked in: xt_socket nf_socket_ipv4 nf_socket_ipv6 nf_defrag_ipv4 [...] CPU: 0 PID: 1377 Comm: ip6tables-resto Not tainted 4.10.10 #1 Hardware name: [...] Call Trace: ? __warn+0xe7/0x100 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? warn_slowpath_fmt+0x39/0x40 ? socket_mt_enable_defrag+0x47/0x50 [xt_socket] ? socket_mt_v2_check+0x12/0x40 [xt_socket] ? xt_check_match+0x6b/0x1a0 [x_tables] ? xt_find_match+0x93/0xd0 [x_tables] ? xt_request_find_match+0x20/0x80 [x_tables] ? translate_table+0x48e/0x870 [ip6_tables] ? translate_table+0x577/0x870 [ip6_tables] ? walk_component+0x3a/0x200 ? kmalloc_order+0x1d/0x50 ? do_ip6t_set_ctl+0x181/0x490 [ip6_tables] ? filename_lookup+0xa5/0x120 ? nf_setsockopt+0x3a/0x60 ? ipv6_setsockopt+0xb0/0xc0 ? sock_common_setsockopt+0x23/0x30 ? SyS_socketcall+0x41d/0x630 ? vfs_read+0xfa/0x120 ? do_fast_syscall_32+0x7a/0x110 ? entry_SYSENTER_32+0x47/0x71 This patch brings the conditional back in line with how the rest of the file handles IPv6. Fixes: `834184b1f3` ("netfilter: defrag: only register defrag functionality if needed") Signed-off-by: Peter Tirsek <peter@tirsek.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	64f3967c7a	netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj We should acquire the ct->lock before accessing or modifying the nf_ct_seqadj, as another CPU may modify the nf_ct_seqadj at the same time during its packet proccessing. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:29 +02:00
Liping Zhang	53b56da83d	netfilter: ctnetlink: make it safer when updating ct->status After converting to use rcu for conntrack hash, one CPU may update the ct->status via ctnetlink, while another CPU may process the packets and update the ct->status. So the non-atomic operation "ct->status \|= status;" via ctnetlink becomes unsafe, and this may clear the IPS_DYING_BIT bit set by another CPU unexpectedly. For example: CPU0 CPU1 ctnetlink_change_status __nf_conntrack_find_get old = ct->status nf_ct_gc_expired - nf_ct_kill - test_and_set_bit(IPS_DYING_BIT new = old \| status; - ct->status = new; <-- oops, _DYING_ is cleared! Now using a series of atomic bit operation to solve the above issue. Also note, user shouldn't set IPS_TEMPLATE, IPS_SEQ_ADJUST directly, so make these two bits be unchangable too. If we set the IPS_TEMPLATE_BIT, ct will be freed by nf_ct_tmpl_free, but actually it is alloced by nf_conntrack_alloc. If we set the IPS_SEQ_ADJUST_BIT, this may cause the NULL pointer deference, as the nfct_seqadj(ct) maybe NULL. Last, add some comments to describe the logic change due to the commit `a963d710f3` ("netfilter: ctnetlink: Fix regression in CTA_STATUS processing"), which makes me feel a little confusing. Fixes: `76507f69c4` ("[NETFILTER]: nf_conntrack: use RCU for conntrack hash") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	88be4c09d9	netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice Currently, ctnetlink_change_conntrack is always protected by _expect_lock, but this will cause a deadlock when deleting the helper from a conntrack, as the _expect_lock will be acquired again by nf_ct_remove_expectations: CPU0 ---- lock(nf_conntrack_expect_lock); lock(nf_conntrack_expect_lock); * DEADLOCK * May be due to missing lock nesting notation 2 locks held by lt-conntrack_gr/12853: #0: (&table[i].mutex){+.+.+.}, at: [<ffffffffa05e2009>] nfnetlink_rcv_msg+0x399/0x6a9 [nfnetlink] #1: (nf_conntrack_expect_lock){+.....}, at: [<ffffffffa05f2c1f>] ctnetlink_new_conntrack+0x17f/0x408 [nf_conntrack_netlink] Call Trace: dump_stack+0x85/0xc2 __lock_acquire+0x1608/0x1680 ? ctnetlink_parse_tuple_proto+0x10f/0x1c0 [nf_conntrack_netlink] lock_acquire+0x100/0x1f0 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] _raw_spin_lock_bh+0x3f/0x50 ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] nf_ct_remove_expectations+0x32/0x90 [nf_conntrack] ctnetlink_change_helper+0xc6/0x190 [nf_conntrack_netlink] ctnetlink_new_conntrack+0x1b2/0x408 [nf_conntrack_netlink] nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink] ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink] ? nfnetlink_bind+0x1a0/0x1a0 [nfnetlink] netlink_rcv_skb+0xa4/0xc0 nfnetlink_rcv+0x87/0x770 [nfnetlink] Since the operations are unrelated to nf_ct_expect, so we can drop the _expect_lock. Also note, after removing the _expect_lock protection, another CPU may invoke nf_conntrack_helper_unregister, so we should use rcu_read_lock to protect __nf_conntrack_helper_find invoked by ctnetlink_change_helper. Fixes: `ca7433df3a` ("netfilter: conntrack: seperate expect locking from nf_conntrack_lock") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	14e5676156	netfilter: ctnetlink: drop the incorrect cthelper module request First, when creating a new ct, we will invoke request_module to try to load the related inkernel cthelper. So there's no need to call the request_module again when updating the ct helpinfo. Second, ctnetlink_change_helper may be called with rcu_read_lock held, i.e. rcu_read_lock -> nfqnl_recv_verdict -> nfqnl_ct_parse -> ctnetlink_glue_parse -> ctnetlink_glue_parse_ct -> ctnetlink_change_helper. But the request_module invocation may sleep, so we can't call it with the rcu_read_lock held. Remove it now. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:06:28 +02:00
Liping Zhang	54a5f9d9ab	netfilter: nft_set_bitmap: free dummy elements when destroy the set We forget to free dummy elements when deleting the set. So when I was running nft-test.py, I saw many kmemleak warnings: kmemleak: 1344 new suspected memory leaks ... # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800631345c8 (size 32): comm "nft", pid 9075, jiffies 4295743309 (age 1354.815s) hex dump (first 32 bytes): f8 63 13 63 00 88 ff ff 88 79 13 63 00 88 ff ff .c.c.....y.c.... 04 0c 00 00 00 00 00 00 00 00 00 00 08 03 00 00 ................ backtrace: [<ffffffff819059da>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81288174>] __kmalloc+0x164/0x310 [<ffffffffa061269d>] nft_set_elem_init+0x3d/0x1b0 [nf_tables] [<ffffffffa06130da>] nft_add_set_elem+0x45a/0x8c0 [nf_tables] [<ffffffffa0613645>] nf_tables_newsetelem+0x105/0x1d0 [nf_tables] [<ffffffffa05fe6d4>] nfnetlink_rcv+0x414/0x770 [nfnetlink] [<ffffffff817f0ca6>] netlink_unicast+0x1f6/0x310 [<ffffffff817f10c6>] netlink_sendmsg+0x306/0x3b0 ... Fixes: `e920dde516` ("netfilter: nft_set_bitmap: keep a list of dummy elements") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:25 +02:00
Liping Zhang	66e5a6b18b	netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink cthelpers added via nfnetlink may have the same tuple, i.e. except for the l3proto and l4proto, other fields are all zero. So even with the different names, we will also fail to add them: # nfct helper add ssdp inet udp # nfct helper add tftp inet udp nfct v1.4.3: netlink error: File exists So in order to avoid unpredictable behaviour, we should: 1. cthelpers can be selected by nft ct helper obj or xt_CT target, so report error if duplicated { name, l3proto, l4proto } tuple exist. 2. cthelpers can be selected by nf_ct_tuple_src_mask_cmp when nf_ct_auto_assign_helper is enabled, so also report error if duplicated { l3proto, l4proto, src-port } tuple exist. Also note, if the cthelper is added from userspace, then the src-port will always be zero, it's invalid for nf_ct_auto_assign_helper, so there's no need to check the second point listed above. Fixes: `893e093c78` ("netfilter: nf_ct_helper: bail out on duplicated helpers") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:05:05 +02:00
Gao Feng	470acf55a0	netfilter: xt_CT: fix refcnt leak on error path There are two cases which causes refcnt leak. 1. When nf_ct_timeout_ext_add failed in xt_ct_set_timeout, it should free the timeout refcnt. Now goto the err_put_timeout error handler instead of going ahead. 2. When the time policy is not found, we should call module_put. Otherwise, the related cthelper module cannot be removed anymore. It is easy to reproduce by typing the following command: # iptables -t raw -A OUTPUT -p tcp -j CT --helper ftp --timeout xxx Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-24 20:03:01 +02:00
Ingo Molnar	58d30c36d4	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Parallelize SRCU callback handling (plus overlapping patches). Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-04-23 11:12:44 +02:00
Gao Feng	122868b378	netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14 The window scale may be enlarged from 14 to 15 according to the itef draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03. Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in the future. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	be7be6e161	netfilter: ipvs: fix incorrect conflict resolution The commit `ab8bc7ed86` ("netfilter: remove nf_ct_is_untracked") changed the line if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) { to if (ct && nfct_nat(ct)) { meanwhile, the commit `41390895e5` ("netfilter: ipvs: don't check for presence of nat extension") from ipvs-next had changed the same line to if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) { When ipvs-next got merged into nf-next, the merge resolution took the first version, dropping the conversion of nfct_nat(). While this doesn't cause a problem at the moment, it will once we stop adding the nat extension by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	01026edef9	nefilter: eache: reduce struct size from 32 to 24 byte Only "cache" needs to use ulong (its used with set_bit()), missed can use u16. Also add build-time assertion to ensure event bits fit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	c6dd940b1f	netfilter: allow early drop of assured conntracks If insertion of a new conntrack fails because the table is full, the kernel searches the next buckets of the hash slot where the new connection was supposed to be inserted at for an entry that hasn't seen traffic in reply direction (non-assured), if it finds one, that entry is is dropped and the new connection entry is allocated. Allow the conntrack gc worker to also remove assured conntracks if resources are low. Do this by querying the l4 tracker, e.g. tcp connections are now dropped if they are no longer established (e.g. in finwait). This could be refined further, e.g. by adding 'soft' established timeout (i.e., a timeout that is only used once we get close to resource exhaustion). Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	b3a5db109e	netfilter: conntrack: use u8 for extension sizes again commit `223b02d923` ("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len") had to increase size of the extension offsets because total size of the extensions had increased to a point where u8 did overflow. 3 years later we've managed to diet extensions a bit and we no longer need u16. Furthermore we can now add a compile-time assertion for this problem. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	faec865db9	netfilter: remove last traces of variable-sized extensions get rid of the (now unused) nf_ct_ext_add_length define and also rename the function to plain nf_ct_ext_add(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	9f0f3ebeda	netfilter: helpers: remove data_len usage for inkernel helpers No need to track this for inkernel helpers anymore as NF_CT_HELPER_BUILD_BUG_ON checks do this now. All inkernel helpers know what kind of structure they stored in helper->data. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	157ffffeb5	netfilter: nfnetlink_cthelper: reject too large userspace allocation requests Userspace should not abuse the kernel to store large amounts of data, reject requests larger than the private area can accommodate. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:17 +02:00
Florian Westphal	dcf67740f2	netfilter: helper: add build-time asserts for helper data size add a 32 byte scratch area in the helper struct instead of relying on variable sized helpers plus compile-time asserts to let us know if 32 bytes aren't enough anymore. Not having variable sized helpers will later allow to add BUILD_BUG_ON for the total size of conntrack extensions -- the helper extension is the only one that doesn't have a fixed size. The (useless!) NF_CT_HELPER_BUILD_BUG_ON(0); are added so that in case someone adds a new helper and copy-pastes from one that doesn't store private data at least some indication that this macro should be used somehow is there... Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:16 +02:00
Florian Westphal	694a0055f0	netfilter: nft_ct: allow to set ctnetlink event types of a connection By default the kernel emits all ctnetlink events for a connection. This allows to select the types of events to generate. This can be used to e.g. only send DESTROY events but no NEW/UPDATE ones and will work even if sysctl net.netfilter.nf_conntrack_events is set to 0. This was already possible via iptables' CT target, but the nft version has the advantage that it can also be used with already-established conntracks. The added nf_ct_is_template() check isn't a bug fix as we only support mark and labels (and unlike ecache the conntrack core doesn't copy those). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-19 17:55:16 +02:00
Paul E. McKenney	5f0d5a3ae7	mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU A group of Linux kernel hackers reported chasing a bug that resulted from their assumption that SLAB_DESTROY_BY_RCU provided an existence guarantee, that is, that no block from such a slab would be reallocated during an RCU read-side critical section. Of course, that is not the case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire slab of blocks. However, there is a phrase for this, namely "type safety". This commit therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order to avoid future instances of this sort of confusion. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> [ paulmck: Add comments mentioning the old name, as requested by Eric Dumazet, in order to help people familiar with the old name find the new one. ] Acked-by: David Rientjes <rientjes@google.com>	2017-04-18 11:42:36 -07:00
David S. Miller	6b6cbc1471	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts were simply overlapping changes. In the net/ipv4/route.c case the code had simply moved around a little bit and the same fix was made in both 'net' and 'net-next'. In the net/sched/sch_generic.c case a fix in 'net' happened at the same time that a new argument was added to qdisc_hash_add(). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-15 21:16:30 -04:00
Florian Westphal	ab8bc7ed86	netfilter: remove nf_ct_is_untracked This function is now obsolete and always returns false. This change has no effect on generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-15 11:51:33 +02:00
Florian Westphal	cc41c84b7e	netfilter: kill the fake untracked conntrack objects resurrect an old patch from Pablo Neira to remove the untracked objects. Currently, there are four possible states of an skb wrt. conntrack. 1. No conntrack attached, ct is NULL. 2. Normal (kmem cache allocated) ct attached. 3. a template (kmalloc'd), not in any hash tables at any point in time 4. the 'untracked' conntrack, a percpu nf_conn object, tagged via IPS_UNTRACKED_BIT in ct->status. Untracked is supposed to be identical to case 1. It exists only so users can check -m conntrack --ctstate UNTRACKED vs. -m conntrack --ctstate INVALID e.g. attempts to set connmark on INVALID or UNTRACKED conntracks is supposed to be a no-op. Thus currently we need to check ct == NULL \|\| nf_ct_is_untracked(ct) in a lot of places in order to avoid altering untracked objects. The other consequence of the percpu untracked object is that all -j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op (inc/dec the untracked conntracks refcount). This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to make the distinction instead. The (few) places that care about packet invalid (ct is NULL) vs. packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED, but all other places can omit the nf_ct_is_untracked() check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-15 11:47:57 +02:00
Gao Feng	6e354a5e56	netfilter: ecache: Refine the nf_ct_deliver_cached_events 1. Remove single !events condition check to deliver the missed event even though there is no new event happened. Consider this case: 1) nf_ct_deliver_cached_events is invoked at the first time, the event is failed to deliver, then the missed is set. 2) nf_ct_deliver_cached_events is invoked again, but there is no any new event happened. The missed event is lost really. It would try to send the missed event again after remove this check. And it is ok if there is no missed event because the latter check !((events \| missed) & e->ctmask) could avoid it. 2. Correct the return value check of notify->fcn. When send the event successfully, it returns 0, not postive value. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-15 11:43:49 +02:00
Gao Feng	7025bac47f	netfilter: nf_nat: Fix return NF_DROP in nfnetlink_parse_nat_setup The __nf_nat_alloc_null_binding invokes nf_nat_setup_info which may return NF_DROP when memory is exhausted, so convert NF_DROP to -ENOMEM to make ctnetlink happy. Or ctnetlink_setup_nat treats it as a success when one error NF_DROP happens actully. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-15 11:04:14 +02:00
Pablo Neira Ayuso	a702ece3b1	Merge tag 'ipvs2-for-v4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Second Round of IPVS Updates for v4.12 please consider these clean-ups and enhancements to IPVS for v4.12. * Removal unused variable * Use kzalloc where appropriate * More efficient detection of presence of NAT extension ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Conflicts: net/netfilter/ipvs/ip_vs_ftp.c	2017-04-15 10:54:40 +02:00
Aaron Conole	db268d4dfd	ipset: remove unused function __ip_set_get_netlink There are no in-tree callers. Signed-off-by: Aaron Conole <aconole@bytheb.org> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-15 10:24:41 +02:00
Aaron Conole	809c2d9a3b	netfilter: nf_conntrack: remove double assignment The protonet pointer will unconditionally be rewritten, so just do the needed assignment first. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-14 01:54:23 +02:00
Aaron Conole	7925056827	netfilter: nf_tables: remove double return statement Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-14 01:54:19 +02:00
Liping Zhang	79e09ef96b	netfilter: nft_hash: do not dump the auto generated seed This can prevent the nft utility from printing out the auto generated seed to the user, which is unnecessary and confusing. Fixes: `cb1b69b0b1` ("netfilter: nf_tables: add hash expression") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-13 23:20:13 +02:00
Taehee Yoo	5389023421	netfilter: nat: remove rcu_read_lock in __nf_nat_decode_session. __nf_nat_decode_session is called from nf_nat_decode_session as decodefn. before calling decodefn, it already set rcu_read_lock. so rcu_read_lock in __nf_nat_decode_session can be removed. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-13 22:48:03 +02:00
Johannes Berg	fe52145f91	netlink: pass extended ACK struct where available This is an add-on to the previous patch that passes the extended ACK structure where it's already available by existing genl_info or extack function arguments. This was done with this spatch (with some manual adjustment of indentation): @@ expression A, B, C, D, E; identifier fn, info; @@ fn(..., struct genl_info info, ...) { ... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, info->extack) ... } @@ expression A, B, C, D, E; identifier fn, info; @@ fn(..., struct genl_info info, ...) { <... -nla_parse_nested(A, B, C, D, NULL) +nla_parse_nested(A, B, C, D, info->extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { <... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { <... -nla_parse(A, B, C, D, E, NULL) +nla_parse(A, B, C, D, E, extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { ... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, extack) ... } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { <... -nla_parse_nested(A, B, C, D, NULL) +nla_parse_nested(A, B, C, D, extack) ...> } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { <... -nlmsg_validate(A, B, C, D, NULL) +nlmsg_validate(A, B, C, D, extack) ...> } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack extack, ...) { <... -nla_validate(A, B, C, D, NULL) +nla_validate(A, B, C, D, extack) ...> } @@ expression A, B, C; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nla_validate_nested(A, B, C, NULL) +nla_validate_nested(A, B, C, extack) ...> } Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:58:22 -04:00
Johannes Berg	fceb6435e8	netlink: pass extended ACK struct to parsing functions Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:58:22 -04:00
Johannes Berg	2d4bc93368	netlink: extended ACK reporting Add the base infrastructure and UAPI for netlink extended ACK reporting. All "manual" calls to netlink_ack() pass NULL for now and thus don't get extended ACK reporting. Big thanks goes to Pablo Neira Ayuso for not only bringing up the whole topic at netconf (again) but also coming up with the nlattr passing trick and various other ideas. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:58:20 -04:00
Liping Zhang	7cddd967bf	netfilter: nf_ct_expect: use proper RCU list traversal/update APIs We should use proper RCU list APIs to manipulate help->expectations, as we can dump the conntrack's expectations via nfnetlink, i.e. in ctnetlink_exp_ct_dump_table(), where only rcu_read_lock is acquired. So for list traversal, use hlist_for_each_entry_rcu; for list add/del, use hlist_add_head_rcu and hlist_del_rcu. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:17 +02:00
Liping Zhang	207df81501	netfilter: ctnetlink: skip dumping expect when nfct_help(ct) is NULL For IPCTNL_MSG_EXP_GET, if the CTA_EXPECT_MASTER attr is specified, then the NLM_F_DUMP request will dump the expectations related to this connection tracking. But we forget to check whether the conntrack has nf_conn_help or not, so if nfct_help(ct) is NULL, oops will happen: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: ctnetlink_exp_ct_dump_table+0xf9/0x1e0 [nf_conntrack_netlink] Call Trace: ? ctnetlink_exp_ct_dump_table+0x75/0x1e0 [nf_conntrack_netlink] netlink_dump+0x124/0x2a0 __netlink_dump_start+0x161/0x190 ctnetlink_dump_exp_ct+0x16c/0x1bc [nf_conntrack_netlink] ? ctnetlink_exp_fill_info.constprop.33+0xf0/0xf0 [nf_conntrack_netlink] ? ctnetlink_glue_seqadj+0x20/0x20 [nf_conntrack_netlink] ctnetlink_get_expect+0x32e/0x370 [nf_conntrack_netlink] ? debug_lockdep_rcu_enabled+0x1d/0x20 nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink] ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink] [...] Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:17 +02:00
Liping Zhang	0c7930e576	netfilter: make it safer during the inet6_dev->addr_list traversal inet6_dev->addr_list is protected by inet6_dev->lock, so only using rcu_read_lock is not enough, we should acquire read_lock_bh(&idev->lock) before the inet6_dev->addr_list traversal. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:16 +02:00
Liping Zhang	3173d5b8c8	netfilter: ctnetlink: make it safer when checking the ct helper name One CPU is doing ctnetlink_change_helper(), while another CPU is doing unhelp() at the same time. So even if help->helper is not NULL at first, the later statement strcmp(help->helper->name, ...) may still access the NULL pointer. So we must use rcu_read_lock and rcu_dereference to avoid such _bad_ thing happen. Fixes: `f95d7a46bc` ("netfilter: ctnetlink: Fix regression in CTA_HELP processing") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:16 +02:00
Gao Feng	8b5995d063	netfilter: helper: Add the rcu lock when call __nf_conntrack_helper_find When invoke __nf_conntrack_helper_find, it needs the rcu lock to protect the helper module which would not be unloaded. Now there are two caller nf_conntrack_helper_try_module_get and ctnetlink_create_expect which don't hold rcu lock. And the other callers left like ctnetlink_change_helper, ctnetlink_create_conntrack, and ctnetlink_glue_attach_expect, they already hold the rcu lock or spin_lock_bh. Remove the rcu lock in functions nf_ct_helper_expectfn_find_by_name and nf_ct_helper_expectfn_find_by_symbol. Because they return one pointer which needs rcu lock, so their caller should hold the rcu lock, not in these two functions. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:15 +02:00
Liping Zhang	97aae0df1d	netfilter: ctnetlink: using bit to represent the ct event Otherwise, creating a new conntrack via nfnetlink: # conntrack -I -p udp -s 1.1.1.1 -d 2.2.2.2 -t 10 --sport 10 --dport 20 will emit the wrong ct events(where UPDATE should be NEW): # conntrack -E [UPDATE] udp 17 10 src=1.1.1.1 dst=2.2.2.2 sport=10 dport=20 [UNREPLIED] src=2.2.2.2 dst=1.1.1.1 sport=20 dport=10 mark=0 Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 23:52:15 +02:00
Eric Dumazet	2638fd0f92	netfilter: xt_TCPMSS: add more sanity tests on tcph->doff Denys provided an awesome KASAN report pointing to an use after free in xt_TCPMSS I have provided three patches to fix this issue, either in xt_TCPMSS or in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible impact. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-08 22:24:19 +02:00
Arushi Singhal	d4ef383541	netfilter: Remove exceptional & on function name Remove & from function pointers to conform to the style found elsewhere in the file. Done using the following semantic patch // <smpl> @r@ identifier f; @@ f(...) { ... } @@ identifier r.f; @@ - &f + f // </smpl> Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-07 18:24:47 +02:00
simran singhal	cbbb40e2ec	net: netfilter: Use list_{next/prev}_entry instead of list_entry This patch replace list_entry with list_prev_entry as it makes the code more clear to read. Signed-off-by: simran singhal <singhalsimran0@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-07 17:33:02 +02:00
simran singhal	cdec26858e	netfilter: Use seq_puts()/seq_putc() where possible For string without format specifiers, use seq_puts(). For seq_printf("\n"), use seq_putc('\n'). Signed-off-by: simran singhal <singhalsimran0@gmail.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-07 17:29:21 +02:00
simran singhal	68ad546aef	netfilter: Remove unnecessary cast on void pointer The following Coccinelle script was used to detect this: @r@ expression x; void* e; type T; identifier f; @@ ( ((T )e) \| ((T )x)[...] \| ((T)x)->f \| - (T*) e ) Unnecessary parantheses are also remove. Signed-off-by: simran singhal <singhalsimran0@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-07 17:29:17 +02:00
Pablo Neira Ayuso	dedb67c4b4	netfilter: Add nfnl_msg_type() helper function Add and use nfnl_msg_type() function to replace opencoded nfnetlink message type. I suggested this change, Arushi Singhal made an initial patch to address this but was missing several spots. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-07 16:31:36 +02:00
Gao Feng	2c62e0bc68	netfilter: ctnetlink: Expectations must have a conntrack helper area The expect check function __nf_ct_expect_check() asks the master_help is necessary. So it is unnecessary to go ahead in ctnetlink_alloc_expect when there is no help. Actually the commit `bc01befdcf` ("netfilter: ctnetlink: add support for user-space expectation helpers") permits ctnetlink create one expect even though there is no master help. But the latter commit `3d058d7bc2` ("netfilter: rework user-space expectation helper support") disables it again. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 22:01:42 +02:00
Florian Westphal	6e699867f8	netfilter: nat: avoid use of nf_conn_nat extension successful insert into the bysource hash sets IPS_SRC_NAT_DONE status bit so we can check that instead of presence of nat extension which requires extra deref. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 22:01:42 +02:00
Gao Feng	cba81cc4c9	netfilter: nat: nf_nat_mangle_{udp,tcp}_packet returns boolean nf_nat_mangle_{udp,tcp}_packet() returns int. However, it is used as bool type in many spots. Fix this by consistently handle this return value as a boolean. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 22:01:38 +02:00
Gao Feng	ec0e3f0111	netfilter: nf_ct_expect: Add nf_ct_remove_expect() When remove one expect, it needs three statements. And there are multiple duplicated codes in current code. So add one common function nf_ct_remove_expect to consolidate this. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 18:39:40 +02:00
Gao Feng	92f73221f9	netfilter: expect: Make sure the max_expected limit is effective Because the type of expecting, the member of nf_conn_help, is u8, it would overflow after reach U8_MAX(255). So it doesn't work when we configure the max_expected exceeds 255 with expect policy. Now add the check for max_expected. Return the -EINVAL when it exceeds the limit. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 18:32:16 +02:00
Pablo Neira Ayuso	f323d95469	netfilter: nf_tables: add nft_is_base_chain() helper This new helper function allows us to check if this is a basechain. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-04-06 18:32:04 +02:00
David S. Miller	6f14f443d3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple cases of overlapping changes (adding code nearby, a function whose name changes, for example). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-06 08:24:51 -07:00
Arushi Singhal	e241137699	ipvs: remove unused variable This patch uses the following coccinelle script to remove a variable that was simply used to store the return value of a function call before returning it: @@ identifier len,f; @@ -int len; ... when != len when strict -len = +return f(...); -return len; Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-03-30 14:54:47 +02:00
Varsha Rao	848850a3e9	netfilter: ipvs: Replace kzalloc with kcalloc. Replace kzalloc with kcalloc. As kcalloc is preferred for allocating an array instead of kzalloc. This patch fixes the checkpatch issue. Signed-off-by: Varsha Rao <rvarsha016@gmail.com>	2017-03-30 14:44:14 +02:00
Florian Westphal	41390895e5	netfilter: ipvs: don't check for presence of nat extension Check for the NAT status bits, they are set once conntrack needs NAT in source or reply direction, this is slightly faster than nfct_nat() as that has to check the extension area. Signed-off-by: Florian Westphal <fw@strlen.de>	2017-03-30 14:41:42 +02:00
Liping Zhang	77c1c03c5b	netfilter: nfnetlink_queue: fix secctx memory leak We must call security_release_secctx to free the memory returned by security_secid_to_secctx, otherwise memory may be leaked forever. Fixes: `ef493bd930` ("netfilter: nfnetlink_queue: add security context information") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-29 12:20:50 +02:00
Liping Zhang	9c3f379492	netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister If one cpu is doing nf_ct_extend_unregister while another cpu is doing __nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover, there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to NULL, so it's possible that we may access invalid pointer. But actually, most of the ct extends are built-in, so the problem listed above will not happen. However, there are two exceptions: NF_CT_EXT_NAT and NF_CT_EXT_SYNPROXY. For _EXT_NAT, the panic will not happen, since adding the nat extend and unregistering the nat extend are located in the same file(nf_nat_core.c), this means that after the nat module is removed, we cannot add the nat extend too. For _EXT_SYNPROXY, synproxy extend may be added by init_conntrack, while synproxy extend unregister will be done by synproxy_core_exit. So after nf_synproxy_core.ko is removed, we may still try to add the synproxy extend, then kernel panic may happen. I know it's very hard to reproduce this issue, but I can play a tricky game to make it happen very easily :) Step 1. Enable SYNPROXY for tcp dport 1234 at FORWARD hook: # iptables -I FORWARD -p tcp --dport 1234 -j SYNPROXY Step 2. Queue the syn packet to the userspace at raw table OUTPUT hook. Also note, in the userspace we only add a 20s' delay, then reinject the syn packet to the kernel: # iptables -t raw -I OUTPUT -p tcp --syn -j NFQUEUE --queue-num 1 Step 3. Using "nc 2.2.2.2 1234" to connect the server. Step 4. Now remove the nf_synproxy_core.ko quickly: # iptables -F FORWARD # rmmod ipt_SYNPROXY # rmmod nf_synproxy_core Step 5. After 20s' delay, the syn packet is reinjected to the kernel. Now you will see the panic like this: kernel BUG at net/netfilter/nf_conntrack_extend.c:91! Call Trace: ? __nf_ct_ext_add_length+0x53/0x3c0 [nf_conntrack] init_conntrack+0x12b/0x600 [nf_conntrack] nf_conntrack_in+0x4cc/0x580 [nf_conntrack] ipv4_conntrack_local+0x48/0x50 [nf_conntrack_ipv4] nf_reinject+0x104/0x270 nfqnl_recv_verdict+0x3e1/0x5f9 [nfnetlink_queue] ? nfqnl_recv_verdict+0x5/0x5f9 [nfnetlink_queue] ? nla_parse+0xa0/0x100 nfnetlink_rcv_msg+0x175/0x6a9 [nfnetlink] [...] One possible solution is to make NF_CT_EXT_SYNPROXY extend built-in, i.e. introduce nf_conntrack_synproxy.c and only do ct extend register and unregister in it, similar to nf_conntrack_timeout.c. But having such a obscure restriction of nf_ct_extend_unregister is not a good idea, so we should invoke synchronize_rcu after set nf_ct_ext_types to NULL, and check the NULL pointer when do __nf_ct_ext_add_length. Then it will be easier if we add new ct extend in the future. Last, we use kfree_rcu to free nf_ct_ext, so rcu_barrier() is unnecessary anymore, remove it too. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-27 13:47:29 +02:00
Liping Zhang	83d90219a5	netfilter: nfnl_cthelper: fix a race when walk the nf_ct_helper_hash table The nf_ct_helper_hash table is protected by nf_ct_helper_mutex, while nfct_helper operation is protected by nfnl_lock(NFNL_SUBSYS_CTHELPER). So it's possible that one CPU is walking the nf_ct_helper_hash for cthelper add/get/del, another cpu is doing nf_conntrack_helpers_unregister at the same time. This is dangrous, and may cause use after free error. Note, delete operation will flush all cthelpers added via nfnetlink, so using rcu to do protect is not easy. Now introduce a dummy list to record all the cthelpers added via nfnetlink, then we can walk the dummy list instead of walking the nf_ct_helper_hash. Also, keep nfnl_cthelper_dump_table unchanged, it may be invoked without nfnl_lock(NFNL_SUBSYS_CTHELPER) held. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-27 13:47:29 +02:00
Liping Zhang	3b7dabf029	netfilter: invoke synchronize_rcu after set the _hook_ to NULL Otherwise, another CPU may access the invalid pointer. For example: CPU0 CPU1 - rcu_read_lock(); - pfunc = _hook_; _hook_ = NULL; - mod unload - - pfunc(); // invalid, panic - rcu_read_unlock(); So we must call synchronize_rcu() to wait the rcu reader to finish. Also note, in nf_nat_snmp_basic_fini, synchronize_rcu() will be invoked by later nf_conntrack_helper_unregister, but I'm inclined to add a explicit synchronize_rcu after set the nf_nat_snmp_hook to NULL. Depend on such obscure assumptions is not a good idea. Last, in nfnetlink_cttimeout, we use kfree_rcu to free the time object, so in cttimeout_exit, invoking rcu_barrier() is not necessary at all, remove it too. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-27 13:47:28 +02:00
David S. Miller	16ae1f2236	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 16:41:27 -07:00
Jeffy Chen	f83bf8da11	netfilter: nfnl_cthelper: Fix memory leak We have memory leaks of nf_conntrack_helper & expect_policy. Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-22 12:45:32 +01:00
Pablo Neira Ayuso	2c42225755	netfilter: nfnl_cthelper: fix runtime expectation policy updates We only allow runtime updates of expectation policies for timeout and maximum number of expectations, otherwise reject the update. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Liping Zhang <zlpnobody@gmail.com>	2017-03-22 12:20:16 +01:00
Liping Zhang	ae5c682113	netfilter: nfnl_cthelper: fix incorrect helper->expect_class_max The helper->expect_class_max must be set to the total number of expect_policy minus 1, since we will use the statement "if (class > helper->expect_class_max)" to validate the CTA_EXPECT_CLASS attr in ctnetlink_alloc_expect. So for compatibility, set the helper->expect_class_max to the NFCTH_POLICY_SET_NUM attr's value minus 1. Also: it's invalid when the NFCTH_POLICY_SET_NUM attr's value is zero. 1. this will result "expect_policy = kzalloc(0, GFP_KERNEL);"; 2. we cannot set the helper->expect_class_max to a proper value. So if nla_get_be32(tb[NFCTH_POLICY_SET_NUM]) is zero, report -EINVAL to the userspace. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-21 15:47:09 +01:00
Reshetova, Elena	4485a841be	netfilter: fix the warning on unused refcount variable net/netfilter/nfnetlink_acct.c: In function 'nfnl_acct_try_del': net/netfilter/nfnetlink_acct.c:329:15: warning: unused variable 'refcount' [-Wunused-variable] unsigned int refcount; ^ Fixes: `b54ab92b84` ("netfilter: refcounter conversions") Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-20 10:49:12 +01:00
Reshetova, Elena	b54ab92b84	netfilter: refcounter conversions refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-17 12:49:43 +01:00
Cong Wang	864e91ca98	ipvs: remove an annoying printk in netns init At most it is used for debugging purpose, but I don't think it is even useful for debugging, just remove it. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2017-03-16 13:33:39 +01:00
Liping Zhang	4494dbc6de	netfilter: nft_ct: do cleanup work when NFTA_CT_DIRECTION is invalid We should jump to invoke __nft_ct_set_destroy() instead of just return error. Fixes: `edee4f1e92` ("netfilter: nft_ct: add zone id set support") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-15 17:15:54 +01:00
Liping Zhang	03e5fd0e9b	netfilter: nft_set_rbtree: use per-set rwlock to improve the scalability Karel Rericha reported that in his test case, ICMP packets going through boxes had normally about 5ms latency. But when running nft, actually listing the sets with interval flags, latency would go up to 30-100ms. This was observed when router throughput is from 600Mbps to 2Gbps. This is because we use a single global spinlock to protect the whole rbtree sets, so "dumping sets" will race with the "key lookup" inevitably. But actually they are all _readers_, so it's ok to convert the spinlock to rwlock to avoid competition between them. Also use per-set rwlock since each set is independent. Reported-by: Karel Rericha <karel@unitednetworks.cz> Tested-by: Karel Rericha <karel@unitednetworks.cz> Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 19:30:43 +01:00
Liping Zhang	2cb4bbd75b	netfilter: limit: use per-rule spinlock to improve the scalability The limit token is independent between each rules, so there's no need to use a global spinlock. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 19:30:31 +01:00
Florian Westphal	fc09e4a75a	netfilter: nf_conntrack: reduce resolve_normal_ct args also mark init_conntrack noinline, in most cases resolve_normal_ct will find an existing conntrack entry. text data bss dec hex filename 16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o 16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 19:30:20 +01:00
Pablo Neira Ayuso	04166f48d9	Revert "netfilter: nf_tables: add flush field to struct nft_set_iter" This reverts commit `1f48ff6c53`. This patch is not required anymore now that we keep a dummy list of set elements in the bitmap set implementation, so revert this before we forget this code has no clients. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 17:30:16 +01:00
Phil Sutter	055c4b34b9	netfilter: nft_fib: Support existence check Instead of the actual interface index or name, set destination register to just 1 or 0 depending on whether the lookup succeeded or not if NFTA_FIB_F_PRESENT was set in userspace. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:45:36 +01:00
Florian Westphal	1a64edf54f	netfilter: nft_ct: add helper set support this allows to assign connection tracking helpers to connections via nft objref infrastructure. The idea is to first specifiy a helper object: table ip filter { ct helper some-name { type "ftp" protocol tcp l3proto ip } } and then assign it via nft add ... ct helper set "some-name" helper assignment works for new conntracks only as we cannot expand the conntrack extension area once it has been committed to the main conntrack table. ipv4 and ipv6 protocols are tracked stored separately so we can also handle families that observe both ipv4 and ipv6 traffic. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:42:09 +01:00
Florian Westphal	84fba05511	netfilter: provide nft_ctx in object init function this is needed by the upcoming ct helper object type -- we'd like to be able use the table family (ip, ip6, inet) to figure out which helper has to be requested. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:42:00 +01:00
Pablo Neira Ayuso	e920dde516	netfilter: nft_set_bitmap: keep a list of dummy elements Element comments may come without any prior set flag, so we have to keep a list of dummy struct nft_set_ext to keep this information around. This is only useful for set dumps to userspace. From the packet path, this set type relies on the bitmap representation. This patch simplifies the logic since we don't need to allocate the dummy nft_set_ext structure anymore on the fly at the cost of increasing memory consumption because of the list of dummy struct nft_set_ext. Fixes: `665153ff57` ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:34:21 +01:00
Steven Rostedt (VMware)	170a1fb9c0	netfilter: Force fake conntrack entry to be at least 8 bytes aligned Since the nfct and nfctinfo have been combined, the nf_conn structure must be at least 8 bytes aligned, as the 3 LSB bits are used for the nfctinfo. But there's a fake nf_conn structure to denote untracked connections, which is created by a PER_CPU construct. This does not guarantee that it will be 8 bytes aligned and can break the logic in determining the correct nfctinfo. I triggered this on a 32bit machine with the following error: BUG: unable to handle kernel NULL pointer dereference at 00000af4 IP: nf_ct_deliver_cached_events+0x1b/0xfb pdpt = 0000000031962001 pde = 0000000000000000 Oops: 0000 [#1] SMP [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport OK ] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: c126ec00 task.stack: c1258000 EIP: nf_ct_deliver_cached_events+0x1b/0xfb EFLAGS: 00010202 CPU: 0 EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0 Call Trace: <SOFTIRQ> ? ipv6_skip_exthdr+0xac/0xcb ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6] nf_hook_slow+0x22/0xc7 nf_hook+0x9a/0xad [ipv6] ? ip6t_do_table+0x356/0x379 [ip6_tables] ? ip6_fragment+0x9e9/0x9e9 [ipv6] ip6_output+0xee/0x107 [ipv6] ? ip6_fragment+0x9e9/0x9e9 [ipv6] dst_output+0x36/0x4d [ipv6] NF_HOOK.constprop.37+0xb2/0xba [ipv6] ? icmp6_dst_alloc+0x2c/0xfd [ipv6] ? local_bh_enable+0x14/0x14 [ipv6] mld_sendpack+0x1c5/0x281 [ipv6] ? mark_held_locks+0x40/0x5c mld_ifc_timer_expire+0x1f6/0x21e [ipv6] call_timer_fn+0x135/0x283 ? detach_if_pending+0x55/0x55 ? mld_dad_timer_expire+0x3e/0x3e [ipv6] __run_timers+0x111/0x14b ? mld_dad_timer_expire+0x3e/0x3e [ipv6] run_timer_softirq+0x1c/0x36 __do_softirq+0x185/0x37c ? test_ti_thread_flag.constprop.19+0xd/0xd do_softirq_own_stack+0x22/0x28 </SOFTIRQ> irq_exit+0x5a/0xa4 smp_apic_timer_interrupt+0x2a/0x34 apic_timer_interrupt+0x37/0x3c By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte alignment as all cache line sizes are at least 8 bytes or more. Fixes: `a9e419dc7b` ("netfilter: merge ctinfo into nfct pointer storage area") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:33:58 +01:00
Liping Zhang	10596608c4	netfilter: nf_tables: fix mismatch in big-endian system Currently, there are two different methods to store an u16 integer to the u32 data register. For example: u32 dest = &regs->data[priv->dreg]; 1. dest = 0; (u16 ) dest = val_u16; 2. dest = val_u16; For method 1, the u16 value will be stored like this, either in big-endian or little-endian system: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ \| Value \| 0 \| +-+-+-+-+-+-+-+-+-+-+-+-+ For method 2, in little-endian system, the u16 value will be the same as listed above. But in big-endian system, the u16 value will be stored like this: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ \| 0 \| Value \| +-+-+-+-+-+-+-+-+-+-+-+-+ So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong result in big-endian system, as 0~15 bits will always be zero. For the similar reason, when loading an u16 value from the u32 data register, we should use "(u16 ) sreg;" instead of "(u16)sreg;", the 2nd method will get the wrong value in the big-endian system. So introduce some wrapper functions to store/load an u8 or u16 integer to/from the u32 data register, and use them in the right place. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:30:28 +01:00
Liping Zhang	fd89b23a46	netfilter: nft_set_bitmap: fetch the element key based on the set->klen Currently we just assume the element key as a u32 integer, regardless of the set key length. This is incorrect, for example, the tcp port number is only 16 bits. So when we use the nft_payload expr to get the tcp dport and store it to dreg, the dport will be stored at 0~15 bits, and 16~31 bits will be padded with zero. So the reg->data[dreg] will be looked like as below: 0 15 31 +-+-+-+-+-+-+-+-+-+-+-+-+ \| tcp dport \| 0 \| +-+-+-+-+-+-+-+-+-+-+-+-+ But for these big-endian systems, if we treate this register as a u32 integer, the element key will be larger than 65535, so the following lookup in bitmap set will cause out of bound access. Another issue is that if we add element with comments in bitmap set(although the comments will be ignored eventually), the element will vanish strangely. Because we treate the element key as a u32 integer, so the comments will become the part of the element key, then the element key will also be larger than 65535 and out of bound access will happen: # nft add element t s { 1 comment test } Since set->klen is 1 or 2, it's fine to treate the element key as a u8 or u16 integer. Fixes: `665153ff57` ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-13 13:16:42 +01:00
Ying Xue	8e05ba7f84	netfilter: nf_nat_sctp: fix ICMP packet to be dropped accidently Regarding RFC 792, the first 64 bits of the original SCTP datagram's data could be contained in ICMP packet, such as: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Type \| Code \| Checksum \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| unused \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Internet Header + 64 bits of Original Data Datagram \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ However, according to RFC 4960, SCTP datagram header is as below: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Source Port Number \| Destination Port Number \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Verification Tag \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ \| Checksum \| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ It means only the first three fields of SCTP header can be carried in ICMP packet except for Checksum field. At present in sctp_manip_pkt(), no matter whether the packet is ICMP or not, it always calculates SCTP packet checksum. However, not only the calculation of checksum is unnecessary for ICMP, but also it causes another fatal issue that ICMP packet is dropped. The header size of SCTP is used to identify whether the writeable length of skb is bigger than skb->len through skb_make_writable() in sctp_manip_pkt(). But when it deals with ICMP packet, skb_make_writable() directly returns false as the writeable length of skb is bigger than skb->len. Subsequently ICMP is dropped. Now we correct this misbahavior. When sctp_manip_pkt() handles ICMP packet, 8 bytes rather than the whole SCTP header size is used to check if writeable length of skb is overflowed. Meanwhile, as it's meaningless to calculate checksum when packet is ICMP, the computation of checksum is ignored as well. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-08 18:04:06 +01:00
Pablo Neira Ayuso	c7a72e3fdb	netfilter: nf_tables: add nft_set_lookup() This new function consolidates set lookup via either name or ID by introducing a new nft_set_lookup() function. Replace existing spots where we can use this too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-06 18:23:23 +01:00
Liping Zhang	c56e3956c1	netfilter: nf_tables: validate the expr explicitly after init successfully When we want to validate the expr's dependency or hooks, we must do two things to accomplish it. First, write a X_validate callback function and point ->validate to it. Second, call X_validate in init routine. This is very common, such as fib, nat, reject expr and so on ... It is a little ugly, since we will call X_validate in the expr's init routine, it's better to do it in nf_tables_newexpr. So we can avoid to do this again and again. After doing this, the second step listed above is not useful anymore, remove them now. Patch was tested by nftables/tests/py/nft-test.py and nftables/tests/shell/run-tests.sh. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-06 18:22:12 +01:00
Laura Garcia Liebana	3206caded8	netfilter: nft_hash: support of symmetric hash This patch provides symmetric hash support according to source ip address and port, and destination ip address and port. For this purpose, the __skb_get_hash_symmetric() is used to identify the flow as it uses FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL flag by default. The new attribute NFTA_HASH_TYPE has been included to support different types of hashing functions. Currently supported NFT_HASH_JENKINS through jhash and NFT_HASH_SYM through symhash. The main difference between both types are: - jhash requires an expression with sreg, symhash doesn't. - symhash supports modulus and offset, but not seed. Examples: nft add rule ip nat prerouting ct mark set jhash ip saddr mod 2 nft add rule ip nat prerouting ct mark set symhash mod 2 By default, jenkins hash will be used if no hash type is provided for compatibility reasons. Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-06 17:57:42 +01:00
Laura Garcia Liebana	511040eea2	netfilter: nft_hash: rename nft_hash to nft_jhash This patch renames the local nft_hash structure and functions to nft_jhash in order to prepare the nft_hash module code to add new hash functions. Signed-off-by: Laura Garcia Liebana <laura.garcia@zevenet.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-06 17:57:34 +01:00
Phil Sutter	3c1fece881	netfilter: nft_exthdr: Allow checking TCP option presence, too Honor NFT_EXTHDR_F_PRESENT flag so we check if the TCP option is present. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-06 17:52:56 +01:00
Linus Torvalds	8d70eeb84a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix double-free in batman-adv, from Sven Eckelmann. 2) Fix packet stats for fast-RX path, from Joannes Berg. 3) Netfilter's ip_route_me_harder() doesn't handle request sockets properly, fix from Florian Westphal. 4) Fix sendmsg deadlock in rxrpc, from David Howells. 5) Add missing RCU locking to transport hashtable scan, from Xin Long. 6) Fix potential packet loss in mlxsw driver, from Ido Schimmel. 7) Fix race in NAPI handling between poll handlers and busy polling, from Eric Dumazet. 8) TX path in vxlan and geneve need proper RCU locking, from Jakub Kicinski. 9) SYN processing in DCCP and TCP need to disable BH, from Eric Dumazet. 10) Properly handle net_enable_timestamp() being invoked from IRQ context, also from Eric Dumazet. 11) Fix crash on device-tree systems in xgene driver, from Alban Bedel. 12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de Melo. 13) Fix use-after-free in netvsc driver, from Dexuan Cui. 14) Fix max MTU setting in bonding driver, from WANG Cong. 15) xen-netback hash table can be allocated from softirq context, so use GFP_ATOMIC. From Anoob Soman. 16) Fix MAC address change bug in bgmac driver, from Hari Vyas. 17) strparser needs to destroy strp_wq on module exit, from WANG Cong. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) strparser: destroy workqueue on module exit sfc: fix IPID endianness in TSOv2 sfc: avoid max() in array size rds: remove unnecessary returned value check rxrpc: Fix potential NULL-pointer exception nfp: correct DMA direction in XDP DMA sync nfp: don't tell FW about the reserved buffer space net: ethernet: bgmac: mac address change bug net: ethernet: bgmac: init sequence bug xen-netback: don't vfree() queues under spinlock xen-netback: keep a local pointer for vif in backend_disconnect() netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups netfilter: nf_conntrack_sip: fix wrong memory initialisation can: flexcan: fix typo in comment can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer can: gs_usb: fix coding style can: gs_usb: Don't use stack memory for USB transfers ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines ixgbe: update the rss key on h/w, when ethtool ask for it ...	2017-03-04 17:31:39 -08:00
David S. Miller	20b83643ab	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Missing check for full sock in ip_route_me_harder(), from Florian Westphal. 2) Incorrect sip helper structure initilization that breaks it when several ports are used, from Christophe Leroy. 3) Fix incorrect assumption when looking up for matching with adjacent intervals in the nft_set_rbtree. 4) Fix broken netlink event error reporting in nf_tables that results in misleading ESRCH errors propagated to userspace listeners. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-03 20:40:06 -08:00
Pablo Neira Ayuso	25e94a997b	netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails The underlying nlmsg_multicast() already sets sk->sk_err for us to notify socket overruns, so we should not do anything with this return value. So we just call nfnetlink_set_err() if: 1) We fail to allocate the netlink message. or 2) We don't have enough space in the netlink message to place attributes, which means that we likely need to allocate a larger message. Before this patch, the internal ESRCH netlink error code was propagated to userspace, which is quite misleading. Netlink semantics mandate that listeners just hit ENOBUFS if the socket buffer overruns. Reported-by: Alexander Alemayhu <alexander@alemayhu.com> Tested-by: Alexander Alemayhu <alexander@alemayhu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-03 13:48:34 +01:00
Pablo Neira Ayuso	f9121355eb	netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups In case of adjacent ranges, we may indeed see either the high part of the range in first place or the low part of it. Remove this incorrect assumption, let's make sure we annotate the low part of the interval in case of we have adjacent interva intervals so we hit a matching in lookups. Reported-by: Simon Hanisch <hanisch@wh2.tu-dresden.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-03 13:48:32 +01:00
Christophe Leroy	da2f27e9e6	netfilter: nf_conntrack_sip: fix wrong memory initialisation In commit `82de0be686` ("netfilter: Add helper array register/unregister functions"), struct nf_conntrack_helper sip[MAX_PORTS][4] was changed to sip[MAX_PORTS * 4], so the memory init should have been changed to memset(&sip[4 * i], 0, 4 * sizeof(sip[i])); But as the sip[] table is allocated in the BSS, it is already set to 0 Fixes: `82de0be686` ("netfilter: Add helper array register/unregister functions") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-03-03 13:48:31 +01:00
Ingo Molnar	5b825c3af1	sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:31 +01:00
Linus Torvalds	c2eca00fec	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Don't save TIPC header values before the header has been validated, from Jon Paul Maloy. 2) Fix memory leak in RDS, from Zhu Yanjun. 3) We miss to initialize the UID in the flow key in some paths, from Julian Anastasov. 4) Fix latent TOS masking bug in the routing cache removal from years ago, also from Julian. 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(), fix from Xin Long. 6) Missing module ref count drop in packet scheduler actions, from Roman Mashak. 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu. 8) Fix use after free which happens because L2TP's ipv4 support returns non-zero values from it's backlog_rcv function which ipv4 interprets as protocol values. Fix from Paul Hüber. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) qed: Don't use attention PTT for configuring BW qed: Fix race with multiple VFs l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv xfrm: provide correct dst in xfrm_neigh_lookup rhashtable: Fix RCU dereference annotation in rht_bucket_nested rhashtable: Fix use before NULL check in bucket_table_free net sched actions: do not overwrite status of action creation. rxrpc: Kernel calls get stuck in recvmsg net sched actions: decrement module reference count after table flush. lib: Allow compile-testing of parman ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt sctp: set sin_port for addr param when checking duplicate address net/mlx4_en: fix overflow in mlx4_en_init_timestamp() netfilter: nft_set_bitmap: incorrect bitmap size net: s2io: fix typo argumnet argument net: vxge: fix typo argumnet argument netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. ipv4: mask tos for input route ipv4: add missing initialization for flowi4_uid lib: fix spelling mistake: "actualy" -> "actually" ...	2017-02-28 10:00:39 -08:00
Alexey Dobriyan	5b5e0928f7	lib/vsprintf.c: remove %Z support Now that %z is standartised in C99 there is no reason to support %Z. Unlike %L it doesn't even make format strings smaller. Use BUILD_BUG_ON in a couple ATM drivers. In case anyone didn't notice lib/vsprintf.o is about half of SLUB which is in my opinion is quite an achievement. Hopefully this patch inspires someone else to trim vsprintf.c more. Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:47 -08:00
Masahiro Yamada	550116d21a	scripts/spelling.txt: add "aligment" pattern and fix typo instances Fix typos and add the following to the scripts/spelling.txt: aligment\|\|alignment I did not touch the "N_BYTE_ALIGMENT" macro in drivers/net/wireless/realtek/rtlwifi/wifi.h to avoid unpredictable impact. I fixed "_aligment_handler" in arch/openrisc/kernel/entry.S because it is surrounded by #if 0 ... #endif. It is surely safe and I confirmed "_alignment_handler" is correct. I also fixed the "controler" I found in the same hunk in arch/openrisc/kernel/head.S. Link: http://lkml.kernel.org/r/1481573103-11329-8-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:46 -08:00
Masahiro Yamada	9332ef9dbd	scripts/spelling.txt: add "an user" pattern and fix typo instances Fix typos and add the following to the scripts/spelling.txt: an user\|\|a user an userspace\|\|a userspace I also added "userspace" to the list since it is a common word in Linux. I found some instances for "an userfaultfd", but I did not add it to the list. I felt it is endless to find words that start with "user" such as "userland" etc., so must draw a line somewhere. Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:46 -08:00
Pablo Neira Ayuso	13aa5a8f49	netfilter: nft_set_bitmap: incorrect bitmap size priv->bitmap_size stores the real bitmap size, instead of the full struct nft_bitmap object. Fixes: `665153ff57` ("netfilter: nf_tables: add bitmap set type") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-26 21:00:19 +01:00
Jarno Rajahalme	4b86c459c7	netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. Commit `4dee62b1b9` ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1 due to __nf_ct_expect_check() returning 1 on success. Prevent this regression in the future by changing the return value of __nf_ct_expect_check() to 0 on success. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-26 17:06:59 +01:00
Jarno Rajahalme	7fb668ac7b	netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success. Commit `4dee62b1b9` ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") inadvertently changed the successful return value of nf_ct_expect_related_report() from 0 to 1, which caused openvswitch conntrack integration fail in FTP test cases. Fix this by always returning zero on the success code path. Fixes: `4dee62b1b9` ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns void") Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-25 13:32:35 +01:00
Florian Westphal	427345d612	netfilter: nft_ct: fix random validation errors for zone set support Dan reports: net/netfilter/nft_ct.c:549 nft_ct_set_init() error: uninitialized symbol 'len'. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: `edee4f1e92` ("netfilter: nft_ct: add zone id set support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-23 21:50:28 +01:00
David S. Miller	ccaba0621a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Revisit warning logic when not applying default helper assignment. Jiri Kosina considers we are breaking existing setups and not warning our users accordinly now that automatic helper assignment has been turned off by default. So let's make him happy by spotting the warning by when we find a helper but we cannot attach, instead of warning on the former deprecated behaviour. Patch from Jiri Kosina. 2) Two patches to fix regression in ctnetlink interfaces with nfnetlink_queue. Specifically, perform more relaxed in CTA_STATUS and do not bail out if CTA_HELP indicates the same helper that we already have. Patches from Kevin Cernekee. 3) A couple of bugfixes for ipset via Jozsef Kadlecsik. Due to wrong index logic in hash set types and null pointer exception in the list:set type. 4) hashlimit bails out with correct userspace parameters due to wrong arithmetics in the code that avoids "divide by zero" when transforming the userspace timing in milliseconds to token credits. Patch from Alban Browaeys. 5) Fix incorrect NFQA_VLAN_MAX definition, patch from Ken-ichirou MATSUZAWA. 6) Don't not declare nfnetlink batch error list as static, since this may be used by several subsystems at the same time. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-23 10:59:15 -05:00
Pablo Neira Ayuso	3ef767e5cb	Merge branch 'master' of git://blackhole.kfki.hu/nf Jozsef Kadlecsik says: ==================== ipset patches for nf Please apply the next patches for ipset in your nf branch. Both patches should go into the stable kernel branches as well, because these are important bugfixes: * Sometimes valid entries in hash:* types of sets were evicted due to a typo in an index. The wrong evictions happen when entries are deleted from the set and the bucket is shrinked. Bug was reported by Eric Ewanco and the patch fixes netfilter bugzilla id #1119. * Fixing of a null pointer exception when someone wants to add an entry to an empty list type of set and specifies an add before/after option. The fix is from Vishwanath Pai. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-21 14:01:05 +01:00
Liping Zhang	4eba8b78e1	netfilter: nfnetlink: remove static declaration from err_list Otherwise, different subsys will race to access the err_list, with holding the different nfnl_lock(subsys_id). But this will not happen now, since ->call_batch is only implemented by nftables, so the err_list is protected by nfnl_lock(NFNL_SUBSYS_NFTABLES). Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-21 13:45:47 +01:00
Alban Browaeys	ad5b557619	netfilter: xt_hashlimit: Fix integer divide round to zero. Diving the divider by the multiplier before applying to the input. When this would "divide by zero", divide the multiplier by the divider first then multiply the input by this value. Currently user2creds outputs zero when input value is bigger than the number of slices and lower than scale. This as then user input is applied an integer divide operation to a number greater than itself (scale). That rounds up to zero, then we multiply zero by the credits slice size. iptables -t filter -I INPUT --protocol tcp --match hashlimit --hashlimit 40/second --hashlimit-burst 20 --hashlimit-mode srcip --hashlimit-name syn-flood --jump RETURN thus trigger the overflow detection code: xt_hashlimit: overflow, try lower: 25000/20 (25000 as hashlimit avg and 20 the burst) Here: 134217 slices of (HZ * CREDITS_PER_JIFFY) size. 500000 is user input value 1000000 is XT_HASHLIMIT_SCALE_v2 gives: 0 as user2creds output Setting burst to "1" typically solve the issue ... but setting it to "40" does too ! This is on 32bit arch calling into revision 2 of hashlimit. Signed-off-by: Alban Browaeys <alban.browaeys@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-19 21:12:23 +01:00
Vishwanath Pai	40b446a1d8	netfilter: ipset: Null pointer exception in ipset list:set If we use before/after to add an element to an empty list it will cause a kernel panic. $> cat crash.restore create a hash:ip create b hash:ip create test list:set timeout 5 size 4 add test b before a $> ipset -R < crash.restore Executing the above will crash the kernel. Signed-off-by: Vishwanath Pai <vpai@akamai.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>	2017-02-19 19:08:47 +01:00
Jozsef Kadlecsik	50054a9223	Fix bug: sometimes valid entries in hash:* types of sets were evicted Wrong index was used and therefore when shrinking a hash bucket at deleting an entry, valid entries could be evicted as well. Thanks to Eric Ewanco for the thorough bugreport. Fixes netfilter bugzilla #1119 Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>	2017-02-19 19:08:32 +01:00
Pablo Neira Ayuso	7286ff7fde	netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection Check for NFT_SET_OBJECT feature flag, otherwise we may end up selecting the wrong set backend. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:14 +01:00
Pablo Neira Ayuso	1a94e38d25	netfilter: nf_tables: add NFTA_RULE_ID attribute This new attribute allows us to uniquely identify a rule in transaction. Robots may trigger an insertion followed by deletion in a batch, in that scenario we still don't have a public rule handle that we can use to delete the rule. This is similar to the NFTA_SET_ID attribute that allows us to refer to an anonymous set from a batch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:13 +01:00
Pablo Neira Ayuso	74e8bcd21c	netfilter: nf_tables: add check_genid to the nfnetlink subsystem This patch implements the check generation id as provided by nfnetlink. This allows us to reject ruleset updates against stale baseline, so userspace can retry update with a fresh ruleset cache. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:12 +01:00
Pablo Neira Ayuso	8c4d4e8b56	netfilter: nfnetlink: allow to check for generation ID This patch allows userspace to specify the generation ID that has been used to build an incremental batch update. If userspace specifies the generation ID in the batch message as attribute, then nfnetlink compares it to the current generation ID so you make sure that you work against the right baseline. Otherwise, bail out with ERESTART so userspace knows that its changeset is stale and needs to respin. Userspace can do this transparently at the cost of taking slightly more time to refresh caches and rework the changeset. This check is optional, if there is no NFNL_BATCH_GENID attribute in the batch begin message, then no check is performed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:11 +01:00
Pablo Neira Ayuso	48656835c0	netfilter: nfnetlink: add nfnetlink_rcv_skb_batch() Add new nfnetlink_rcv_skb_batch() to wrap initial nfnetlink batch handling. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:10 +01:00
Pablo Neira Ayuso	b745d0358d	netfilter: nfnetlink: get rid of u_intX_t types Use uX types instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:45:09 +01:00
Gao Feng	4dee62b1b9	netfilter: nf_ct_expect: nf_ct_expect_insert() returns void Because nf_ct_expect_insert() always succeeds now, its return value can be just void instead of int. And remove code that checks for its return value. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:44:08 +01:00
Gao Feng	a96e66e702	netfilter: nf_ct_sip: Use mod_timer_pending() timer_del() followed by timer_add() can be replaced by mod_timer_pending(). Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-12 14:39:06 +01:00
Manuel Messner	935b7f6430	netfilter: nft_exthdr: add TCP option matching This patch implements the kernel side of the TCP option patch. Signed-off-by: Manuel Messner <mm@skelett.io> Reviewed-by: Florian Westphal <fw@strlen.de> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:17:09 +01:00
Florian Westphal	edee4f1e92	netfilter: nft_ct: add zone id set support zones allow tracking multiple connections sharing identical tuples, this is needed e.g. when tracking distinct vlans with overlapping ip addresses (conntrack is l2 agnostic). Thus the zone has to be set before the packet is picked up by the connection tracker. This is done by means of 'conntrack templates' which are conntrack structures used solely to pass this info from one netfilter hook to the next. The iptables CT target instantiates these connection tracking templates once per rule, i.e. the template is fixed/tied to particular zone, can be read-only and therefore be re-used by as many skbs simultaneously as needed. We can't follow this model because we want to take the zone id from an sreg at rule eval time so we could e.g. fill in the zone id from the packets vlan id or a e.g. nftables key : value maps. To avoid cost of per packet alloc/free of the template, use a percpu template 'scratch' object and use the refcount to detect the (unlikely) case where the template is still attached to another skb (i.e., previous skb was nfqueued ...). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:23 +01:00
Florian Westphal	5c178d81b6	netfilter: nft_ct: prepare for key-dependent error unwind Next patch will add ZONE_ID set support which will need similar error unwind (put operation) as conntrack labels. Prepare for this: remove the 'label_got' boolean in favor of a switch statement that can be extended in next patch. As we already have that in the set_destroy function place that in a separate function and call it from the set init function. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:23 +01:00
Florian Westphal	ab23821f7e	netfilter: nft_ct: add zone id get support Just like with counters the direction attribute is optional. We set priv->dir to MAX unconditionally to avoid duplicating the assignment for all keys with optional direction. For keys where direction is mandatory, existing code already returns an error. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:22 +01:00
Pablo Neira Ayuso	665153ff57	netfilter: nf_tables: add bitmap set type This patch adds a new bitmap set type. This bitmap uses two bits to represent one element. These two bits determine the element state in the current and the future generation that fits into the nf_tables commit protocol. When dumping elements back to userspace, the two bits are expanded into a struct nft_set_ext object. If no NFTA_SET_DESC_SIZE is specified, the existing automatic set backend selection prefers bitmap over hash in case of keys whose size is <= 16 bit. If the set size is know, the bitmap set type is selected if with 16 bit kets and more than 390 elements in the set, otherwise the hash table set implementation is used. For 8 bit keys, the bitmap consumes 66 bytes. For 16 bit keys, the bitmap takes 16388 bytes. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:21 +01:00
Pablo Neira Ayuso	0b5a787492	netfilter: nf_tables: add space notation to sets The space notation allows us to classify the set backend implementation based on the amount of required memory. This provides an order of the set representation scalability in terms of memory. The size field is still left in place so use this if the userspace provides no explicit number of elements, so we cannot calculate the real memory that this set needs. This also helps us break ties in the set backend selection routine, eg. two backend implementations provide the same performance. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:21 +01:00
Pablo Neira Ayuso	55af753cd9	netfilter: nf_tables: rename struct nft_set_estimate class field Use lookup as field name instead, to prepare the introduction of the memory class in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:20 +01:00
Pablo Neira Ayuso	1f48ff6c53	netfilter: nf_tables: add flush field to struct nft_set_iter This provides context to walk callback iterator, thus, we know if the walk happens from the set flush path. This is required by the new bitmap set type coming in a follow up patch which has no real struct nft_set_ext, so it has to allocate it based on the two bit compact element representation. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:20 +01:00
Pablo Neira Ayuso	1ba1c41408	netfilter: nf_tables: rename deactivate_one() to flush() Although semantics are similar to deactivate() with no implicit element lookup, this is only called from the set flush path, so better rename this to flush(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:19 +01:00
Pablo Neira Ayuso	baa2d42cff	netfilter: nf_tables: use struct nft_set_iter in set element flush Instead of struct nft_set_dump_args, remove unnecessary wrapper structure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:18 +01:00
Pablo Neira Ayuso	5cb82a38c6	netfilter: nf_tables: pass netns to set->ops->remove() This new parameter is required by the new bitmap set type that comes in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:16:18 +01:00
Phil Sutter	c078ca3b0c	netfilter: nft_exthdr: Add support for existence check If NFT_EXTHDR_F_PRESENT is set, exthdr will not copy any header field data into *dest, but instead set it to 1 if the header is found and 0 otherwise. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-08 14:14:09 +01:00
Kevin Cernekee	f95d7a46bc	netfilter: ctnetlink: Fix regression in CTA_HELP processing Prior to Linux 4.4, it was usually harmless to send a CTA_HELP attribute containing the name of the current helper. That is no longer the case: as of Linux 4.4, if ctnetlink_change_helper() returns an error from the ct->master check, processing of the request will fail, skipping the NFQA_EXP attribute (if present). This patch changes the behavior to improve compatibility with user programs that expect the kernel interface to work the way it did prior to Linux 4.4. If a user program specifies CTA_HELP but the argument matches the current conntrack helper name, ignore it instead of generating an error. Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-06 12:49:05 +01:00
Kevin Cernekee	a963d710f3	netfilter: ctnetlink: Fix regression in CTA_STATUS processing The libnetfilter_conntrack userland library always sets IPS_CONFIRMED when building a CTA_STATUS attribute. If this toggles the bit from 0->1, the parser will return an error. On Linux 4.4+ this will cause any NFQA_EXP attribute in the packet to be ignored. This breaks conntrackd's userland helpers because they operate on unconfirmed connections. Instead of returning -EBUSY if the user program asks to modify an unchangeable bit, simply ignore the change. Also, fix the logic so that user programs are allowed to clear the bits that they are allowed to change. Signed-off-by: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-06 12:48:26 +01:00
Jiri Kosina	dfe75ff8ca	netfilter: nf_ct_helper: warn when not applying default helper assignment Commit `3bb398d925` ("netfilter: nf_ct_helper: disable automatic helper assignment") is causing behavior regressions in firewalls, as traffic handled by conntrack helpers is now by default not passed through even though it was before due to missing CT targets (which were not necessary before this commit). The default had to be switched off due to security reasons [1] [2] and therefore should stay the way it is, but let's be friendly to firewall admins and issue a warning the first time we're in situation where packet would be likely passed through with the old default but we're likely going to drop it on the floor now. Rewrite the code a little bit as suggested by Linus, so that we avoid spaghettiing the code even more -- namely the whole decision making process regarding helper selection (either automatic or not) is being separated, so that the whole logic can be simplified and code (condition) duplication reduced. [1] https://cansecwest.com/csw12/conntrack-attack.pdf [2] https://home.regit.org/netfilter-en/secure-use-of-helpers/ Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-06 12:03:35 +01:00
David S. Miller	52e01b84a2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:58:20 -05:00
Michal Kubeček	2851940ffe	netfilter: allow logging from non-init namespaces Commit `69b34fb996` ("netfilter: xt_LOG: add net namespace support for xt_LOG") disabled logging packets using the LOG target from non-init namespaces. The motivation was to prevent containers from flooding kernel log of the host. The plan was to keep it that way until syslog namespace implementation allows containers to log in a safe way. However, the work on syslog namespace seems to have hit a dead end somewhere in 2013 and there are users who want to use xt_LOG in all network namespaces. This patch allows to do so by setting /proc/sys/net/netfilter/nf_log_all_netns to a nonzero value. This sysctl is only accessible from init_net so that one cannot switch the behaviour from inside a container. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:58 +01:00
David Windsor	90c1aff702	ipvs: free ip_vs_dest structs when refcnt=0 Currently, the ip_vs_dest cache frees ip_vs_dest objects when their reference count becomes < 0. Aside from not being semantically sound, this is problematic for the new type refcount_t, which will be introduced shortly in a separate patch. refcount_t is the new kernel type for holding reference counts, and provides overflow protection and a constrained interface relative to atomic_t (the type currently being used for kernel reference counts). Per Julian Anastasov: "The problem is that dest_trash currently holds deleted dests (unlinked from RCU lists) with refcnt=0." Changing dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest structs when their refcnt=0, in ip_vs_dest_put_and_free(). Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:57 +01:00
Florian Westphal	a9e419dc7b	netfilter: merge ctinfo into nfct pointer storage area After this change conntrack operations (lookup, creation, matching from ruleset) only access one instead of two sk_buff cache lines. This works for normal conntracks because those are allocated from a slab that guarantees hw cacheline or 8byte alignment (whatever is larger) so the 3 bits needed for ctinfo won't overlap with nf_conn addresses. Template allocation now does manual address alignment (see previous change) on arches that don't have sufficent kmalloc min alignment. Some spots intentionally use skb->_nfct instead of skb_nfct() helpers, this is to avoid undoing the skb_nfct() use when we remove untracked conntrack object in the future. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:56 +01:00
Florian Westphal	3032230920	netfilter: guarantee 8 byte minalign for template addresses The next change will merge skb->nfct pointer and skb->nfctinfo status bits into single skb->_nfct (unsigned long) area. For this to work nf_conn addresses must always be aligned at least on an 8 byte boundary since we will need the lower 3bits to store nfctinfo. Conntrack templates are allocated via kmalloc. kbuild test robot reported BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN on v1 of this patchset, so not all platforms meet this requirement. Do manual alignment if needed, the alignment offset is stored in the nf_conn entry protocol area. This works because templates are not handed off to L4 protocol trackers. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:55 +01:00
Florian Westphal	c74454fadd	netfilter: add and use nf_ct_set helper Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff. This avoids changing code in followup patch that merges skb->nfct and skb->nfctinfo into skb->_nfct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:54 +01:00
Florian Westphal	cb9c68363e	skbuff: add and use skb_nfct helper Followup patch renames skb->nfct and changes its type so add a helper to avoid intrusive rename change later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:53 +01:00
Florian Westphal	97a6ad13de	netfilter: reduce direct skb->nfct usage Next patch makes direct skb->nfct access illegal, reduce noise in next patch by using accessors we already have. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:52 +01:00
Florian Westphal	11df4b760f	netfilter: conntrack: no need to pass ctinfo to error handler It is never accessed for reading and the only places that write to it are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo). The conntrack core specifically checks for attached skb->nfct after ->error() invocation and returns early in this case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:31:51 +01:00
Feng	10435c1192	netfilter: nf_tables: Eliminate duplicated code in nf_tables_table_enable() If something fails in nf_tables_table_enable(), it unregisters the chains. But the rollback code is the same as nf_tables_table_disable() almostly, except there is one counter check. Now create one wrapper function to eliminate the duplicated codes. Signed-off-by: Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-02-02 14:30:19 +01:00
David S. Miller	4e8f2fc1a5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-28 10:33:06 -05:00
Pablo Neira Ayuso	b2c11e4b95	netfilter: nf_tables: bump set->ndeact on set flush Add missing set->ndeact update on each deactivated element from the set flush path. Otherwise, sets with fixed size break after flush since accounting breaks. # nft add set x y { type ipv4_addr\; size 2\; } # nft add element x y { 1.1.1.1 } # nft add element x y { 1.1.1.2 } # nft flush set x y # nft add element x y { 1.1.1.1 } <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system Fixes: `8411b6442e` ("netfilter: nf_tables: support for set flushing") Reported-by: Elise Lennion <elise.lennion@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-24 21:46:59 +01:00
Pablo Neira Ayuso	de70185de0	netfilter: nf_tables: deconstify walk callback function The flush operation needs to modify set and element objects, so let's deconstify this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-24 21:46:58 +01:00
Pablo Neira Ayuso	35d0ac9070	netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL If the element exists and no NLM_F_EXCL is specified, do not bump set->nelems, otherwise we leak one set element slot. This problem amplifies if the set is full since the abort path always decrements the counter for the -ENFILE case too, giving one spare extra slot. Fix this by moving set->nelems update to nft_add_set_elem() after successful element insertion. Moreover, remove the element if the set is full so there is no need to rely on the abort path to undo things anymore. Fixes: `c016c7e45d` ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-24 21:46:57 +01:00
Liping Zhang	5ce6b04ce9	netfilter: nft_log: restrict the log prefix length to 127 First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: `96518518cc` ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-24 21:46:29 +01:00
Krister Johansen	4548b683b7	Introduce a sysctl that modifies the value of PROT_SOCK. Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap. The use case for this change is to allow containerized processes to bind to priviliged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's priviliged port restrictions without exposing control of the network namespace to processes in the user namespace. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-24 12:10:51 -05:00
Liping Zhang	b2fbd04498	netfilter: nf_tables: validate the name size when possible Currently, if the user add a stateful object with the name size exceed NFT_OBJ_MAXNAMELEN - 1 (i.e. 31), we truncate it down to 31 silently. This is not friendly, furthermore, this will cause duplicated stateful objects when the first 31 characters of the name is same. So limit the stateful object's name size to NFT_OBJ_MAXNAMELEN - 1. After apply this patch, error message will be printed out like this: # name_32=$(printf "%0.sQ" {1..32}) # nft add counter filter $name_32 <cmdline>:1:1-52: Error: Could not process rule: Numerical result out of range add counter filter QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Also this patch cleans up the codes which missing the name size limit validation in nftables. Fixes: `e50092404c` ("netfilter: nf_tables: add stateful objects") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-23 23:36:50 +01:00
Florian Westphal	e5072053b0	netfilter: conntrack: refine gc worker heuristics, redux This further refines the changes made to conntrack gc_worker in commit `e0df8cae6c` ("netfilter: conntrack: refine gc worker heuristics"). The main idea of that change was to reduce the scan interval when evictions take place. However, on the reporters' setup, there are 1-2 million conntrack entries in total and roughly 8k new (and closing) connections per second. In this case we'll always evict at least one entry per gc cycle and scan interval is always at 1 jiffy because of this test: } else if (expired_count) { gc_work->next_gc_run /= 2U; next_run = msecs_to_jiffies(1); being true almost all the time. Given we scan ~10k entries per run its clearly wrong to reduce interval based on nonzero eviction count, it will only waste cpu cycles since a vast majorities of conntracks are not timed out. Thus only look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not. Because evictor is supposed to only kick in when system turns idle after a busy period, pick a high ratio -- this makes it 50%. We thus keep the idea of increasing scan rate when its likely that table contains many expired entries. In order to not let timed-out entries hang around for too long (important when using event logging, in which case we want to timely destroy events), we now scan the full table within at most GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all timed-out entries sit in same slot. I tested this with a vm under synflood (with sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3). While flood is ongoing, interval now stays at its max rate (GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms). With feedback from Nicolas Dichtel. Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Fixes: `b87a2f9199` ("netfilter: conntrack: add gc worker to remove timed-out entries") Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-19 14:28:01 +01:00
Florian Westphal	524b698db0	netfilter: conntrack: remove GC_MAX_EVICTS break Instead of breaking loop and instant resched, don't bother checking this in first place (the loop calls cond_resched for every bucket anyway). Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-19 14:27:41 +01:00
Gao Feng	1a28ad74eb	netfilter: nf_tables: eliminate useless condition checks The return value of nf_tables_table_lookup() is valid pointer or one pointer error. There are two cases: 1) IS_ERR(table) is true, it would return the error or reset the table as NULL, it is unnecessary to perform the latter check "table != NULL". 2) IS_ERR(obj) is false, the table is one valid pointer. It is also unnecessary to perform that check. The nf_tables_newset() and nf_tables_newobj() have same logic codes. In summary, we could move the block of condition check "table != NULL" in the else block to eliminate the original condition checks. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-18 21:10:29 +01:00
Liping Zhang	f169fd695b	netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family After adding the following nft rule, then ping 224.0.0.1: # nft add rule netdev t c pkttype host counter The warning complain message will be printed out again and again: WARNING: CPU: 0 PID: 10182 at net/netfilter/nft_meta.c:163 \ nft_meta_get_eval+0x3fe/0x460 [nft_meta] [...] Call Trace: <IRQ> dump_stack+0x85/0xc2 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 nft_meta_get_eval+0x3fe/0x460 [nft_meta] nft_do_chain+0xff/0x5e0 [nf_tables] So we should deal with PACKET_LOOPBACK in netdev family too. For ipv4, convert it to PACKET_BROADCAST/MULTICAST according to the destination address's type; For ipv6, convert it to PACKET_MULTICAST directly. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-18 20:32:43 +01:00
Liping Zhang	9a6d876262	netfilter: pkttype: unnecessary to check ipv6 multicast address Since there's no broadcast address in IPV6, so in ipv6 family, the PACKET_LOOPBACK must be multicast packets, there's no need to check it again. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-18 20:32:43 +01:00
William Breathitt Gray	e4670b058a	netfilter: Fix typo in NF_CONNTRACK Kconfig option description The NF_CONNTRACK Kconfig option description makes an incorrect reference to the "meta" expression where the "ct" expression would be correct.This patch fixes the respective typographical error. Fixes: `d497c63527` ("netfilter: add help information to new nf_tables Kconfig options") Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-16 14:23:02 +01:00
Liping Zhang	d21e540b4d	netfilter: nf_tables: fix possible oops when dumping stateful objects When dumping nft stateful objects, if NFTA_OBJ_TABLE and NFTA_OBJ_TYPE attributes are not specified either, filter will become NULL, so oops will happen(actually nft utility will always set NFTA_OBJ_TABLE attr, so I write a test program to make this happen): BUG: unable to handle kernel NULL pointer dereference at (null) IP: nf_tables_dump_obj+0x17c/0x330 [nf_tables] [...] Call Trace: ? nf_tables_dump_obj+0x5/0x330 [nf_tables] ? __kmalloc_reserve.isra.35+0x31/0x90 ? __alloc_skb+0x5b/0x1e0 netlink_dump+0x124/0x2a0 __netlink_dump_start+0x161/0x190 nf_tables_getobj+0xe8/0x280 [nf_tables] Fixes: `a9fea2a3c3` ("netfilter: nf_tables: allow to filter stateful object dumps by type") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-16 14:23:02 +01:00
Willem de Bruijn	ec23189049	xtables: extend matches and targets with .usersize In matches and targets that define a kernel-only tail to their xt_match and xt_target data structs, add a field .usersize that specifies up to where data is to be shared with userspace. Performed a search for comment "Used internally by the kernel" to find relevant matches and targets. Manually inspected the structs to derive a valid offsetof. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-09 17:24:55 +01:00
Willem de Bruijn	4915f7bbc4	xtables: use match, target and data copy_to_user helpers in compat Convert compat to copying entries, matches and targets one by one, using the xt_match_to_user and xt_target_to_user helper functions. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-09 17:24:55 +01:00
Willem de Bruijn	f32815d21d	xtables: add xt_match, xt_target and data copy_to_user functions xt_entry_target, xt_entry_match and their private data may contain kernel data. Introduce helper functions xt_match_to_user, xt_target_to_user and xt_data_to_user that copy only the expected fields. These replace existing logic that calls copy_to_user on entire structs, then overwrites select fields. Private data is defined in xt_match and xt_target. All matches and targets that maintain kernel data store this at the tail of their private structure. Extend xt_match and xt_target with .usersize to limit how many bytes of data are copied. The remainder is cleared. If compatsize is specified, usersize can only safely be used if all fields up to usersize use platform-independent types. Otherwise, the compat_to_user callback must be defined. This patch does not yet enable the support logic. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-09 17:24:53 +01:00
David S. Miller	d896b3120b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains accumulated Netfilter fixes for your net tree: 1) Ensure quota dump and reset happens iff we can deliver numbers to userspace. 2) Silence splat on incorrect use of smp_processor_id() from nft_queue. 3) Fix an out-of-bound access reported by KASAN in nf_tables_rule_destroy(), patch from Florian Westphal. 4) Fix layer 4 checksum mangling in the nf_tables payload expression with IPv6. 5) Fix a race in the CLUSTERIP target from control plane path when two threads run to add a new configuration object. Serialize invocations of clusterip_config_init() using spin_lock. From Xin Long. 6) Call br_nf_pre_routing_finish_bridge_finish() once we are done with the br_nf_pre_routing_finish() hook. From Artur Molchanov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-05 11:49:57 -05:00
Geliang Tang	4cc4b72c13	netfilter: xt_connlimit: use rb_entry() To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-05 13:27:02 +01:00
Davide Caratti	cf6e007eef	netfilter: conntrack: validate SCTP crc32c in PREROUTING implement sctp_error to let nf_conntrack_in validate crc32c on the packet transport header. Assign skb->ip_summed to CHECKSUM_UNNECESSARY and return NF_ACCEPT in case of successful validation; otherwise, return -NF_ACCEPT to let netfilter skip connection tracking, like other protocols do. Besides preventing corrupted packets from matching conntrack entries, this fixes functionality of REJECT target: it was not generating any ICMP upon reception of SCTP packets, because it was computing RFC 1624 checksum on the packet and systematically mismatching crc32c in the SCTP header. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-05 13:24:47 +01:00
Davide Caratti	300ae14946	netfilter: select LIBCRC32C together with SCTP conntrack nf_conntrack needs to compute crc32c when dealing with SCTP packets. Moreover, NF_NAT_PROTO_SCTP (currently selecting LIBCRC32C) can be enabled only if conntrack support for SCTP is enabled. Therefore, move enabling of kernel support for crc32c so that it is selected when NF_CT_PROTO_SCTP=y. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-05 13:24:43 +01:00
Liping Zhang	949a358418	netfilter: nft_ct: add average bytes per packet support Similar to xt_connbytes, user can match how many average bytes per packet a connection has transferred so far. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-03 14:33:26 +01:00
Florian Westphal	9700ba805b	netfilter: nat: merge udp and udplite helpers udplite nat was copied from udp nat, they are virtually 100% identical. Not really surprising given udplite is just udp with partial csum coverage. old: text data bss dec hex filename 11606 1457 210 13273 33d9 nf_nat.ko 330 0 2 332 14c nf_nat_proto_udp.o 276 0 2 278 116 nf_nat_proto_udplite.o new: text data bss dec hex filename 11598 1457 210 13265 33d1 nf_nat.ko 640 0 4 644 284 nf_nat_proto_udp.o Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-03 14:33:25 +01:00
Florian Westphal	e4781421e8	netfilter: merge udp and udplite conntrack helpers udplite was copied from udp, they are virtually 100% identical. This adds udplite tracker to udp instead, removes udplite module, and then makes the udplite tracker builtin. udplite will then simply re-use udp timeout settings. It makes little sense to add separate sysctls, nowadays we have fine-grained timeout policy support via the CT target. old: text data bss dec hex filename 1633 672 0 2305 901 nf_conntrack_proto_udp.o 1756 672 0 2428 97c nf_conntrack_proto_udplite.o 69526 17937 268 87731 156b3 nf_conntrack.ko new: text data bss dec hex filename 2442 1184 0 3626 e2a nf_conntrack_proto_udp.o 68565 17721 268 86554 1521a nf_conntrack.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2017-01-03 14:33:25 +01:00
Thomas Gleixner	2456e85535	ktime: Get rid of the union ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64 bit machine and in a endianess optimized timespec variant for 32bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but become completely pointless. Get rid of the union and just keep ktime_t as simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>	2016-12-25 17:21:22 +01:00
Linus Torvalds	7c0f6ba682	Replace <asm/uaccess.h> with <linux/uaccess.h> globally This was entirely automated, using the script by Al: PATT='^[[:blank:]]#[[:blank:]]include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"\|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-24 11:46:01 -08:00
Pablo Neira Ayuso	053d20f571	netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set If the NFT_PAYLOAD_L4CSUM_PSEUDOHDR flag is set, then mangle layer 4 checksum. This should not depend on csum_type NFT_PAYLOAD_CSUM_INET since IPv6 header has no checksum field, but still an update of any of the pseudoheader fields may trigger a layer 4 checksum update. Fixes: `1814096980` ("netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-14 23:39:11 +01:00
Florian Westphal	3e38df136e	netfilter: nf_tables: fix oob access BUG: KASAN: slab-out-of-bounds in nf_tables_rule_destroy+0xf1/0x130 at addr ffff88006a4c35c8 Read of size 8 by task nft/1607 When we've destroyed last valid expr, nft_expr_next() returns an invalid expr. We must not dereference it unless it passes != nft_expr_last() check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-14 23:39:06 +01:00
Pablo Neira Ayuso	c2e756ff9e	netfilter: nft_queue: use raw_smp_processor_id() Using smp_processor_id() causes splats with PREEMPT_RCU: [19379.552780] BUG: using smp_processor_id() in preemptible [00000000] code: ping/32389 [19379.552793] caller is debug_smp_processor_id+0x17/0x19 [...] [19379.552823] Call Trace: [19379.552832] [<ffffffff81274e9e>] dump_stack+0x67/0x90 [19379.552837] [<ffffffff8129a4d4>] check_preemption_disabled+0xe5/0xf5 [19379.552842] [<ffffffff8129a4fb>] debug_smp_processor_id+0x17/0x19 [19379.552849] [<ffffffffa07c42dd>] nft_queue_eval+0x35/0x20c [nft_queue] No need to disable preemption since we only fetch the numeric value, so let's use raw_smp_processor_id() instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-14 23:39:01 +01:00
Pablo Neira Ayuso	8010d7feb2	netfilter: nft_quota: reset quota after dump Dumping of netlink attributes may fail due to insufficient room in the skbuff, so let's reset consumed quota if we succeed to put netlink attributes into the skbuff. Fixes: `43da04a593` ("netfilter: nf_tables: atomic dump and reset for stateful objects") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-14 23:38:51 +01:00
Pablo Neira	d84701ecbc	netfilter: nft_counter: rework atomic dump and reset Dump and reset doesn't work unless cmpxchg64() is used both from packet and control plane paths. This approach is going to be slow though. Instead, use a percpu seqcount to fetch counters consistently, then subtract bytes and packets in case a reset was requested. The cpu that running over the reset code is guaranteed to own this stats exclusively, we have to turn counters into signed 64bit though so stats update on reset don't get wrong on underflow. This patch is based on original sketch from Eric Dumazet. Fixes: `43da04a593` ("netfilter: nf_tables: atomic dump and reset for stateful objects") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-11 10:01:05 -05:00
David S. Miller	5fccd64aa4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains a large Netfilter update for net-next, to summarise: 1) Add support for stateful objects. This series provides a nf_tables native alternative to the extended accounting infrastructure for nf_tables. Two initial stateful objects are supported: counters and quotas. Objects are identified by a user-defined name, you can fetch and reset them anytime. You can also use a maps to allow fast lookups using any arbitrary key combination. More info at: http://marc.info/?l=netfilter-devel&m=148029128323837&w=2 2) On-demand registration of nf_conntrack and defrag hooks per netns. Register nf_conntrack hooks if we have a stateful ruleset, ie. state-based filtering or NAT. The new nf_conntrack_default_on sysctl enables this from newly created netnamespaces. Default behaviour is not modified. Patches from Florian Westphal. 3) Allocate 4k chunks and then use these for x_tables counter allocation requests, this improves ruleset load time and also datapath ruleset evaluation, patches from Florian Westphal. 4) Add support for ebpf to the existing x_tables bpf extension. From Willem de Bruijn. 5) Update layer 4 checksum if any of the pseudoheader fields is updated. This provides a limited form of 1:1 stateless NAT that make sense in specific scenario, eg. load balancing. 6) Add support to flush sets in nf_tables. This series comes with a new set->ops->deactivate_one() indirection given that we have to walk over the list of set elements, then deactivate them one by one. The existing set->ops->deactivate() performs an element lookup that we don't need. 7) Two patches to avoid cloning packets, thus speed up packet forwarding via nft_fwd from ingress. From Florian Westphal. 8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to prevent infinite loops, patch from Dwip Banerjee. And one minor refactoring from Gao feng. 9) Revisit recent log support for nf_tables netdev families: One patch to ensure that we correctly handle non-ethernet packets. Another patch to add missing logger definition for netdev. Patches from Liping Zhang. 10) Three patches for nft_fib, one to address insufficient register initialization and another to solve incorrect (although harmless) byteswap operation. Moreover update xt_rpfilter and nft_fib to match lbcast packets with zeronet as source, eg. DHCP Discover packets (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang. 11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has been broken in many-cast mode for some little time, let's give them a chance by placing them at the same level as other existing protocols. Thus, users don't explicitly have to modprobe support for this and NAT rules work for them. Some people point to the lack of support in SOHO Linux-based routers that make deployment of new protocols harder. I guess other middleboxes outthere on the Internet are also to blame. Anyway, let's see if this has any impact in the midrun. 12) Skip software SCTP software checksum calculation if the NIC comes with SCTP checksum offload support. From Davide Caratti. 13) Initial core factoring to prepare conversion to hook array. Three patches from Aaron Conole. 14) Gao Feng made a wrong conversion to switch in the xt_multiport extension in a patch coming in the previous batch. Fix it in this batch. 15) Get vmalloc call in sync with kmalloc flags to avoid a warning and likely OOM killer intervention from x_tables. From Marcelo Ricardo Leitner. 16) Update Arturo Borrero's email address in all source code headers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-07 19:16:46 -05:00
Pablo Neira Ayuso	73c25fb139	netfilter: nft_quota: allow to restore consumed quota Allow to restore consumed quota, this is useful to restore the quota state across reboots. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 14:40:53 +01:00
Willem de Bruijn	2c16d60332	netfilter: xt_bpf: support ebpf Add support for attaching an eBPF object by file descriptor. The iptables binary can be called with a path to an elf object or a pinned bpf object. Also pass the mode and path to the kernel to be able to return it later for iptables dump and save. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:32:35 +01:00
Marcelo Ricardo Leitner	5bad87348c	netfilter: x_tables: avoid warn and OOM killer on vmalloc call Andrey Konovalov reported that this vmalloc call is based on an userspace request and that it's spewing traces, which may flood the logs and cause DoS if abused. Florian Westphal also mentioned that this call should not trigger OOM killer. This patch brings the vmalloc call in sync to kmalloc and disables the warn trace on allocation failure and also disable OOM killer invocation. Note, however, that under such stress situation, other places may trigger OOM killer invocation. Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:41 +01:00
Pablo Neira Ayuso	8411b6442e	netfilter: nf_tables: support for set flushing This patch adds support for set flushing, that consists of walking over the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set. This patch requires the following changes: 1) Add set->ops->deactivate_one() operation: This allows us to deactivate an element from the set element walk path, given we can skip the lookup that happens in ->deactivate(). 2) Add a new nft_trans_alloc_gfp() function since we need to allocate transactions using GFP_ATOMIC given the set walk path happens with held rcu_read_lock. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:40 +01:00
Pablo Neira Ayuso	37df5301a3	netfilter: nft_set: introduce nft_{hash, rbtree}_deactivate_one() This new function allows us to deactivate one single element, this is required by the set flush command that comes in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:31:02 +01:00
Pablo Neira Ayuso	1a37ef769d	netfilter: nf_tables: constify struct nft_ctx * parameter in nft_trans_alloc() Context is not modified by nft_trans_alloc(), so constify it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:51 +01:00
Davide Caratti	3189a290f9	netfilter: nat: skip checksum on offload SCTP packets SCTP GSO and hardware can do CRC32c computation after netfilter processing, so we can avoid calling sctp_compute_checksum() on skb if skb->ip_summed is equal to CHECKSUM_PARTIAL. Moreover, set skb->ip_summed to CHECKSUM_NONE when the NAT code computes the CRC, to prevent offloaders from computing it again (on ixgbe this resulted in a transmission with wrong L4 checksum). Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:50 +01:00
Pablo Neira Ayuso	a9fea2a3c3	netfilter: nf_tables: allow to filter stateful object dumps by type This patch adds the netlink code to filter out dump of stateful objects, through the NFTA_OBJ_TYPE netlink attribute. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:49 +01:00
Pablo Neira Ayuso	63aea29060	netfilter: nft_objref: support for stateful object maps This patch allows us to refer to stateful object dictionaries, the source register indicates the key data to be used to look up for the corresponding state object. We can refer to these maps through names or, alternatively, the map transaction id. This allows us to refer to both anonymous and named maps. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:48 +01:00
Pablo Neira Ayuso	8aeff920dc	netfilter: nf_tables: add stateful object reference to set elements This patch allows you to refer to stateful objects from set elements. This provides the infrastructure to create maps where the right hand side of the mapping is a stateful object. This allows us to build dictionaries of stateful objects, that you can use to perform fast lookups using any arbitrary key combination. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:47 +01:00
Pablo Neira Ayuso	1896531710	netfilter: nft_quota: add depleted flag for objects Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag indicates we have reached overquota. Add pointer to table from nft_object, so we can use it when sending the depletion notification to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 13:22:12 +01:00
Pablo Neira Ayuso	2599e98934	netfilter: nf_tables: notify internal updates of stateful objects Introduce nf_tables_obj_notify() to notify internal state changes in stateful objects. This is used by the quota object to report depletion in a follow up patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:57:20 +01:00
Pablo Neira Ayuso	43da04a593	netfilter: nf_tables: atomic dump and reset for stateful objects This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic dump-and-reset of the stateful object. This also comes with add support for atomic dump and reset for counter and quota objects. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:56:57 +01:00
Pablo Neira Ayuso	795595f68d	netfilter: nft_quota: dump consumed quota Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of quota that has been already consumed. This allows us to restore the internal state of the quota object between reboots as well as to monitor how wasted it is. This patch changes the logic to account for the consumed bytes, instead of the bytes that remain to be consumed. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-07 12:54:22 +01:00
Pablo Neira Ayuso	c97d22e68b	netfilter: nf_tables: add stateful object reference expression This new expression allows us to refer to existing stateful objects from rules. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:25 +01:00
Pablo Neira Ayuso	173705d9a2	netfilter: nft_quota: add stateful object type Register a new quota stateful object type into the new stateful object infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:24 +01:00
Pablo Neira Ayuso	b1ce0ced10	netfilter: nft_counter: add stateful object type Register a new percpu counter stateful object type into the stateful object infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:23 +01:00
Pablo Neira Ayuso	e50092404c	netfilter: nf_tables: add stateful objects This patch augments nf_tables to support stateful objects. This new infrastructure allows you to create, dump and delete stateful objects, that are identified by a user-defined name. This patch adds the generic infrastructure, follow up patches add support for two stateful objects: counters and quotas. This patch provides a native infrastructure for nf_tables to replace nfacct, the extended accounting infrastructure for iptables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:22 +01:00
Florian Westphal	3bf3276119	netfilter: add and use nf_fwd_netdev_egress ... so we can use current skb instead of working with a clone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:22 +01:00
Gao Feng	1ed9887ee3	netfilter: xt_multiport: Fix wrong unmatch result with multiple ports I lost one test case in the last commit for xt_multiport. For example, the rule is "-m multiport --dports 22,80,443". When first port is unmatched and the second is matched, the curent codes could not return the right result. It would return false directly when the first port is unmatched. Fixes: `dd2602d00f` ("netfilter: xt_multiport: Use switch case instead of multiple condition checks") Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:48:20 +01:00
Pablo Neira Ayuso	1814096980	netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields This patch adds a new flag that signals the kernel to update layer 4 checksum if the packet field belongs to the layer 4 pseudoheader. This implicitly provides stateless NAT 1:1 that is useful under very specific usecases. Since rules mangling layer 3 fields that are part of the pseudoheader may potentially convey any layer 4 packet, we have to deal with the layer 4 checksum adjustment using protocol specific code. This patch adds support for TCP, UDP and ICMPv6, since they include the pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we can skip it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:47:54 +01:00
Liping Zhang	11583438b7	netfilter: nft_fib: convert htonl to ntohl properly Acctually ntohl and htonl are identical, so this doesn't affect anything, but it is conceptually wrong. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:20 +01:00
Florian Westphal	ae0ac0ed6f	netfilter: x_tables: pack percpu counter allocations instead of allocating each xt_counter individually, allocate 4k chunks and then use these for counter allocation requests. This should speed up rule evaluation by increasing data locality, also speeds up ruleset loading because we reduce calls to the percpu allocator. As Eric points out we can't use PAGE_SIZE, page_allocator would fail on arches with 64k page size. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:19 +01:00
Florian Westphal	f28e15bace	netfilter: x_tables: pass xt_counters struct to counter allocator Keeps some noise away from a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:18 +01:00
Florian Westphal	4d31eef517	netfilter: x_tables: pass xt_counters struct instead of packet counter On SMP we overload the packet counter (unsigned long) to contain percpu offset. Hide this from callers and pass xt_counters address instead. Preparation patch to allocate the percpu counters in page-sized batch chunks. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:17 +01:00
Aaron Conole	679972f3be	netfilter: convert while loops to for loops This is to facilitate converting from a singly-linked list to an array of elements. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:16 +01:00
Aaron Conole	0aa8c57a04	netfilter: introduce accessor functions for hook entries This allows easier future refactoring. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:15 +01:00
Florian Westphal	834184b1f3	netfilter: defrag: only register defrag functionality if needed nf_defrag modules for ipv4 and ipv6 export an empty stub function. Any module that needs the defragmentation hooks registered simply 'calls' this empty function to create a phony module dependency -- modprobe will then load the defrag module too. This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook registration until the functionality is requested within a network namespace instead of module load time for all namespaces. Hooks are only un-registered on module unload or when a namespace that used such defrag functionality exits. We have to use struct net for this as the register hooks can be called before netns initialization here from the ipv4/ipv6 conntrack module init path. There is no unregister functionality support, defrag will always be active once it was requested inside a net namespace. The reason is that defrag has impact on nft and iptables rulesets (without defrag we might see framents). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-06 21:42:00 +01:00
Eric Dumazet	1c0d32fde5	net_sched: gen_estimator: complete rewrite of rate estimators 1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-05 15:21:59 -05:00
Florian Westphal	481fa37347	netfilter: conntrack: add nf_conntrack_default_on sysctl This switch (default on) can be used to disable automatic registration of connection tracking functionality in newly created network namespaces. This means that when net namespace goes down (or the tracker protocol module is unloaded) we might have to unregister the hooks. We can either add another per-netns variable that tells if the hooks got registered by default, or, alternatively, just call the protocol _put() function and have the callee deal with a possible 'extra' put() operation that doesn't pair with a get() one. This uses the latter approach, i.e. a put() without a get has no effect. Conntrack is still enabled automatically regardless of the new sysctl setting if the new net namespace requires connection tracking, e.g. when NAT rules are created. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:25 +01:00
Florian Westphal	0c66dc1ea3	netfilter: conntrack: register hooks in netns when needed by ruleset This makes use of nf_ct_netns_get/put added in previous patch. We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6 then implement use-count to track how many users (nft or xtables modules) have a dependency on ipv4 and/or ipv6 connection tracking functionality. When count reaches zero, the hooks are unregistered. This delays activation of connection tracking inside a namespace until stateful firewall rule or nat rule gets added. This patch breaks backwards compatibility in the sense that connection tracking won't be active anymore when the protocol tracker module is loaded. This breaks e.g. setups that ctnetlink for flow accounting and the like, without any '-m conntrack' packet filter rules. Followup patch restores old behavour and makes new delayed scheme optional via sysctl. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:24 +01:00
Florian Westphal	20afd42397	netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressions so that conntrack core will add the needed hooks in this namespace. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:17:16 +01:00
Florian Westphal	a357b3f80b	netfilter: nat: add dependencies on conntrack module MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the conntrack module. However, since the conntrack hooks are now registered in a lazy fashion (i.e., only when needed) a symbol reference is not enough. Thus, when something is added to a nat table, make sure that it will see packets by calling nf_ct_netns_get() which will register the conntrack hooks in the current netns. An alternative would be to add these dependencies to the NAT table. However, that has problems when using non-modular builds -- we might register e.g. ipv6 conntrack before its initcall has run, leading to NULL deref crashes since its per-netns storage has not yet been allocated. Adding the dependency in the modules instead has the advantage that nat table also does not register its hooks until rules are added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:51 +01:00
Florian Westphal	ecb2421b5d	netfilter: add and use nf_ct_netns_get/put currently aliased to try_module_get/_put. Will be changed in next patch when we add functions to make use of ->net argument to store usercount per l3proto tracker. This is needed to avoid registering the conntrack hooks in all netns and later only enable connection tracking in those that need conntrack. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:50 +01:00
Florian Westphal	a379854d91	netfilter: conntrack: remove unused init_net hook since `adf0516845` ("netfilter: remove ip_conntrack* sysctl compat code") the only user (ipv4 tracker) sets this to an empty stub function. After this change nf_ct_l3proto_pernet_register() is also empty, but this will change in a followup patch to add conditional register of the hooks. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 21:16:41 +01:00
Davide Caratti	9b91c96c5d	netfilter: conntrack: built-in support for UDPlite CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y, connection tracking support for UDPlite protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| udplite\| ipv4 \| ipv6 \|nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 432538 \| 828755 \| 828676 \| 6141434 UDPlite \|\| - \| 829649 \| 829362 \| 6498204 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:57:36 +01:00
Davide Caratti	a85406afeb	netfilter: conntrack: built-in support for SCTP CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection tracking support for SCTP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| sctp \| ipv4 \| ipv6 \| nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 498243 \| 828755 \| 828676 \| 6141434 SCTP \|\| - \| 829254 \| 829175 \| 6547872 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:55:37 +01:00
Davide Caratti	c51d39010a	netfilter: conntrack: built-in support for DCCP CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection tracking support for DCCP protocol is built-in into nf_conntrack.ko. footprint test: $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \ net/ipv4/netfilter/nf_conntrack_ipv4.ko \ net/ipv6/netfilter/nf_conntrack_ipv6.ko (builtin)\|\| dccp \| ipv4 \| ipv6 \| nf_conntrack ---------++--------+--------+--------+-------------- none \|\| 469140 \| 828755 \| 828676 \| 6141434 DCCP \|\| - \| 830566 \| 829935 \| 6533526 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:53:15 +01:00
Pablo Neira Ayuso	f6b3ef5e38	Merge tag 'ipvs-for-v4.10' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.10 please consider these enhancements to the IPVS for v4.10. * Decrement the IP ttl in all the modes in order to prevent infinite route loops. Thanks to Dwip Banerjee. * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:46:16 +01:00
Liping Zhang	a7647080d3	netfilter: nfnetlink_log: add "nf-logger-5-1" module alias name So we can autoload nfnetlink_log.ko when the user adding nft log group X rule in netdev family. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:34 +01:00
Liping Zhang	673ab46f34	netfilter: nf_log: do not assume ethernet header in netdev family In netdev family, we will handle non ethernet packets, so using eth_hdr(skb)->h_proto is incorrect. Meanwhile, we can use socket(AF_PACKET...) to sending packets, so skb->protocol is not always set in bridge family. Add an extra parameter into nf_log_l2packet to solve this issue. Fixes: `1fddf4bad0` ("netfilter: nf_log: add packet logging for netdev family") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:33 +01:00
Davide Caratti	b8ad652f97	netfilter: built-in NAT support for UDPlite CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT support for UDPlite protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \|udplite \|\| nf_nat --------------------------+--------++-------- no builtin \| 408048 \|\| 2241312 UDPLITE builtin \| - \|\| 2577256 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:32 +01:00
Davide Caratti	7a2dd28c70	netfilter: built-in NAT support for SCTP CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT support for SCTP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \| sctp \|\| nf_nat --------------------------+--------++-------- no builtin \| 428344 \|\| 2241312 SCTP builtin \| - \|\| 2597032 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:31 +01:00
Davide Caratti	0c4e966eaf	netfilter: built-in NAT support for DCCP CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT support for DCCP protocol is built-in into nf_nat.ko. footprint test: (nf_nat_proto_) \| dccp \|\| nf_nat --------------------------+--------++-------- no builtin \| 409800 \|\| 2241312 DCCP builtin \| - \|\| 2578968 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:30 +01:00
Arturo Borrero Gonzalez	cd72751468	netfilter: update Arturo Borrero Gonzalez email address The email address has changed, let's update the copyright statements. Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-12-04 20:45:25 +01:00
David S. Miller	2745529ac7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Couple conflicts resolved here: 1) In the MACB driver, a bug fix to properly initialize the RX tail pointer properly overlapped with some changes to support variable sized rings. 2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix overlapping with a reorganization of the driver to support ACPI, OF, as well as PCI variants of the chip. 3) In 'net' we had several probe error path bug fixes to the stmmac driver, meanwhile a lot of this code was cleaned up and reorganized in 'net-next'. 4) The cls_flower classifier obtained a helper function in 'net-next' called __fl_delete() and this overlapped with Daniel Borkamann's bug fix to use RCU for object destruction in 'net'. It also overlapped with Jiri's change to guard the rhashtable_remove_fast() call with a check against tc_skip_sw(). 5) In mlx4, a revert bug fix in 'net' overlapped with some unrelated changes in 'net-next'. 6) In geneve, a stale header pointer after pskb_expand_head() bug fix in 'net' overlapped with a large reorganization of the same code in 'net-next'. Since the 'net-next' code no longer had the bug in question, there was nothing to do other than to simply take the 'net-next' hunks. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-12-03 12:29:53 -05:00
Liping Zhang	49cdc4c749	netfilter: nft_range: add the missing NULL pointer check Otherwise, kernel panic will happen if the user does not specify the related attributes. Fixes: `0f3cd9b369` ("netfilter: nf_tables: add range expression") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-11-24 14:43:35 +01:00

... 3 4 5 6 7 ...

4119 Commits (redonkable)