Commit graph

4959 commits

Author SHA1 Message Date
David S. Miller 3ebb41bf47 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Postpone chain policy update to drop after transaction is complete,
   from Florian Westphal.

2) Add entry to flowtable after confirmation to fix UDP flows with
   packets going in one single direction.

3) Reference count leak in dst object, from Taehee Yoo.

4) Check for TTL field in flowtable datapath, from Taehee Yoo.

5) Fix h323 conntrack helper due to incorrect boundary check,
   from Jakub Jankowski.

6) Fix incorrect rcu dereference when fetching basechain stats,
   from Florian Westphal.

7) Missing error check when adding new entries to flowtable,
   from Taehee Yoo.

8) Use version field in nfnetlink message to honor the nfgen_family
   field, from Kristian Evensen.

9) Remove incorrect configuration check for CONFIG_NF_CONNTRACK_IPV6,
   from Subash Abhinov Kasiviswanathan.

10) Prevent dying entries from being added to the flowtable,
    from Taehee Yoo.

11) Don't hit WARN_ON() with malformed blob in ebtables with
    trailing data after last rule, reported by syzbot, patch
    from Florian Westphal.

12) Remove NFT_CT_TIMEOUT enumeration, never used in the kernel
    code.

13) Fix incorrect definition for NFT_LOGLEVEL_MAX, from Florian
    Westphal.

This batch comes with a conflict that can be fixed with this patch:

diff --cc include/uapi/linux/netfilter/nf_tables.h
index 7bdb234f3d8c,f0cf7b0f4f35..505393c6e959
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@@ -966,6 -966,8 +966,7 @@@ enum nft_socket_keys
   * @NFT_CT_DST_IP: conntrack layer 3 protocol destination (IPv4 address)
   * @NFT_CT_SRC_IP6: conntrack layer 3 protocol source (IPv6 address)
   * @NFT_CT_DST_IP6: conntrack layer 3 protocol destination (IPv6 address)
 - * @NFT_CT_TIMEOUT: connection tracking timeout policy assigned to conntrack
+  * @NFT_CT_ID: conntrack id
   */
  enum nft_ct_keys {
  	NFT_CT_STATE,
@@@ -991,6 -993,8 +992,7 @@@
  	NFT_CT_DST_IP,
  	NFT_CT_SRC_IP6,
  	NFT_CT_DST_IP6,
 -	NFT_CT_TIMEOUT,
+ 	NFT_CT_ID,
  	__NFT_CT_MAX
  };
  #define NFT_CT_MAX		(__NFT_CT_MAX - 1)

That replaces the unused NFT_CT_TIMEOUT definition by NFT_CT_ID. If you prefer,
I can also solve this conflict here, just let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-13 08:55:15 -07:00
Taehee Yoo 8cd2bc981c netfilter: nf_flow_table: do not flow offload deleted conntrack entries
Conntrack entries can be deleted by the masquerade module. In that case,
flow offload should be deleted too, but GC and data-path of flow offload
do not check for conntrack status bits, hence flow offload entries will
be removed only by the timeout.

Update garbage collector and data-path to check for ct->status. If
IPS_DYING_BIT is set, garbage collector removes flow offload entries and
data-path routine ignores them.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:09 +02:00
Subash Abhinov Kasiviswanathan b33c448c4f netfilter: nf_conntrack_h323: Remove deprecated config check
CONFIG_NF_CONNTRACK_IPV6 has been deprecated so replace it with a check
for IPV6 instead.

Use nf_ip6_route6() instead of v6ops->route() and keep the IS_MODULE()
in nf_ipv6_ops as mentioned by Florian so that direct calls are used
when IPV6 is builtin and indirect calls are used only when IPV6 is a
module.

Fixes: a0ae2562c6 ("netfilter: conntrack: remove l3proto abstraction")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:09 +02:00
Kristian Evensen f8e6089820 netfilter: ctnetlink: Resolve conntrack L3-protocol flush regression
Commit 59c08c69c2 ("netfilter: ctnetlink: Support L3 protocol-filter
on flush") introduced a user-space regression when flushing connection
track entries. Before this commit, the nfgen_family field was not used
by the kernel and all entries were removed. Since this commit,
nfgen_family is used to filter out entries that should not be removed.
One example a broken tool is conntrack. conntrack always sets
nfgen_family to AF_INET, so after 59c08c69c2 only IPv4 entries were
removed with the -F parameter.

Pablo Neira Ayuso suggested using nfgenmsg->version to resolve the
regression, and this commit implements his suggestion. nfgenmsg->version
is so far set to zero, so it is well-suited to be used as a flag for
selecting old or new flush behavior. If version is 0, nfgen_family is
ignored and all entries are used. If user-space sets the version to one
(or any other value than 0), then the new behavior is used. As version
only can have two valid values, I chose not to add a new
NFNETLINK_VERSION-constant.

Fixes: 59c08c69c2 ("netfilter: ctnetlink: Support L3 protocol-filter on flush")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:02 +02:00
Gustavo A. R. Silva eabb478219 netfilter: xt_hashlimit: use struct_size() helper
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace code of the following form:

sizeof(struct xt_hashlimit_htable) + sizeof(struct hlist_head) * size

with:

struct_size(hinfo, hash, size)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 01:03:04 +02:00
Taehee Yoo 43c8f13118 netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast
rhashtable_insert_fast() may return an error value when memory
allocation fails, but flow_offload_add() does not check for errors.
This patch just adds missing error checking.

Fixes: ac2a66665e ("netfilter: add generic flow table infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:30 +02:00
Florian Westphal edbd82c5fb netfilter: nf_tables: fix base chain stat rcu_dereference usage
Following splat gets triggered when nfnetlink monitor is running while
xtables-nft selftests are running:

net/netfilter/nf_tables_api.c:1272 suspicious rcu_dereference_check() usage!
other info that might help us debug this:

1 lock held by xtables-nft-mul/27006:
 #0: 00000000e0f85be9 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1a/0x50
Call Trace:
 nf_tables_fill_chain_info.isra.45+0x6cc/0x6e0
 nf_tables_chain_notify+0xf8/0x1a0
 nf_tables_commit+0x165c/0x1740

nf_tables_fill_chain_info() can be called both from dumps (rcu read locked)
or from the transaction path if a userspace process subscribed to nftables
notifications.

In the 'table dump' case, rcu_access_pointer() cannot be used: We do not
hold transaction mutex so the pointer can be NULLed right after the check.
Just unconditionally fetch the value, then have the helper return
immediately if its NULL.

In the notification case we don't hold the rcu read lock, but updates are
prevented due to transaction mutex. Use rcu_dereference_check() to make lockdep
aware of this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:29 +02:00
Jakub Jankowski f5e85ce8e7 netfilter: nf_conntrack_h323: restore boundary check correctness
Since commit bc7d811ace ("netfilter: nf_ct_h323: Convert
CHECK_BOUND macro to function"), NAT traversal for H.323
doesn't work, failing to parse H323-UserInformation.
nf_h323_error_boundary() compares contents of the bitstring,
not the addresses, preventing valid H.323 packets from being
conntrack'd.

This looks like an oversight from when CHECK_BOUND macro was
converted to a function.

To fix it, stop dereferencing bs->cur and bs->end.

Fixes: bc7d811ace ("netfilter: nf_ct_h323: Convert CHECK_BOUND macro to function")
Signed-off-by: Jakub Jankowski <shasta@toxcorp.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:17 +02:00
Colin Ian King 1de6f33421 netfilter: connlabels: fix spelling mistake "trackling" -> "tracking"
There is a spelling mistake in the module description. Fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:57 +02:00
Brett Mastbergen 3087c3f7c2 netfilter: nft_ct: Add ct id support
The 'id' key returns the unique id of the conntrack entry as returned
by nf_ct_get_id().

Signed-off-by: Brett Mastbergen <bmastbergen@untangle.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:57 +02:00
Flavio Leitner 53b11308a1 netfilter: nf_nat: register NAT helpers.
Register amanda, ftp, irc, sip and tftp NAT helpers.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:55 +02:00
Flavio Leitner 08010a2160 netfilter: add API to manage NAT helpers.
The API allows a conntrack helper to indicate its corresponding
NAT helper which then can be loaded and reference counted.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:55 +02:00
Flavio Leitner e1f172e162 netfilter: use macros to create module aliases.
Each NAT helper creates a module alias which follows a pattern.
Use macros for consistency.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:54 +02:00
Tonghao Zhang 8f14c99c7e netfilter: conntrack: limit sysctl setting for boolean options
We use the zero and one to limit the boolean options setting.
After this patch we only set 0 or 1 to boolean options for nf
conntrack sysctl.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:18:56 +02:00
Taehee Yoo 33cc3c0cfa netfilter: nf_flow_table: check ttl value in flow offload data path
nf_flow_offload_ip_hook() and nf_flow_offload_ipv6_hook() do not check
ttl value. So, ttl value overflow may occur.

Fixes: 97add9f0d6 ("netfilter: flow table support for IPv4")
Fixes: 0995210753 ("netfilter: flow table support for IPv6")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:56:19 +02:00
Taehee Yoo 26a302afbe netfilter: nf_flow_table: fix netdev refcnt leak
flow_offload_alloc() calls nf_route() to get a dst_entry. Internally,
nf_route() calls ip_route_output_key() that allocates a dst_entry and
holds it. So, a dst_entry should be released by dst_release() if
nf_route() is successful.

Otherwise, netns exit routine cannot be finished and the following
message is printed:

[  257.490952] unregister_netdevice: waiting for lo to become free. Usage count = 1

Fixes: ac2a66665e ("netfilter: add generic flow table infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:55:00 +02:00
Pablo Neira Ayuso 270a8a297f netfilter: nft_flow_offload: add entry to flowtable after confirmation
This is fixing flow offload for UDP traffic where packets only follow
one single direction.

The flow_offload_fixup_tcp() mechanism works fine in case that the
offloaded entry remains in SYN_RECV state, given sequence tracking is
reset and that conntrack handles syn+ack packets as a retransmission, ie.

	sES + synack => sIG

for reply traffic.

Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:48:23 +02:00
Florian Westphal 66293c46c9 netfilter: nf_tables: delay chain policy update until transaction is complete
When we process a long ruleset of the form

chain input {
   type filter hook input priority filter; policy drop;
   ...
}

Then the base chain gets registered early on, we then continue to
process/validate the next messages coming in the same transaction.

Problem is that if the base chain policy is 'drop', it will take effect
immediately, which causes all traffic to get blocked until the
transaction completes or is aborted.

Fix this by deferring the policy until the transaction has been
processed and all of the rules have been flagged as active.

Reported-by: Jann Haber <jann.haber@selfnet.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:47:32 +02:00
Paul Gortmaker c5f1931f66 netfilter: nf_tables: fix implicit include of module.h
This file clearly uses modular infrastructure but does not call
out the inclusion of <linux/module.h> explicitly.  We add that
include explicitly here, so we can tidy up some header usage
elsewhere w/o causing build breakage.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:26 +02:00
Paul Gortmaker f1f86d09ca netfilter: nf_tables: relocate header content to consumer
The nf_tables.h header is used in a lot of files, but it turns out
that there is only one actual user of nft_expr_clone().

Hence we relocate that function to be with the one consumer of it
and avoid having to process it with CPP for all the other files.

This will also enable a reduction in the other headers that the
nf_tables.h itself has to include just to be stand-alone, hence
a pending further significant reduction in the CPP content that
needs to get processed for each netfilter file.

Note that the explicit "inline" has been dropped as part of this
relocation.  In similar changes to this, I believe Dave has asked
this be done, so we free up gcc to make the choice of whether to
inline or not.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:17 +02:00
Johannes Berg ef6243acb4 genetlink: optionally validate strictly/dumps
Add options to strictly validate messages and dump messages,
sometimes perhaps validating dump messages non-strictly may
be required, so add an option for that as well.

Since none of this can really be applied to existing commands,
set the options everwhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
     {
            .cmd = X,
    +       .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
            ...
     },
    ...
    };

For new commands one should just not copy the .validate 'opt-out'
flags and thus get strict validation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg 8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek 12ad5f65f0 ipset: drop ipset_nest_start() and ipset_nest_end()
After the previous commit, both ipset_nest_start() and ipset_nest_end() are
just aliases for nla_nest_start() and nla_nest_end() so that there is no
need to keep them.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Michal Kubecek ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
David S. Miller 8b44836583 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easy cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25 23:52:29 -04:00
Andrei Vagin d48668052b netfilter: fix nf_l4proto_log_invalid to log invalid packets
It doesn't log a packet if sysctl_log_invalid isn't equal to protonum
OR sysctl_log_invalid isn't equal to IPPROTO_RAW. This sentence is
always true. I believe we need to replace OR to AND.

Cc: Florian Westphal <fw@strlen.de>
Fixes: c4f3db1595 ("netfilter: conntrack: add and use nf_l4proto_log_invalid")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-22 10:38:50 +02:00
Florian Westphal 916f6efae6 netfilter: never get/set skb->tstamp
setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
scheduler.  skb->tstamp might be "refreshed" using ktime_get_real(),
but fq expects CLOCK_MONOTONIC.

This patch removes all places in netfilter that check/set skb->tstamp:

1. To fix the bogus "start" time seen with conntrack timestamping for
   outgoing packets, never use skb->tstamp and always use current time.
2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
   as determined by current hook (prerouting, input, forward).
3. xt_time has to use system clock as well rather than skb->tstamp.
   We could still use skb->tstamp for prerouting/input/foward, but
   I see no advantage to make this conditional.

Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-22 10:34:30 +02:00
Florian Westphal 5bdac418f3 netfilter: nat: fix icmp id randomization
Sven Auhagen reported that a 2nd ping request will fail if 'fully-random'
mode is used.

Reason is that if no proto information is given, min/max are both 0,
so we set the icmp id to 0 instead of chosing a random value between
0 and 65535.

Update test case as well to catch this, without fix this yields:
[..]
ERROR: cannot ping ns1 from ns2 with ip masquerade fully-random (attempt 2)
ERROR: cannot ping ns1 from ns2 with ipv6 masquerade fully-random (attempt 2)

... becaus 2nd ping clashes with existing 'id 0' icmp conntrack and gets
dropped.

Fixes: 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")
Reported-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:50 +02:00
Dan Carpenter 33d1c01817 netfilter: nf_tables: prevent shift wrap in nft_chain_parse_hook()
I believe that "hook->num" can be up to UINT_MAX.  Shifting more than
31 bits would is undefined in C but in practice it would lead to shift
wrapping.  That would lead to an array overflow in nf_tables_addchain():

	ops->hook       = hook.type->hooks[ops->hooknum];

Fixes: fe19c04ca1 ("netfilter: nf_tables: remove nhooks field from struct nft_af_info")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:50 +02:00
Florian Westphal 3c79107631 netfilter: ctnetlink: don't use conntrack/expect object addresses as id
else, we leak the addresses to userspace via ctnetlink events
and dumps.

Compute an ID on demand based on the immutable parts of nf_conn struct.

Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.

Fixes: 3583240249 ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
Fixes: 7f85f91472 ("[NETFILTER]: nf_conntrack: kill unique ID")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:44 +02:00
Julian Anastasov 0261ea1bd1 ipvs: do not schedule icmp errors from tunnels
We can receive ICMP errors from client or from
tunneling real server. While the former can be
scheduled to real server, the latter should
not be scheduled, they are decapsulated only when
existing connection is found.

Fixes: 6044eeffaf ("ipvs: attempt to schedule icmp packets")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Alexander Potapenko 8176c83327 netfilter: conntrack: initialize ct->timeout
KMSAN started reporting an error when accessing ct->timeout for the
first time without initialization:

 BUG: KMSAN: uninit-value in __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
 ...
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:624
 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
 __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
 nf_ct_refresh_acct ./include/net/netfilter/nf_conntrack.h:201
 nf_conntrack_udp_packet+0xb44/0x1040 net/netfilter/nf_conntrack_proto_udp.c:122
 nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1605
 nf_conntrack_in+0x1250/0x26c9 net/netfilter/nf_conntrack_core.c:1696
 ...
 Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205
 kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
 kmsan_kmalloc+0xa9/0x130 mm/kmsan/kmsan_hooks.c:173
 kmem_cache_alloc+0x554/0xb10 mm/slub.c:2789
 __nf_conntrack_alloc+0x16f/0x690 net/netfilter/nf_conntrack_core.c:1342
 init_conntrack+0x6cb/0x2490 net/netfilter/nf_conntrack_core.c:1421

Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: cc16921351 ("netfilter: conntrack: avoid same-timeout update")
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Florian Westphal 1025ce7521 netfilter: conntrack: don't set related state for different outer address
Luca Moro says:
 ------
The issue lies in the filtering of ICMP and ICMPv6 errors that include an
inner IP datagram.
For these packets, icmp_error_message() extract the ICMP error and inner
layer to search of a known state.
If a state is found the packet is tagged as related (IP_CT_RELATED).

The problem is that there is no correlation check between the inner and
outer layer of the packet.
So one can encapsulate an error with an inner layer matching a known state,
while its outer layer is directed to a filtered host.
In this case the whole packet will be tagged as related.
This has various implications from a rule bypass (if a rule to related
trafic is allow), to a known state oracle.

Unfortunately, we could not find a real statement in a RFC on how this case
should be filtered.
The closest we found is RFC5927 (Section 4.3) but it is not very clear.

A possible fix would be to check that the inner IP source is the same than
the outer destination.

We believed this kind of attack was not documented yet, so we started to
write a blog post about it.
You can find it attached to this mail (sorry for the extract quality).
It contains more technical details, PoC and discussion about the identified
behavior.
We discovered later that
https://www.gont.com.ar/papers/filtering-of-icmp-error-messages.pdf
described a similar attack concept in 2004 but without the stateful
filtering in mind.
 -----

This implements above suggested fix:
In icmp(v6) error handler, take outer destination address, then pass
that into the common function that does the "related" association.

After obtaining the nf_conn of the matching inner-headers connection,
check that the destination address of the opposite direction tuple
is the same as the outer address and only set RELATED if thats the case.

Reported-by: Luca Moro <luca.moro@synacktiv.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Florian Westphal 971502d77f bridge: netfilter: unroll NF_HOOK helper in bridge input path
Replace NF_HOOK() based invocation of the netfilter hooks with a private
copy of nf_hook_slow().

This copy has one difference: it can return the rx handler value expected
by the stack, i.e. RX_HANDLER_CONSUMED or RX_HANDLER_PASS.

This is needed by the next patch to invoke the ebtables
"broute" table via the standard netfilter hooks rather than the custom
"br_should_route_hook" indirection that is used now.

When the skb is to be "brouted", we must return RX_HANDLER_PASS from the
bridge rx input handler, but there is no way to indicate this via
NF_HOOK(), unless perhaps by some hack such as exposing bridge_cb in the
netfilter core or a percpu flag.

  text    data     bss     dec   filename
  3369      56       0    3425   net/bridge/br_input.o.before
  3458      40       0    3498   net/bridge/br_input.o.after

This allows removal of the "br_should_route_hook" in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12 01:47:39 +02:00
Florian Westphal 610a43149c netfilter: nf_nat_masquerade: unify ipv4/6 notifier registration
Only reason for having two different register functions was because of
ipt_MASQUERADE and ip6t_MASQUERADE being two different modules.

Previous patch merged those into xt_MASQUERADE, so we can merge this too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:34 +02:00
Florian Westphal adf82accc5 netfilter: x_tables: merge ip and ipv6 masquerade modules
No need to have separate modules for this.
before:
 text    data   bss    dec  filename
 2038    1168     0   3206  net/ipv4/netfilter/ipt_MASQUERADE.ko
 1526    1024     0   2550  net/ipv6/netfilter/ip6t_MASQUERADE.ko
after:
 text    data   bss    dec  filename
 2521    1296     0   3817  net/netfilter/xt_MASQUERADE.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:29 +02:00
Florian Westphal bf8981a2aa netfilter: nf_nat: merge ip/ip6 masquerade headers
Both are now implemented by nf_nat_masquerade.c, so no need to keep
different headers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:21 +02:00
Florian Westphal 3b0a081db1 netfilter: make two functions static
They have no external callers anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:28:33 +02:00
Fernando Fernandez Mancera 22c7652cda netfilter: nft_osf: Add version option support
Add version option support to the nftables "osf" expression.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:27:12 +02:00
Florian Westphal 63ce3940f3 netfilter: nft_redir: add inet support
allows to redirect both ipv4 and ipv6 with a single rule in an
inet nat table.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:03:01 +02:00
Florian Westphal 071657d2c3 netfilter: nft_masq: add inet support
This allows use of a single masquerade rule in nat inet family
to handle both ipv4 and ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:02:58 +02:00
Florian Westphal 4806e97572 netfilter: replace NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT)
NF_NAT_NEEDED is true whenever nat support for either ipv4 or ipv6 is
enabled.  Now that the af-specific nat configuration switches have been
removed, IS_ENABLED(CONFIG_NF_NAT) has the same effect.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:02:52 +02:00
Florian Westphal c1deb065cf netfilter: nf_tables: merge route type into core
very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.

Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.

Before:
   text	   data	    bss	    dec	    hex	filename
    835	    832	      0	   1667	    683 nft_chain_route_ipv4.ko
    870	    832	      0	   1702	    6a6	nft_chain_route_ipv6.ko
 111568	   2556	    529	 114653	  1bfdd	nf_tables.ko

After:
   text	   data	    bss	    dec	    hex	filename
 113133	   2556	    529	 116218	  1c5fa	nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:42 +02:00
Florian Westphal d164385ec5 netfilter: nat: add inet family nat support
We need minimal support from the nat core for this, as we do not
want to register additional base hooks.

When an inet hook is registered, interally register ipv4 and ipv6
hooks for them and unregister those when inet hooks are removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:39 +02:00
Jacky Hu 84c0d5e96f ipvs: allow tunneling with gue encapsulation
ipip packets are blocked in some public cloud environments, this patch
allows gue encapsulation with the tunneling method, which would make
tunneling working in those environments.

Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:59 +02:00
Pablo Neira Ayuso 227e1e4d0d netfilter: nf_flowtable: skip device lookup from interface index
Use the output device from the route that we cache in the flowtable
entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:02 +02:00
YueHaibing b3dfee340a netfilter: nft_redir: Make nft_redir_dump static
Fix sparse warning:

net/netfilter/nft_redir.c:85:5:
 warning: symbol 'nft_redir_dump' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:02 +02:00
Colin Ian King f7e840ee4d netfilter: nf_tables: remove unused parameter ctx
Function nf_tables_set_desc_parse parameter ctx is not being used
so remove it as it is redundant.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:56:52 +02:00
Li RongQing 3b15d09f7e time: Introduce jiffies64_to_msecs()
there is a similar helper in net/netfilter/nf_tables_api.c,
this maybe become a common request someday, so move it to
time.c

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:56:14 +02:00
NeilBrown 8f0db01800 rhashtable: use bit_spin_locks to protect hash bucket.
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
 - no need to allocate a separate array of locks.
 - no need to have a configuration option to guide the
   choice of the size of this array
 - locking cost is often a single test-and-set in a cache line
   that will have to be loaded anyway.  When inserting at, or removing
   from, the head of the chain, the unlock is free - writing the new
   address in the bucket head implicitly clears the lock bit.
   For __rhashtable_insert_fast() we ensure this always happens
   when adding a new key.
 - even when lockings costs 2 updates (lock and unlock), they are
   in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved.  This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

To enhance type checking, a new struct is introduced to represent the
  pointer plus lock-bit
that is stored in the bucket-table.  This is "struct rhash_lock_head"
and is empty.  A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".

Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head.  As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07 19:12:12 -07:00
Yi-Hung Wei 717700d183 netfilter: Export nf_ct_{set,destroy}_timeout()
This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
David S. Miller 356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-27 17:37:58 -07:00
Johannes Berg 3b0f31f2b8 genetlink: make policy common to family
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.

The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.

This reduces the size of e.g. nl80211.o (which has lots of commands):

   text	   data	    bss	    dec	    hex	filename
 398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
 397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
--------------------------------
   -832      +8       0    -824

Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.

Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
     {
    -	.policy = POLICY,
     },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
            .ops = OPS,
            .maxattr = M,
    +       .policy = POLICY,
            ...
    };

This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-22 10:38:23 -04:00
Taehee Yoo b25a31bf0c netfilter: nf_tables: add missing ->release_ops() in error path of newrule()
->release_ops() callback releases resources and this is used in error path.
If nf_tables_newrule() fails after ->select_ops(), it should release
resources. but it can not call ->destroy() because that should be called
after ->init().
At this point, ->release_ops() should be used for releasing resources.

Test commands:
   modprobe -rv xt_tcpudp
   iptables-nft -I INPUT -m tcp   <-- error command
   lsmod

Result:
   Module                  Size  Used by
   xt_tcpudp              20480  2      <-- it should be 0

Fixes: b8e2040063 ("netfilter: nft_compat: use .release_ops and remove list of extension")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-20 08:32:58 +01:00
Pablo Neira Ayuso 74710e0590 netfilter: nft_redir: fix module autoload with ip4
AF_INET4 does not exist.

Fixes: c78efc99c7 ("netfilter: nf_tables: nat: merge nft_redir protocol specific modules)"
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-18 16:22:49 +01:00
Pablo Neira Ayuso 8ffcd32f64 netfilter: nf_tables: bogus EBUSY in helper removal from transaction
Proper use counter updates when activating and deactivating the object,
otherwise, this hits bogus EBUSY error.

Fixes: cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Laura Garcia <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-18 16:22:49 +01:00
Arnd Bergmann d1fa381033 netfilter: fix NETFILTER_XT_TARGET_TEE dependencies
With NETFILTER_XT_TARGET_TEE=y and IP6_NF_IPTABLES=m, we get a link
error when referencing the NF_DUP_IPV6 module:

net/netfilter/xt_TEE.o: In function `tee_tg6':
xt_TEE.c:(.text+0x14): undefined reference to `nf_dup_ipv6'

The problem here is the 'select NF_DUP_IPV6 if IP6_NF_IPTABLES'
that forces NF_DUP_IPV6 to be =m as well rather than setting it
to =y as was intended here. Adding a soft dependency on
IP6_NF_IPTABLES avoids that broken configuration.

Fixes: 5d400a4933 ("netfilter: Kconfig: Change select IPv6 dependencies")
Cc: Máté Eckl <ecklm94@gmail.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Link: https://patchwork.ozlabs.org/patch/999498/
Link: https://lore.kernel.org/patchwork/patch/960062/
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-18 16:21:54 +01:00
Pablo Neira Ayuso 05b7639da5 netfilter: nft_set_rbtree: check for inactive element after flag mismatch
Otherwise, we hit bogus ENOENT when removing elements.

Fixes: e701001e7c ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates")
Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-18 16:21:09 +01:00
Alin Nastac 29b0b5d565 netfilter: nf_conntrack_sip: remove direct dependency on IPv6
Previous implementation was not usable with CONFIG_IPV6=m.

Fixes: a3419ce335 ("netfilter: nf_conntrack_sip: add sip_external_media logic")
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-18 16:11:54 +01:00
Florian Westphal b8b2749865 netfilter: nf_tables: return immediately on empty commit
When running 'nft flush ruleset' while no rules exist, we will increment
the generation counter and announce a new genid to userspace, yet
nothing had changed in the first place.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-11 20:01:20 +01:00
Pablo Neira Ayuso 3f3a390dbd netfilter: nf_tables: use-after-free in dynamic operations
Smatch reports:

       net/netfilter/nf_tables_api.c:2167 nf_tables_expr_destroy()
        error: dereferencing freed memory 'expr->ops'

net/netfilter/nf_tables_api.c
    2162 static void nf_tables_expr_destroy(const struct nft_ctx *ctx,
    2163                                   struct nft_expr *expr)
    2164 {
    2165        if (expr->ops->destroy)
    2166                expr->ops->destroy(ctx, expr);
                                                ^^^^
--> 2167        module_put(expr->ops->type->owner);
                           ^^^^^^^^^
    2168 }

Smatch says there are three functions which free expr->ops.

Fixes: b8e2040063 ("netfilter: nft_compat: use .release_ops and remove list of extension")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-11 13:19:49 +01:00
Pablo Neira Ayuso 273fe3f100 netfilter: nf_tables: bogus EBUSY when deleting set after flush
Set deletion after flush coming in the same batch results in EBUSY. Add
set use counter to track the number of references to this set from
rules. We cannot rely on the list of bindings for this since such list
is still populated from the preparation phase.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-11 13:19:24 +01:00
Pablo Neira Ayuso 40ba1d9b4d netfilter: nf_tables: fix set double-free in abort path
The abort path can cause a double-free of an anonymous set.
Added-and-to-be-aborted rule looks like this:

udp dport { 137, 138 } drop

The to-be-aborted transaction list looks like this:

newset
newsetelem
newsetelem
rule

This gets walked in reverse order, so first pass disables the rule, the
set elements, then the set.

After synchronize_rcu(), we then destroy those in same order: rule, set
element, set element, newset.

Problem is that the anonymous set has already been bound to the rule, so
the rule (lookup expression destructor) already frees the set, when then
cause use-after-free when trying to delete the elements from this set,
then try to free the set again when handling the newset expression.

Rule releases the bound set in first place from the abort path, this
causes the use-after-free on set element removal when undoing the new
element transactions. To handle this, skip new element transaction if
set is bound from the abort path.

This is still causes the use-after-free on set element removal.  To
handle this, remove transaction from the list when the set is already
bound.

Joint work with Florian Westphal.

Fixes: f6ac858589 ("netfilter: nf_tables: unbind set in rule from commit path")
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-08 16:41:18 +01:00
Florian Westphal 46f7487e16 netfilter: nat: don't register device notifier twice
Otherwise, we get notifier list corruption.

This is the most simple fix: remove the device notifier call chain
from the ipv6 masquerade register function and handle it only
in the ipv4 version.

The better fix is merge
nf_nat_masquerade_ipv4/6_(un)register_notifier
  into a single
nf_nat_masquerade_(un)register_notifiers

but to do this its needed to first merge the two masquerade modules
into a single xt_MASQUERADE.

Furthermore, we need to use different refcounts for ipv4/ipv6
until we can merge MASQUERADE.

Fixes: d1aca8ab31 ("netfilter: nat: merge ipv4 and ipv6 masquerade functionality")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-08 16:41:09 +01:00
Florian Westphal db8ab38880 netfilter: nf_tables: merge ipv4 and ipv6 nat chain types
Merge the ipv4 and ipv6 nat chain type. This is the last
missing piece which allows to provide inet family support
for nat in a follow patch.

The kconfig knobs for ipv4/ipv6 nat chain are removed, the
nat chain type will be built unconditionally if NFT_NAT
expression is enabled.

Before:
   text	   data	    bss	    dec	    hex	filename
   1576     896       0    2472     9a8 nft_chain_nat_ipv4.ko
   1697     896       0    2593     a21 nft_chain_nat_ipv6.ko

After:
   text	   data	    bss	    dec	    hex	filename
   1832     896       0    2728     aa8 nft_chain_nat.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:59 +01:00
Florian Westphal a9ce849e78 netfilter: nf_tables: nat: merge nft_masq protocol specific modules
The family specific masq modules are way too small to warrant
an extra module, just place all of them in nft_masq.

before:
  text	   data	    bss	    dec	    hex	filename
   1001	    832	      0	   1833	    729	nft_masq.ko
    766	    896	      0	   1662	    67e	nft_masq_ipv4.ko
    764	    896	      0	   1660	    67c	nft_masq_ipv6.ko

after:
   2010	    960	      0	   2970	    b9a	nft_masq.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:59 +01:00
Florian Westphal c78efc99c7 netfilter: nf_tables: nat: merge nft_redir protocol specific modules
before:
 text	   data	    bss	    dec	    hex	filename
 990	    832	      0	   1822	    71e nft_redir.ko
 697	    896	      0	   1593	    639 nft_redir_ipv4.ko
 713	    896	      0	   1609	    649	nft_redir_ipv6.ko

after:
 text	   data	    bss	    dec	    hex	filename
 1910	    960	      0	   2870	    b36	nft_redir.ko

size is reduced, all helpers from nft_redir.ko can be made static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:58 +01:00
Sami Tolvanen 20fdaf6e1e netfilter: xt_IDLETIMER: fix sysfs callback function type
Use struct device_attribute instead of struct idletimer_tg_attr, and
the correct callback function type to avoid indirect call mismatches
with Control Flow Integrity checking.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:57 +01:00
Li RongQing 2e7b162c5e netfilter: nf_conntrack: ensure that CONNTRACK_LOCKS is power of 2
CONNTRACK_LOCKS is divisor when computer array index, if it is power of
2, compiler will optimize modulo operation as bitwise AND, or else
modulo will lower performance.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:46 +01:00
Li RongQing a9f5e78c40 netfilter: nf_tables: check the result of dereferencing base_chain->stats
Check the result of dereferencing base_chain->stats, instead of result
of this_cpu_ptr with NULL.

base_chain->stats maybe be changed to NULL when a chain is updated and a
new NULL counter can be attached.

And we do not need to check returning of this_cpu_ptr since
base_chain->stats is from percpu allocator if it is non-NULL,
this_cpu_ptr returns a valid value.

And fix two sparse error by replacing rcu_access_pointer and
rcu_dereference with READ_ONCE under rcu_read_lock.

Thanks for Eric's help to finish this patch.

Fixes: 009240940e ("netfilter: nf_tables: don't assume chain stats are set when jumplabel is set")
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:34:24 +01:00
Xin Long f52a40fb41 ipvs: get sctphdr by sctphoff in sctp_csum_check
sctp_csum_check() is called by sctp_s/dnat_handler() where it calls
skb_make_writable() to ensure sctphdr to be linearized.

So there's no need to get sctphdr by calling skb_header_pointer()
in sctp_csum_check().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:28:44 +01:00
Li RongQing 11d4dd0b20 netfilter: convert the proto argument from u8 to u16
The proto in struct xt_match and struct xt_target is u16, when
calling xt_check_target/match, their proto argument is u8,
and will cause truncation, it is harmless to ip packet, since
ip proto is u8

if a etable's match/target has proto that is u16, will cause
the check failure.

and convert be16 to short in bridge/netfilter/ebtables.c

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:28:43 +01:00
wenxu 3e511d5652 netfilter: nft_tunnel: Add dst_cache support
The metadata_dst does not initialize the dst_cache field, this causes
problems to ip_md_tunnel_xmit() since it cannot use this cache, hence,
Triggering a route lookup for every packet.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:25:06 +01:00
Florian Westphal be0502a3f2 netfilter: conntrack: tcp: only close if RST matches exact sequence
TCP resets cause instant transition from established to closed state
provided the reset is in-window.  Endpoints that implement RFC 5961
require resets to match the next expected sequence number.
RST segments that are in-window (but that do not match RCV.NXT) are
ignored, and a "challenge ACK" is sent back.

Main problem for conntrack is that its a middlebox, i.e.  whereas an end
host might have ACK'd SEQ (and would thus accept an RST with this
sequence number), conntrack might not have seen this ACK (yet).

Therefore we can't simply flag RSTs with non-exact match as invalid.

This updates RST processing as follows:

1. If the connection is in a state other than ESTABLISHED, nothing is
   changed, RST is subject to normal in-window check.

2. If the RSTs sequence number either matches exactly RCV.NXT,
   connection state moves to CLOSE.

3. The same applies if the RST sequence number aligns with a previous
   packet in the same direction.

In all other cases, the connection remains in ESTABLISHED state.
If the normal-in-window check passes, the timeout will be lowered
to that of CLOSE.

If the peer sends a challenge ack, connection timeout will be reset.

If the challenge ACK triggers another RST (RST was valid after all),
this 2nd RST will match expected sequence and conntrack state changes to
CLOSE.

If no challenge ACK is received, the connection will time out after
CLOSE seconds (10 seconds by default), just like without this patch.

Packetdrill test case:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive a segment.
0.210 < P. 1:1001(1000) ack 1 win 46
0.210 > . 1:1(0) ack 1001

// Application writes 1000 bytes.
0.250 write(4, ..., 1000) = 1000
0.250 > P. 1:1001(1000) ack 1001

// First reset, old sequence. Conntrack (correctly) considers this
// invalid due to failed window validation (regardless of this patch).
0.260 < R  2:2(0) ack 1001 win 260

// 2nd reset, but too far ahead sequence.  Same: correctly handled
// as invalid.
0.270 < R 99990001:99990001(0) ack 1001 win 260

// in-window, but not exact sequence.
// Current Linux kernels might reply with a challenge ack, and do not
// remove connection.
// Without this patch, conntrack state moves to CLOSE.
// With patch, timeout is lowered like CLOSE, but connection stays
// in ESTABLISHED state.
0.280 < R 1010:1010(0) ack 1001 win 260

// Expect challenge ACK
0.281 > . 1001:1001(0) ack 1001 win 501

// With or without this patch, RST will cause connection
// to move to CLOSE (sequence number matches)
// 0.282 < R 1001:1001(0) ack 1001 win 260

// ACK
0.300 < . 1001:1001(0) ack 1001 win 257

// more data could be exchanged here, connection
// is still established

// Client closes the connection.
0.610 < F. 1001:1001(0) ack 1001 win 260
0.650 > . 1001:1001(0) ack 1002

// Close the connection without reading outstanding data
0.700 close(4) = 0

// so one more reset.  Will be deemed acceptable with patch as well:
// connection is already closing.
0.701 > R. 1001:1001(0) ack 1002 win 501
// End packetdrill test case.

With patch, this generates following conntrack events:
   [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 120 FIN_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 60 CLOSE_WAIT src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5437 dport=80 [ASSURED]

Without patch, first RST moves connection to close, whereas socket state
does not change until FIN is received.
   [NEW] 120 SYN_SENT src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [UNREPLIED]
[UPDATE] 60 SYN_RECV src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80
[UPDATE] 432000 ESTABLISHED src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]
[UPDATE] 10 CLOSE src=10.0.2.1 dst=10.0.0.1 sport=5141 dport=80 [ASSURED]

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:19:31 +01:00
Andrea Claudi f25a9b8515 ipvs: change some data types from int to bool
Change the data type of the following variables from int to bool
across ipvs code:

  - found
  - loop
  - need_full_dest
  - need_full_svc
  - payload_csum

Also change the following functions to use bool full_entry param
instead of int:

  - ip_vs_genl_parse_dest()
  - ip_vs_genl_parse_service()

This patch does not change any functionality but makes the source
code slightly easier to read.

Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:19:04 +01:00
Pablo Neira Ayuso 123f89c8aa netfilter: nft_set_hash: remove nft_hash_key()
hashtable is never used for 2-byte keys, remove nft_hash_key().

Fixes: e240cd0df4 ("netfilter: nf_tables: place all set backends in one single module")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:32 +01:00
Pablo Neira Ayuso a01cbae57e netfilter: nft_set_hash: bogus element self comparison from deactivation path
Use the element from the loop iteration, not the same element we want to
deactivate otherwise this branch always evaluates true.

Fixes: 6c03ae210c ("netfilter: nft_set_hash: add non-resizable hashtable implementation")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:31 +01:00
Pablo Neira Ayuso 3b02b0adc2 netfilter: nft_set_hash: fix lookups with fixed size hash on big endian
Call jhash_1word() for the 4-bytes key case from the insertion and
deactivation path, otherwise big endian arch set lookups fail.

Fixes: 446a8268b7 ("netfilter: nft_set_hash: add lookup variant for fixed size hashtable")
Reported-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:08:31 +01:00
Li RongQing 35acfbab6e netfilter: remove unneeded switch fall-through
Empty case is fine and does not switch fall-through

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 11:03:59 +01:00
Florian Westphal cc16921351 netfilter: conntrack: avoid same-timeout update
No need to dirty a cache line if timeout is unchanged.
Also, WARN() is useless here: we crash on 'skb->len' access
if skb is NULL.

Last, ct->timeout is u32, not 'unsigned long' so adapt the
function prototype accordingly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:58:21 +01:00
Florian Westphal d2c5c103b1 netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.h
The l3proto name is gone, its header file is the last trace.
While at it, also remove nf_nat_core.h, its very small and all users
include nf_nat.h too.

before:
   text    data     bss     dec     hex filename
  22948    1612    4136   28696    7018 nf_nat.ko

after removal of l3proto register/unregister functions:
   text	   data	    bss	    dec	    hex	filename
  22196	   1516	   4136	  27848	   6cc8 nf_nat.ko

checkpatch complains about overly long lines, but line breaks
do not make things more readable and the line length gets smaller
here, not larger.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:54:08 +01:00
Florian Westphal d6c4c8ffb5 netfilter: nat: remove l3proto struct
All l3proto function pointers have been removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:57 +01:00
Florian Westphal dac3fe7259 netfilter: nat: remove csum_recalc hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:47 +01:00
Florian Westphal 03fe5efc4c netfilter: nat: remove csum_update hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:35 +01:00
Florian Westphal 2e666b229d netfilter: nat: remove l3 manip_pkt hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:05 +01:00
Florian Westphal 14cb1a6e29 netfilter: nat: remove nf_nat_l4proto.h
after ipv4/6 nat tracker merge, there are no external callers, so
make last function static and remove the header.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:52:47 +01:00
Florian Westphal 3bf195ae60 netfilter: nat: merge nf_nat_ipv4,6 into nat core
before:
   text    data     bss     dec     hex filename
  16566    1576    4136   22278    5706 nf_nat.ko
   3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko
   3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko

after:
   text    data     bss     dec     hex filename
  22948    1612    4136   28696    7018 nf_nat.ko

... with ipv4/v6 nat now provided directly via nf_nat.ko.

Also changes:
       ret = nf_nat_ipv4_fn(priv, skb, state);
       if (ret != NF_DROP && ret != NF_STOLEN &&
into
	if (ret != NF_ACCEPT)
		return ret;

everywhere.

The nat hooks never should return anything other than
ACCEPT or DROP (and the latter only in rare error cases).

The original code uses multi-line ANDing including assignment-in-if:
        if (ret != NF_DROP && ret != NF_STOLEN &&
           !(IPCB(skb)->flags & IPSKB_XFRM_TRANSFORMED) &&
            (ct = nf_ct_get(skb, &ctinfo)) != NULL) {

I removed this while moving, breaking those in separate conditionals
and moving the assignments into extra lines.

checkpatch still generates some warnings:
 1. Overly long lines (of moved code).
    Breaking them is even more ugly. so I kept this as-is.
 2. use of extern function declarations in a .c file.
    This is necessary evil, we must call
    nf_nat_l3proto_register() from the nat core now.
    All l3proto related functions are removed later in this series,
    those prototypes are then removed as well.

v2: keep empty nf_nat_ipv6_csum_update stub for CONFIG_IPV6=n case.
v3: remove IS_ENABLED(NF_NAT_IPV4/6) tests, NF_NAT_IPVx toggles
    are removed here.
v4: also get rid of the assignments in conditionals.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:55 +01:00
Florian Westphal 096d09067a netfilter: nat: move nlattr parse and xfrm session decode to core
None of these functions calls any external functions, moving them allows
to avoid both the indirection and a need to export these symbols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:42 +01:00
Florian Westphal d1aca8ab31 netfilter: nat: merge ipv4 and ipv6 masquerade functionality
Before:
   text	   data	    bss	    dec	    hex	filename
  13916	   1412	   4128	  19456	   4c00	nf_nat.ko
   4510	    968	      4	   5482	   156a	nf_nat_ipv4.ko
   5146	    944	      8	   6098	   17d2	nf_nat_ipv6.ko

After:
   text	   data	    bss	    dec	    hex	filename
  16566	   1576	   4136	  22278	   5706	nf_nat.ko
   3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko
   3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko

... so no drastic changes in combined size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:24 +01:00
Florian Tham 4283428e49 netfilter: nf_conntrack_amanda: add support for STATE streams
The Amanda CONNECT command has been updated to establish an optional
fourth connection [0]. Previously, a CONNECT command would look like:

    CONNECT DATA port0 MESG port1 INDEX port2

nf_conntrack_amanda analyses the CONNECT command string in order to
learn the port numbers of the related DATA, MESG and INDEX streams. As
of amanda v3.4, the CONNECT command can advertise an additional port:

    CONNECT DATA port0 MESG port1 INDEX port2 STATE port3

The new STATE stream is not handled, thus the connection on the STATE
port cannot be established.

The patch adds support for STATE streams to the amanda conntrack helper.

I tested with max_expected = 3, leaving the other patch hunks
unmodified. Amanda reports "connection refused" and aborts. After I set
max_expected to 4, the backup completes successfully.

[0] 3b8384fc9f (diff-711e502fc81a65182c0954765b42919eR456)

Signed-off-by: Florian Tham <tham@fidion.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:46:39 +01:00
Pablo Neira Ayuso b8e2040063 netfilter: nft_compat: use .release_ops and remove list of extension
Add .release_ops, that is called in case of error at a later stage in
the expression initialization path, ie. .select_ops() has been already
set up operations and that needs to be undone. This allows us to unwind
.select_ops from the error path, ie. release the dynamic operations for
this extension.

Moreover, allocate one single operation instead of recycling them, this
comes at the cost of consuming a bit more memory per rule, but it
simplifies the infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:41:24 +01:00
David S. Miller 375ca548f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easily resolvable overlapping change conflicts, one in
TCP and one in the eBPF verifier.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20 00:34:07 -08:00
David S. Miller 8bbed40f10 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for you net-next
tree:

1) Missing NFTA_RULE_POSITION_ID netlink attribute validation,
   from Phil Sutter.

2) Restrict matching on tunnel metadata to rx/tx path, from wenxu.

3) Avoid indirect calls for IPV6=y, from Florian Westphal.

4) Add two indirections to prepare merger of IPV4 and IPV6 nat
   modules, from Florian Westphal.

5) Broken indentation in ctnetlink, from Colin Ian King.

6) Patches to use struct_size() from netfilter and IPVS,
   from Gustavo A. R. Silva.

7) Display kernel splat only once in case of racing to confirm
   conntrack from bridge plus nfqueue setups, from Chieh-Min Wang.

8) Skip checksum validation for layer 4 protocols that don't need it,
   patch from Alin Nastac.

9) Sparse warning due to symbol that should be static in CLUSTERIP,
   from Wei Yongjun.

10) Add new toggle to disable SDP payload translation when media
    endpoint is reachable though the same interface as the signalling
    peer, from Alin Nastac.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-18 11:38:30 -08:00
Alin Nastac a3419ce335 netfilter: nf_conntrack_sip: add sip_external_media logic
When enabled, the sip_external_media logic will leave SDP
payload untouched when it detects that interface towards INVITEd
party is the same with the one towards media endpoint.

The typical scenario for this logic is when a LAN SIP agent has more
than one IP address (uses a different address for media streams than
the one used on signalling stream) and it also forwards calls to a
voice mailbox located on the WAN side. In such case sip_direct_media
must be disabled (so normal calls could be handled by the SIP
helper), but media streams that are not traversing this router must
also be excluded from address translation (e.g. call forwards).

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16 10:49:12 +01:00
Andrea Claudi c93a49b976 ipvs: fix warning on unused variable
When CONFIG_IP_VS_IPV6 is not defined, build produced this warning:

net/netfilter/ipvs/ip_vs_ctl.c:899:6: warning: unused variable ‘ret’ [-Wunused-variable]
  int ret = 0;
      ^~~

Fix this by moving the declaration of 'ret' in the CONFIG_IP_VS_IPV6
section in the same function.

While at it, drop its unneeded initialisation.

Fixes: 098e13f5b2 ("ipvs: fix dependency on nf_defrag_ipv6")
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-16 10:41:42 +01:00
David S. Miller 3313da8188 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The netfilter conflicts were rather simple overlapping
changes.

However, the cls_tcindex.c stuff was a bit more complex.

On the 'net' side, Cong is fixing several races and memory
leaks.  Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.

What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU.  I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15 12:38:38 -08:00
Pablo Neira Ayuso 23b7ca4f74 netfilter: nf_tables: fix flush after rule deletion in the same batch
Flush after rule deletion bogusly hits -ENOENT. Skip rules that have
been already from nft_delrule_by_chain() which is always called from the
flush path.

Fixes: cf9dc09d09 ("netfilter: nf_tables: fix missing rules flushing per table")
Reported-by: Phil Sutter <phil@nwl.cc>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-15 15:32:32 +01:00
Pablo Neira Ayuso 753c111f65 netfilter: nft_compat: use-after-free when deleting targets
Fetch pointer to module before target object is released.

Fixes: 29e3880109 ("netfilter: nf_tables: fix use-after-free when deleting compat expressions")
Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-13 18:14:54 +01:00
Andrea Claudi 098e13f5b2 ipvs: fix dependency on nf_defrag_ipv6
ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
but lacks proper Kconfig dependencies and does not explicitly
request defrag features.

As a result, if netfilter hooks are not loaded, when IPv6 fragmented
packet are handled by ipvs only the first fragment makes through.

Fix it properly declaring the dependency on Kconfig and registering
netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().

Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:24:01 +01:00
Chieh-Min Wang 13f5251fd1 netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
For bridge(br_flood) or broadcast/multicast packets, they could clone
skb with unconfirmed conntrack which break the rule that unconfirmed
skb->_nfct is never shared.  With nfqueue running on my system, the race
can be easily reproduced with following warning calltrace:

[13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
[13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
[13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
[13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
[13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
[13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
[13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
[13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
[13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
[13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
[13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
[13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
[13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
[13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
[13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
[13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
[13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
[13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
[13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
[13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
[13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)

The original code just triggered the warning but do nothing. It will
caused the shared conntrack moves to the dying list and the packet be
droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

- Reproduce steps:

+----------------------------+
|          br0(bridge)       |
|                            |
+-+---------+---------+------+
  | eth0|   | eth1|   | eth2|
  |     |   |     |   |     |
  +--+--+   +--+--+   +---+-+
     |         |          |
     |         |          |
  +--+-+     +-+--+    +--+-+
  | PC1|     | PC2|    | PC3|
  +----+     +----+    +----+

iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

ps: Our nfq userspace program will set mark on packets whose connection
has already been processed.

PC1 sends broadcast packets simulated by hping3:

hping3 --rand-source --udp 192.168.1.255 -i u100

- Broadcast racing flow chart is as follow:

br_handle_frame
  BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
  // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
  br_handle_frame_finish
    // check if this packet is broadcast
    br_flood_forward
      br_flood
        list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
          maybe_deliver
            deliver_clone
              skb = skb_clone(skb)
              __br_forward
                BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                // queue in our nfq and received by our userspace program
                // goto __nf_conntrack_confirm with process context on CPU 1
    br_pass_frame_up
      BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
      // goto __nf_conntrack_confirm with softirq context on CPU 0

Because conntrack confirm can happen at both INPUT and POSTROUTING
stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
conntrack could race on different core.

This patch fixes a repeating kernel splat, now it is only displayed
once.

Signed-off-by: Chieh-Min Wang <chiehminw@synology.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:14:51 +01:00
Gustavo A. R. Silva 6ca64ef37d netfilter: xt_recent: Use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

size = sizeof(struct foo) + count * sizeof(void *);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);
instance = alloc(size, GFP_KERNEL)

Notice that, in this case, variable sz is not necessary, hence it is
removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Gustavo A. R. Silva 6fde9df6b7 ipvs: Use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Colin Ian King 48ab807c79 netfilter: conntrack: fix indentation issue
A statement in an if block is not indented correctly. Fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Francesco Ruggeri 8d29d16d21 netfilter: compat: initialize all fields in xt_init
If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
time, the following panic can be caused by running

% ebtables -t broute -F BROUTING

from a 32-bit user level on a 64-bit kernel. This patch replaces
kmalloc_array with kcalloc when allocating xt.

[  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
[  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
[  474.693838] Oops: 0000 [#1] SMP
[  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
[  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
[  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
[  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
[  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
[  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
[  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
[  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
[  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
[  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
[  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
[  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
[  474.802052] Call Trace:
[  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
[  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
[  474.815605]  ? try_module_get+0x37/0x42
[  474.819716]  compat_nf_setsockopt+0x4f/0x6d
[  474.824172]  compat_ip_setsockopt+0x7e/0x8c
[  474.828641]  compat_raw_setsockopt+0x16/0x3a
[  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
[  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
[  474.843343]  ? __check_object_size+0x76/0x19a
[  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
[  474.853276]  do_fast_syscall_32+0xaf/0xf6
[  474.857548]  entry_SYSENTER_compat+0x6b/0x7a

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:35:18 +01:00
David S. Miller a655fe9f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:00:17 -08:00
Florian Westphal 947e492c0f netfilter: nft_compat: don't use refcount_inc on newly allocated entry
When I moved the refcount to refcount_t type I missed the fact that
refcount_inc() will result in use-after-free warning with
CONFIG_REFCOUNT_FULL=y builds.

The correct fix would be to init the reference count to 1 at allocation
time, but, unfortunately we cannot do this, as we can't undo that
in case something else fails later in the batch.

So only solution I see is to special-case the 'new entry' condition
and replace refcount_inc() with a "delayed" refcount_set(1) in this case,
as done here.

The .activate callback can be removed to simplify things, we only
need to make sure that deactivate() decrements/unlinks the entry
from the list at end of transaction phase (commit or abort).

Fixes: 12c44aba66 ("netfilter: nft_compat: use refcnt_t type for nft_xt reference count")
Reported-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05 14:10:33 +01:00
Florian Westphal ac02bcf9cc netfilter: ipv6: avoid indirect calls for IPV6=y case
indirect calls are only needed if ipv6 is a module.
Add helpers to abstract the v6ops indirections and use them instead.

fragment, reroute and route_input are kept as indirect calls.
The first two are not not used in hot path and route_input is only
used by bridge netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 18:21:12 +01:00
Pablo Neira Ayuso f6ac858589 netfilter: nf_tables: unbind set in rule from commit path
Anonymous sets that are bound to rules from the same transaction trigger
a kernel splat from the abort path due to double set list removal and
double free.

This patch updates the logic to search for the transaction that is
responsible for creating the set and disable the set list removal and
release, given the rule is now responsible for this. Lookup is reverse
since the transaction that adds the set is likely to be at the tail of
the list.

Moreover, this patch adds the unbind step to deliver the event from the
commit path.  This should not be done from the worker thread, since we
have no guarantees of in-order delivery to the listener.

This patch removes the assumption that both activate and deactivate
callbacks need to be provided.

Fixes: cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 17:29:17 +01:00
wenxu a46c52d9f2 netfilter: nft_tunnel: Add NFTA_TUNNEL_MODE options
nft "tunnel" expr match both the tun_info of RX and TX. This patch
provide the NFTA_TUNNEL_MODE to individually match the tun_info of
RX or TX.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:39:58 +01:00
Martynas Pumputis 4e35c1cb94 netfilter: nf_nat: skip nat clash resolution for same-origin entries
It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.

To illustrate this, consider the following simplified scenario:

1. Packet A and B are sent at the same time from two different threads
   by same UDP socket.  No matching conntrack entry exists yet.
   Both packets cause allocation of a new conntrack entry.
2. get_unique_tuple gets called for A.  No clashing entry found.
   conntrack entry for A is added to main conntrack table.
3. get_unique_tuple is called for B and will find that the reply
   tuple of B is already taken by A.
   It will allocate a new UDP source port for B to resolve the clash.
4. conntrack entry for B cannot be added to main conntrack table
   because its ORIGINAL direction is clashing with A and the REPLY
   directions of A and B are not the same anymore due to UDP source
   port reallocation done in step 3.

This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.

[ Florian: simplify patch to not use .allow_clash setting
  and always ignore identical flows ]

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:26:11 +01:00
David S. Miller eaf2a47f40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-29 21:18:54 -08:00
Florian Westphal 0604628bb0 netfilter: nf_tables: add NFTA_RULE_POSITION_ID to nla_policy
Fixes: 75dd48e2e4 ("netfilter: nf_tables: Support RULE_ID reference in new rule")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-29 16:59:06 +01:00
David S. Miller 343917b410 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree:

1) Introduce a hashtable to speed up object lookups, from Florian Westphal.

2) Make direct calls to built-in extension, also from Florian.

3) Call helper before confirming the conntrack as it used to be originally,
   from Florian.

4) Call request_module() to autoload br_netfilter when physdev is used
   to relax the dependency, also from Florian.

5) Allow to insert rules at a given position ID that is internal to the
   batch, from Phil Sutter.

6) Several patches to replace conntrack indirections by direct calls,
   and to reduce modularization, from Florian. This also includes
   several follow up patches to deal with minor fallout from this
   rework.

7) Use RCU from conntrack gre helper, from Florian.

8) GRE conntrack module becomes built-in into nf_conntrack, from Florian.

9) Replace nf_ct_invert_tuplepr() by calls to nf_ct_invert_tuple(),
   from Florian.

10) Unify sysctl handling at the core of nf_conntrack, from Florian.

11) Provide modparam to register conntrack hooks.

12) Allow to match on the interface kind string, from wenxu.

13) Remove several exported symbols, not required anymore now after
    a bit of de-modulatization work has been done, from Florian.

14) Remove built-in map support in the hash extension, this can be
    done with the existing userspace infrastructure, from laura.

15) Remove indirection to calculate checksums in IPVS, from Matteo Croce.

16) Use call wrappers for indirection in IPVS, also from Matteo.

17) Remove superfluous __percpu parameter in nft_counter, patch from
    Luc Van Oostenryck.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 17:34:38 -08:00
Florian Westphal 83f529281d netfilter: ipv4: remove useless export_symbol
Only one caller; place it where needed and get rid of the EXPORT_SYMBOL.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:58 +01:00
Cong Wang ac088a88b5 netfilter: conntrack: fix error path in nf_conntrack_pernet_init()
When nf_ct_netns_get() fails, it should clean up itself,
its caller doesn't need to call nf_conntrack_fini_net().

nf_conntrack_init_net() is called after registering sysctl
and proc, so its cleanup function should be called before
unregistering sysctl and proc.

Fixes: ba3fbe6636 ("netfilter: nf_conntrack: provide modparam to always register conntrack hooks")
Fixes: b884fa4617 ("netfilter: conntrack: unify sysctl handling")
Reported-and-tested-by: syzbot+fcee88b2d87f0539dfe9@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:55 +01:00
Luc Van Oostenryck dd03b1ad26 netfilter: nft_counter: remove wrong __percpu of nft_counter_resest()'s arg
nft_counter_rest() has its first argument declared as
	struct nft_counter_percpu_priv __percpu *priv
but this structure is not percpu (it only countains
a member 'counter' which is, correctly, a pointer to a
percpu struct nft_counter).

So, remove the '__percpu' from the argument's declaration.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:32:43 +01:00
Matteo Croce 6ecd754883 ipvs: use indirect call wrappers
Use the new indirect call wrappers in IPVS when calling the TCP or UDP
protocol specific functions.
This avoids an indirect calls in IPVS, and reduces the performance
impact of the Spectre mitigation.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:15:58 +01:00
Matteo Croce fe19a8fea7 ipvs: avoid indirect calls when calculating checksums
The function pointer ip_vs_protocol->csum_check is only used in protocol
specific code, and never in the generic one.
Remove the function pointer from struct ip_vs_protocol and call the
checksum functions directly.
This reduces the performance impact of the Spectre mitigation, and
should give a small improvement even with RETPOLINES disabled.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:15:58 +01:00
Fernando Fernandez Mancera 1a6a0951fc netfilter: nfnetlink_osf: add missing fmatch check
When we check the tcp options of a packet and it doesn't match the current
fingerprint, the tcp packet option pointer must be restored to its initial
value in order to do the proper tcp options check for the next fingerprint.

Here we can see an example.
Assumming the following fingerprint base with two lines:

S10:64:1:60:M*,S,T,N,W6:      Linux:3.0::Linux 3.0
S20:64:1:60:M*,S,T,N,W7:      Linux:4.19:arch:Linux 4.1

Where TCP options are the last field in the OS signature, all of them overlap
except by the last one, ie. 'W6' versus 'W7'.

In case a packet for Linux 4.19 kicks in, the osf finds no matching because the
TCP options pointer is updated after checking for the TCP options in the first
line.

Therefore, reset pointer back to where it should be.

Fixes: 11eeef41d5 ("netfilter: passive OS fingerprint xtables match")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:09:11 +01:00
ZhangXiaoxu 53ab60baa1 ipvs: Fix signed integer overflow when setsockopt timeout
There is a UBSAN bug report as below:
UBSAN: Undefined behaviour in net/netfilter/ipvs/ip_vs_ctl.c:2227:21
signed integer overflow:
-2147483647 * 1000 cannot be represented in type 'int'

Reproduce program:
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/socket.h>

	#define IPPROTO_IP 0
	#define IPPROTO_RAW 255

	#define IP_VS_BASE_CTL		(64+1024+64)
	#define IP_VS_SO_SET_TIMEOUT	(IP_VS_BASE_CTL+10)

	/* The argument to IP_VS_SO_GET_TIMEOUT */
	struct ipvs_timeout_t {
		int tcp_timeout;
		int tcp_fin_timeout;
		int udp_timeout;
	};

	int main() {
		int ret = -1;
		int sockfd = -1;
		struct ipvs_timeout_t to;

		sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
		if (sockfd == -1) {
			printf("socket init error\n");
			return -1;
		}

		to.tcp_timeout = -2147483647;
		to.tcp_fin_timeout = -2147483647;
		to.udp_timeout = -2147483647;

		ret = setsockopt(sockfd,
				 IPPROTO_IP,
				 IP_VS_SO_SET_TIMEOUT,
				 (char *)(&to),
				 sizeof(to));

		printf("setsockopt return %d\n", ret);
		return ret;
	}

Return -EINVAL if the timeout value is negative or max than 'INT_MAX / HZ'.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-24 13:38:54 +01:00
Florian Westphal e2f7cc72cb netfilter: conntrack: fix bogus port values for other l4 protocols
We must only extract l4 proto information if we can track the layer 4
protocol.

Before removal of pkt_to_tuple callback, the code to extract port
information was only reached for TCP/UDP/LITE/DCCP/SCTP.

The other protocols were handled by the indirect call, and the
'generic' tracker took care of other protocols that have no notion
of 'ports'.

After removal of the callback we must be more strict here and only
init port numbers for those protocols that have ports.

Fixes: df5e162908 ("netfilter: conntrack: remove pkt_to_tuple callback")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22 23:28:40 +01:00
Florian Westphal 81e01647fd netfilter: conntrack: fix IPV6=n builds
Stephen Rothwell reports:
 After merging the netfilter-next tree, today's linux-next build
 (powerpc ppc64_defconfig) failed like this:

 ERROR: "nf_conntrack_invert_icmpv6_tuple" [nf_conntrack.ko] undefined!
 ERROR: "nf_conntrack_icmpv6_packet" [nf_conntrack.ko] undefined!
 ERROR: "nf_conntrack_icmpv6_init_net" [nf_conntrack.ko] undefined!
 ERROR: "icmpv6_pkt_to_tuple" [nf_conntrack.ko] undefined!
 ERROR: "nf_ct_gre_keymap_destroy" [nf_conntrack.ko] undefined!

icmpv6 related errors are due to lack of IS_ENABLED(CONFIG_IPV6) (no
icmpv6 support is builtin if kernel has CONFIG_IPV6=n), the
nf_ct_gre_keymap_destroy error is due to lack of PROTO_GRE check.

Fixes: a47c540481 ("netfilter: conntrack: handle builtin l4proto packet functions via direct calls")
Fixes: e2e48b4716 ("netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls")
Fixes: 197c4300ae ("netfilter: conntrack: remove invert_tuple callback")
Fixes: 2a389de86e ("netfilter: conntrack: remove l4proto init and get_net callbacks")
Fixes: e56894356f ("netfilter: conntrack: remove l4proto destroy hook")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22 23:28:24 +01:00
Laura Garcia Liebana 0123a75e1d Revert "netfilter: nft_hash: add map lookups for hashing operations"
A better way to implement this from userspace has been found without
specific code in the kernel side, revert this.

Fixes: b9ccc07e3f ("netfilter: nft_hash: add map lookups for hashing operations")
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:59:47 +01:00
Florian Westphal 472caa6918 netfilter: nat: un-export nf_nat_used_tuple
Not used since 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:59:45 +01:00
wenxu 0fb4d21956 netfilter: nft_meta: Add NFT_META_I/OIFKIND meta type
In the ip_rcv the skb goes through the PREROUTING hook first, then kicks
in vrf device and go through the same hook again. When conntrack dnat
works with vrf, there will be some conflict with rules because the
packet goes through the hook twice with different nf status.

ip link add user1 type vrf table 1
ip link add user2 type vrf table 2
ip l set dev tun1 master user1
ip l set dev tun2 master user2

nft add table firewall
nft add chain firewall zones { type filter hook prerouting  priority - 300 \; }
nft add rule firewall zones counter ct zone set iif map { "tun1" : 1, "tun2" : 2 }
nft add chain firewall rule-1000-ingress
nft add rule firewall rule-1000-ingress ct zone 1 tcp dport 22 ct state new counter accept
nft add rule firewall rule-1000-ingress counter drop
nft add chain firewall rule-1000-egress
nft add rule firewall rule-1000-egress tcp dport 22 ct state new counter drop
nft add rule firewall rule-1000-egress counter accept

nft add chain firewall rules-all { type filter hook prerouting priority - 150 \; }
nft add rule firewall rules-all ip daddr vmap { "2.2.2.11" : jump rule-1000-ingress }
nft add rule firewall rules-all ct zone vmap { 1 : jump rule-1000-egress }

nft add rule firewall dnat-all ct zone vmap { 1 : jump dnat-1000 }
nft add rule firewall dnat-1000 ip daddr 2.2.2.11 counter dnat to 10.0.0.7

For a package with ip daddr 2.2.2.11 and tcp dport 22, first time accept in the
rule-1000-ingress and dnat to 10.0.0.7. Then second time the packet goto the wrong
chain rule-1000-egress which leads the packet drop

With this patch, userspace can add the 'don't re-do entire ruleset for
vrf' policy itself via:

nft add rule firewall rules-all meta iifkind "vrf" counter accept

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:58:20 +01:00
Pablo Neira Ayuso ba3fbe6636 netfilter: nf_conntrack: provide modparam to always register conntrack hooks
The connection tracking hooks can be optionally registered per netns
when conntrack is specifically invoked from the ruleset since
0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed
by ruleset"). Then, since 4d3a57f23d ("netfilter: conntrack: do not
enable connection tracking unless needed"), the default behaviour is
changed to always register them on demand.

This patch provides a toggle that allows users to always register them.
Without this toggle, in order to use conntrack for statistics
collection, you need a dummy rule that refers to conntrack, eg.

        iptables -I INPUT -m state --state NEW

This patch allows users to restore the original behaviour via modparam,
ie. always register connection tracking, eg.

        modprobe nf_conntrack enable_hooks=1

Hence, no dummy rule is required.

Reported-by: Laura Garcia <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 4a60dc748d netfilter: conntrack: remove nf_ct_l4proto_find_get
Its now same as __nf_ct_l4proto_find(), so rename that to
nf_ct_l4proto_find and use it everywhere.

It never returns NULL and doesn't need locks or reference counts.

Before this series:
302824  net/netfilter/nf_conntrack.ko
 21504  net/netfilter/nf_conntrack_proto_gre.ko

  text	   data	    bss	    dec	    hex	filename
  6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko

After:
294864  net/netfilter/nf_conntrack.ko
  text	   data	    bss	    dec	    hex	filename
106979	  19557	    240	 126776	  1ef38	nf_conntrack.ko

so, even with builtin gre, total size got reduced.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal e56894356f netfilter: conntrack: remove l4proto destroy hook
Only one user (gre), add a direct call and remove this facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 2a389de86e netfilter: conntrack: remove l4proto init and get_net callbacks
Those were needed we still had modular trackers.
As we don't have those anymore, prefer direct calls and remove all
the (un)register infrastructure associated with this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 70aed4647c netfilter: conntrack: remove sysctl registration helpers
After previous patch these are not used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal b884fa4617 netfilter: conntrack: unify sysctl handling
Due to historical reasons, all l4 trackers register their own
sysctls.

This leads to copy&pasted boilerplate code, that does exactly same
thing, just with different data structure.

Place all of this in a single file.

This allows to remove the various ctl_table pointers from the ct_netns
structure and reduces overall code size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 303e0c5589 netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups
after removal of the packet and invert function pointers, several
places do not need to lookup the l4proto structure anymore.

Remove those lookups.
The function nf_ct_invert_tuplepr becomes redundant, replace
it with nf_ct_invert_tuple everywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal edf0338dab netfilter: conntrack: remove pernet l4 proto register interface
No used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 44fb87f635 netfilter: conntrack: remove remaining l4proto indirect packet calls
Now that all l4trackers are builtin, no need to use a mix of direct and
indirect calls.
This removes the last two users: gre and the generic l4 protocol
tracker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal b184356d0a netfilter: conntrack: remove module owner field
No need to get/put module owner reference, none of these can be removed
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 197c4300ae netfilter: conntrack: remove invert_tuple callback
Only used by icmp(v6).  Prefer a direct call and remove this
function from the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal df5e162908 netfilter: conntrack: remove pkt_to_tuple callback
GRE is now builtin, so we can handle it via direct call and
remove the callback.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal 751fc301ec netfilter: conntrack: remove net_id
No users anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 22fc4c4c9f netfilter: conntrack: gre: switch module to be built-in
This makes the last of the modular l4 trackers 'bool'.

After this, all infrastructure to handle dynamic l4 protocol registration
becomes obsolete and can be removed in followup patches.

Old:
302824 net/netfilter/nf_conntrack.ko
 21504 net/netfilter/nf_conntrack_proto_gre.ko

New:
313728 net/netfilter/nf_conntrack.ko

Old:
   text	   data	    bss	    dec	    hex	filename
   6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
 108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko
New:
 112095	  21381	    240	 133716	  20a54	nf_conntrack.ko

The size increase is only temporary.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 202e651cd4 netfilter: conntrack: gre: convert rwlock to rcu
We can use gre.  Lock is only needed when a new expectation is added.

In case a single spinlock proves to be problematic we can either add one
per netns or use an array of locks combined with net_hash_mix() or similar
to pick the 'correct' one.

But given this is only needed for an expectation rather than per packet
a single one should be ok.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal e2e48b4716 netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls
rather than handling them via indirect call, use a direct one instead.
This leaves GRE as the last user of this indirect call facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal a47c540481 netfilter: conntrack: handle builtin l4proto packet functions via direct calls
The l4 protocol trackers are invoked via indirect call: l4proto->packet().

With one exception (gre), all l4trackers are builtin, so we can make
.packet optional and use a direct call for most protocols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Phil Sutter 75dd48e2e4 netfilter: nf_tables: Support RULE_ID reference in new rule
To allow for a batch to contain rules in arbitrary ordering, introduce
NFTA_RULE_POSITION_ID attribute which works just like NFTA_RULE_POSITION
but contains the ID of another rule within the same batch. This helps
iptables-nft-restore handling dumps with mixed insert/append commands
correctly.

Note that NFTA_RULE_POSITION takes precedence over
NFTA_RULE_POSITION_ID, so if the former is present, the latter is
ignored.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 8e2f311a68 netfilter: physdev: relax br_netfilter dependency
Following command:
  iptables -D FORWARD -m physdev ...
causes connectivity loss in some setups.

Reason is that iptables userspace will probe kernel for the module revision
of the physdev patch, and physdev has an artificial dependency on
br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
is loaded).

This causes the "phydev" module to be loaded, which in turn enables the
"call-iptables" infrastructure.

bridged packets might then get dropped by the iptables ruleset.

The better fix would be to change the "call-iptables" defaults to 0 and
enforce explicit setting to 1, but that breaks backwards compatibility.

This does the next best thing: add a request_module call to checkentry.
This was a stray '-D ... -m physdev' won't activate br_netfilter
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 827318feb6 netfilter: conntrack: remove helper hook again
place them into the confirm one.

Old:
 hook (300): ipv4/6_help() first call helper, then seqadj.
 hook (INT_MAX): confirm

Now:
 hook (INT_MAX): confirm, first call helper, then seqadj, then confirm

Not having the extra call is noticeable in bechmarks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 10870dd89e netfilter: nf_tables: add direct calls for all builtin expressions
With CONFIG_RETPOLINE its faster to add an if (ptr == &foo_func)
check and and use direct calls for all the built-in expressions.

~15% improvement in pathological cases.

checkpatch doesn't like the X macro due to the embedded return statement,
but the macro has a very limited scope so I don't think its a problem.

I would like to avoid bugs of the form
  If (e->ops->eval == (unsigned long)nft_foo_eval)
	 nft_bar_eval();

and open-coded if ()/else if()/else cascade, thus the macro.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal 4d44175aa5 netfilter: nf_tables: handle nft_object lookups via rhltable
Instead of linear search, use rhlist interface to look up the objects.
This fixes rulesets with thousands of named objects (quota, counters and
the like).

We only use a single table for this and consider the address of the
table we're doing the lookup in as a part of the key.

This reduces restore time of a sample ruleset with ~20k named counters
from 37 seconds to 0.8 seconds.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal d152159b89 netfilter: nf_tables: prepare nft_object for lookups via hashtable
Add a 'key' structure for object, so we can look them up by name + table
combination (the name can be the same in each table).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:32 +01:00
Florian Westphal b2e3d68d12 netfilter: nft_compat: destroy function must not have side effects
The nft_compat destroy function deletes the nft_xt object from a list.
This isn't allowed anymore. Destroy functions are called asynchronously,
i.e. next batch can find the object that has a pending ->destroy()
invocation:

cpu0                       cpu1
 worker
   ->destroy               for_each_entry()
	                     if (x == ...
			        return x->ops;
     list_del(x)
     kfree_rcu(x)
                           expr->ops->... // ops was free'd

To resolve this, the list_del needs to occur before the transaction
mutex gets released.  nf_tables has a 'deactivate' hook for this
purpose, so use that to unlink the object from the list.

Fixes: 0935d55884 ("netfilter: nf_tables: asynchronous release")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 02:29:42 +01:00
Florian Westphal cf52572ebb netfilter: nft_compat: make lists per netns
There are two problems with nft_compat since the netlink config
plane uses a per-netns mutex:

1. Concurrent add/del accesses to the same list
2. accesses to a list element after it has been free'd already.

This patch fixes the first problem.

Freeing occurs from a work queue, after transaction mutexes have been
released, i.e., it still possible for a new transaction (even from
same net ns) to find the to-be-deleted expression in the list.

The ->destroy functions are not allowed to have any such side effects,
i.e. the list_del() in the destroy function is not allowed.

This part of the problem is solved in the next patch.
I tried to make this work by serializing list access via mutex
and by moving list_del() to a deactivate callback, but
Taehee spotted following race on this approach:

  NET #0                          NET #1
   >select_ops()
   ->init()
                                   ->select_ops()
   ->deactivate()
   ->destroy()
      nft_xt_put()
       kfree_rcu(xt, rcu_head);
                                   ->init() <-- use-after-free occurred.

Unfortunately, we can't increment reference count in
select_ops(), because we can't undo the refcount increase in
case a different expression fails in the same batch.

(The destroy hook will only be called in case the expression
 was initialized successfully).

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 02:29:42 +01:00