Commit graph

4306 commits

Author SHA1 Message Date
Tom Herbert 97a6ec4ac0 rhashtable: Change rhashtable_walk_start to return void
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
like this is common:

       ret = rhashtable_walk_start(rhiter);
       if (ret && ret != -EAGAIN)
               goto out;

Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.

This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in mulitple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11 09:58:38 -05:00
Florian Westphal f5b5702ac5 netfilter: exthdr: add missign attributes to policy
Add missing netlink attribute policy.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-12-11 13:46:04 +01:00
Kevin Cernekee 916a27901d netfilter: xt_osf: Add missing permission checks
The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    vpnns -- nfnl_osf -f /tmp/pf.os

    vpnns -- nfnl_osf -f /tmp/pf.os -d

These non-root operations successfully modify the systemwide OS
fingerprint list.  Add new capable() checks so that they can't.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-12-06 09:01:18 +01:00
Jann Horn 6ab405114b netfilter: xt_bpf: add overflow checks
Check whether inputs from userspace are too long (explicit length field too
big or string not null-terminated) to avoid out-of-bounds reads.

As far as I can tell, this can at worst lead to very limited kernel heap
memory disclosure or oopses.

This bug can be triggered by an unprivileged user even if the xt_bpf module
is not loaded: iptables is available in network namespaces, and the xt_bpf
module can be autoloaded.

Triggering the bug with a classic BPF filter with fake length 0x1000 causes
the following KASAN report:

==================================================================
BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0x84/0xf0
Read of size 32768 at addr ffff8801eff2c494 by task test/4627

CPU: 0 PID: 4627 Comm: test Not tainted 4.15.0-rc1+ #1
[...]
Call Trace:
 dump_stack+0x5c/0x85
 print_address_description+0x6a/0x260
 kasan_report+0x254/0x370
 ? bpf_prog_create+0x84/0xf0
 memcpy+0x1f/0x50
 bpf_prog_create+0x84/0xf0
 bpf_mt_check+0x90/0xd6 [xt_bpf]
[...]
Allocated by task 4627:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc_node+0x47/0x60
 xt_alloc_table_info+0x41/0x70 [x_tables]
[...]
The buggy address belongs to the object at ffff8801eff2c3c0
                which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 212 bytes inside of
                2048-byte region [ffff8801eff2c3c0, ffff8801eff2cbc0)
[...]
==================================================================

Fixes: e6f30c7317 ("netfilter: x_tables: add xt_bpf match")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-12-04 11:41:20 +01:00
Kevin Cernekee 4b380c42f7 netfilter: nfnetlink_cthelper: Add missing permission checks
The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, nfnl_cthelper_list is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    $ nfct helper list
    nfct v1.4.4: netlink error: Operation not permitted
    $ vpnns -- nfct helper list
    {
            .name = ftp,
            .queuenum = 0,
            .l3protonum = 2,
            .l4protonum = 6,
            .priv_data_len = 24,
            .status = enabled,
    };

Add capable() checks in nfnetlink_cthelper, as this is cleaner than
trying to generalize the solution.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-12-04 11:30:09 +01:00
Al Viro 7edffd25be ipvs: switch to sock_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-02 20:38:08 -05:00
David Miller b6ca8bd5a9 xfrm: Move child route linkage into xfrm_dst.
XFRM bundle child chains look like this:

	xdst1 --> xdst2 --> xdst3 --> path_dst

All of xdstN are xfrm_dst objects and xdst->u.dst.xfrm is non-NULL.
The final child pointer in the chain, here called 'path_dst', is some
other kind of route such as an ipv4 or ipv6 one.

The xfrm output path pops routes, one at a time, via the child
pointer, until we hit one which has a dst->xfrm pointer which
is NULL.

We can easily preserve the above mechanisms with child sitting
only in the xfrm_dst structure.  All children in the chain
before we break out of the xfrm_output() loop have dst->xfrm
non-NULL and are therefore xfrm_dst objects.

Since we break out of the loop when we find dst->xfrm NULL, we
will not try to dereference 'dst' as if it were an xfrm_dst.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-30 09:54:26 -05:00
Paul E. McKenney ffa53c5863 netfilter: Eliminate cond_resched_rcu_qs() in favor of cond_resched()
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs().  This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <netfilter-devel@vger.kernel.org>
2017-11-28 16:00:27 -08:00
Jay Elliott 8b1836c4b6 netfilter: conntrack: clamp timeouts to INT_MAX
When the conntracking code multiplies a timeout by HZ, it can overflow
from positive to negative; this causes it to instantly expire.  To
protect against this the multiplication is done in 64-bit so we can
prevent it from exceeding INT_MAX.

Signed-off-by: Jay Elliott <jelliott@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-28 01:17:04 +01:00
Kees Cook e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Florian Westphal fbcd253d24 netfilter: conntrack: lower timeout to RETRANS seconds if window is 0
When zero window is announced we can get into a situation where
connection stays around forever:

1. One side announces zero window.
2. Other side closes.

In this case, no FIN is sent (stuck in send queue).

Unless other side opens the window up again conntrack
stays in ESTABLISHED state for a very long time.

Lets alleviate this by lowering the timeout to RETRANS (5 minutes),
the other end should be sending zero window probes to keep the
connection established as long as a socket still exists.

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-20 13:30:24 +01:00
Eric Sesterhenn ec8a8f3c31 netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well
This patch fixes several out of bounds memory reads by extending
the nf_h323_error_boundary() function to work on bits as well
an check the affected parts.

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-20 12:03:41 +01:00
Eric Sesterhenn bc7d811ace netfilter: nf_ct_h323: Convert CHECK_BOUND macro to function
It is bad practive to return in a macro, this patch
moves the check into a function.

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-20 12:03:41 +01:00
Vasily Averin 613d0776d3 netfilter: exit_net cleanup check added
Be sure that lists initialized in net_init hook was return to initial
state.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-20 12:03:41 +01:00
Arnd Bergmann 8252fceac0 netfilter: add ifdef around ctnetlink_proto_size
This function is no longer marked 'inline', so we now get a warning
when it is unused:

net/netfilter/nf_conntrack_netlink.c:536:15: error: 'ctnetlink_proto_size' defined but not used [-Werror=unused-function]

We could mark it inline again, mark it __maybe_unused, or add an #ifdef
around the definition. I'm picking the third approach here since that
seems to be what the rest of the file has.

Fixes: 5caaed151a ("netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16 10:49:00 +09:00
Linus Torvalds 1be2172e96 Modules updates for v4.15
Summary of modules changes for the 4.15 merge window:
 
 - Treewide module_param_call() cleanup, fix up set/get function
   prototype mismatches, from Kees Cook
 
 - Minor code cleanups
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJaDCyzAAoJEMBFfjjOO8FyaYQP/AwHBy6XmwwVlWDP4BqIF6hL
 Vhy3ccVLYEORvePv68tWSRPUz5n6+1Ebqanmwtkw6i8l+KwxY2SfkZql09cARc33
 2iBE4bHF98iWQmnJbF6me80fedY9n5bZJNMQKEF9VozJWwTMOTQFTCfmyJRDBmk9
 iidQj6M3idbSUOYIJjvc40VGx5NyQWSr+FFfqsz1rU5iLGRGEvA3I2/CDT0oTuV6
 D4MmFxzE2Tv/vIMa2GzKJ1LGScuUfSjf93Lq9Kk0cG36qWao8l930CaXyVdE9WJv
 bkUzpf3QYv/rDX6QbAGA0cada13zd+dfBr8YhchclEAfJ+GDLjMEDu04NEmI6KUT
 5lP0Xw0xYNZQI7bkdxDMhsj5jaz/HJpXCjPCtZBnSEKiL4OPXVMe+pBHoCJ2/yFN
 6M716XpWYgUviUOdiE+chczB5p3z4FA6u2ykaM4Tlk0btZuHGxjcSWwvcIdlPmjm
 kY4AfDV6K0bfEBVguWPJicvrkx44atqT5nWbbPhDwTSavtsuRJLb3GCsHedx7K8h
 ZO47lCQFAWCtrycK1HYw+oupNC3hYWQ0SR42XRdGhL1bq26C+1sei1QhfqSgA9PQ
 7CwWH4UTOL9fhtrzSqZngYOh9sjQNFNefqQHcecNzcEjK2vjrgQZvRNWZKHSwaFs
 fbGX8juZWP4ypbK+irTB
 =c8vb
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 4.15 merge window:

   - treewide module_param_call() cleanup, fix up set/get function
     prototype mismatches, from Kees Cook

   - minor code cleanups"

* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Do not paper over type mismatches in module_param_call()
  treewide: Fix function prototypes for module_param_call()
  module: Prepare to convert all module_param_call() prototypes
  kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
2017-11-15 13:46:33 -08:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Linus Torvalds 2bcc673101 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Yet another big pile of changes:

   - More year 2038 work from Arnd slowly reaching the point where we
     need to think about the syscalls themself.

   - A new timer function which allows to conditionally (re)arm a timer
     only when it's either not running or the new expiry time is sooner
     than the armed expiry time. This allows to use a single timer for
     multiple timeout requirements w/o caring about the first expiry
     time at the call site.

   - A new NMI safe accessor to clock real time for the printk timestamp
     work. Can be used by tracing, perf as well if required.

   - A large number of timer setup conversions from Kees which got
     collected here because either maintainers requested so or they
     simply got ignored. As Kees pointed out already there are a few
     trivial merge conflicts and some redundant commits which was
     unavoidable due to the size of this conversion effort.

   - Avoid a redundant iteration in the timer wheel softirq processing.

   - Provide a mechanism to treat RTC implementations depending on their
     hardware properties, i.e. don't inflict the write at the 0.5
     seconds boundary which originates from the PC CMOS RTC to all RTCs.
     No functional change as drivers need to be updated separately.

   - The usual small updates to core code clocksource drivers. Nothing
     really exciting"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits)
  timers: Add a function to start/reduce a timer
  pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
  timer: Prepare to change all DEFINE_TIMER() callbacks
  netfilter: ipvs: Convert timers to use timer_setup()
  scsi: qla2xxx: Convert timers to use timer_setup()
  block/aoe: discover_timer: Convert timers to use timer_setup()
  ide: Convert timers to use timer_setup()
  drbd: Convert timers to use timer_setup()
  mailbox: Convert timers to use timer_setup()
  crypto: Convert timers to use timer_setup()
  drivers/pcmcia: omap1: Fix error in automated timer conversion
  ARM: footbridge: Fix typo in timer conversion
  drivers/sgi-xp: Convert timers to use timer_setup()
  drivers/pcmcia: Convert timers to use timer_setup()
  drivers/memstick: Convert timers to use timer_setup()
  drivers/macintosh: Convert timers to use timer_setup()
  hwrng/xgene-rng: Convert timers to use timer_setup()
  auxdisplay: Convert timers to use timer_setup()
  sparc/led: Convert timers to use timer_setup()
  mips: ip22/32: Convert timers to use timer_setup()
  ...
2017-11-13 17:56:58 -08:00
Kees Cook 8ef81c6548 netfilter: ipvs: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
2017-11-08 15:53:58 -08:00
David S. Miller 2eb3ed33e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next
tree, they are:

1) Speed up table replacement on busy systems with large tables
   (and many cores) in x_tables. Now xt_replace_table() synchronizes by
   itself by waiting until all cpus had an even seqcount and we use no
   use seqlock when fetching old counters, from Florian Westphal.

2) Add nf_l4proto_log_invalid() and nf_ct_l4proto_log_invalid() to speed
   up packet processing in the fast path when logging is not enabled, from
   Florian Westphal.

3) Precompute masked address from configuration plane in xt_connlimit,
   from Florian.

4) Don't use explicit size for set selection if performance set policy
   is selected.

5) Allow to get elements from an existing set in nf_tables.

6) Fix incorrect check in nft_hash_deactivate(), from Florian.

7) Cache netlink attribute size result in l4proto->nla_size, from
   Florian.

8) Handle NFPROTO_INET in nf_ct_netns_get() from conntrack core.

9) Use power efficient workqueue in conntrack garbage collector, from
   Vincent Guittot.

10) Remove unnecessary parameter, in conntrack l4proto functions, also
    from Florian.

11) Constify struct nf_conntrack_l3proto definitions, from Florian.

12) Remove all typedefs in nf_conntrack_h323 via coccinelle semantic
    patch, from Harsha Sharma.

13) Don't store address in the rbtree nodes in xt_connlimit, they are
    never used, from Florian.

14) Fix out of bound access in the conntrack h323 helper, patch from
    Eric Sesterhenn.

15) Print symbols for the address returned with %pS in IPVS, from
    Helge Deller.

16) Proc output should only display its own netns in IPVS, from
    KUWAZAWA Takuya.

17) Small clean up in size_entry_mwt(), from Colin Ian King.

18) Use test_and_clear_bit from nf_nat_proto_clean() instead of separated
    non-atomic test and then clear bit, from Florian Westphal.

19) Consolidate prefix length maps in ipset, from Aaron Conole.

20) Fix sparse warnings in ipset, from Jozsef Kadlecsik.

21) Simplify list_set_memsize(), from simran singhal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 14:22:50 +09:00
Ingo Molnar 8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Pablo Neira Ayuso ba0e4d9917 netfilter: nf_tables: get set elements via netlink
This patch adds a new get operation to look up for specific elements in
a set via netlink interface. You can also use it to check if an interval
already exists.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 01:00:31 +01:00
Pablo Neira Ayuso 644e334eee netfilter: nf_tables: performance set policy skips size description in selection
Use the complexity and space notations if policy is performance, this
results in placing the bitmap set representation over the hashtable for
key <= 16 for better performance as we discussed during the last NFWS in
Faro, Portugal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 01:00:30 +01:00
Vincent Guittot 0984d427c1 netfilter: conntrack: use power efficient workqueue
conntrack uses the bounded system_long_wq workqueue for its works that
don't have to run on the cpu they have been queued.
Using bounded workqueue prevents the scheduler to make smart decision about
the best place to schedule the work.

This patch replaces system_long_wq with system_power_efficient_wq. the work
stays bounded to a cpu by default unless the CONFIG_WQ_POWER_EFFICIENT is
enable. In the latter case, the work can be scheduled on the best cpu from
a power or a performance point of view.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 00:59:10 +01:00
Pablo Neira Ayuso 7e35ec0e80 netfilter: conntrack: move nf_ct_netns_{get,put}() to core
So we can call this from other expression that need conntrack in place
to work.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2017-11-06 16:48:39 +01:00
Florian Westphal 5caaed151a netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size
We currently call ->nlattr_tuple_size() once at register time and
cache result in l4proto->nla_size.

nla_size is the only member that is written to, avoiding this would
allow to make l4proto trackers const.

We can use ->nlattr_tuple_size() at run time, and cache result in
the individual trackers instead.

This is an intermediate step, next patch removes nlattr_size()
callback and computes size at compile time, then removes nla_size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 16:48:38 +01:00
Florian Westphal 7f4dae2d7f netfilter: nft_hash: fix nft_hash_deactivate
Jindřich Makovička says:
  The logical OR looks fishy to me. Shouldn't be && there instead?

Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1199
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 16:48:37 +01:00
Florian Westphal b1fc1372c4 netfilter: xt_connlimit: remove mask argument
Instead of passing mask to all the helpers, just fixup the search key
early.

After rbtree conversion, each rbtree node stores connections of same
'addr & mask', so no need to pass the mask too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:47:30 +01:00
KUWAZAWA Takuya c5504f724c netfilter: ipvs: Fix inappropriate output of procfs
Information about ipvs in different network namespace can be seen via procfs.

How to reproduce:

  # ip netns add ns01
  # ip netns add ns02
  # ip netns exec ns01 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns02 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80
  # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80

The ipvsadm displays information about its own network namespace only.

  # ip netns exec ns01 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  10.1.1.1:80 wlc

  # ip netns exec ns02 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  10.1.1.2:80 wlc

But I can see information about other network namespace via procfs.

  # ip netns exec ns01 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP  0A010101:0050 wlc
  TCP  0A010102:0050 wlc

  # ip netns exec ns02 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP  0A010102:0050 wlc

Signed-off-by: KUWAZAWA Takuya <albatross0@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:47:22 +01:00
Helge Deller c5cc0c6971 netfilter: ipvs: Use %pS printk format for direct addresses
The debug and error printk functions in ipvs uses wrongly the %pF instead of
the %pS printk format specifier for printing symbols for the address returned
by _builtin_return_address(0). Fix it for the ia64, ppc64 and parisc64
architectures.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:44:20 +01:00
David S. Miller 2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Linus Torvalds 7ba3ebff9c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Hopefully this is the last batch of networking fixes for 4.14

  Fingers crossed...

   1) Fix stmmac to use the proper sized OF property read, from Bhadram
      Varka.

   2) Fix use after free in net scheduler tc action code, from Cong
      Wang.

   3) Fix SKB control block mangling in tcp_make_synack().

   4) Use proper locking in fib_dump_info(), from Florian Westphal.

   5) Fix IPG encodings in systemport driver, from Florian Fainelli.

   6) Fix division by zero in NV TCP congestion control module, from
      Konstantin Khlebnikov.

   7) Fix use after free in nf_reject_ipv4, from Tejaswi Tanikella"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: systemport: Correct IPG length settings
  tcp: do not mangle skb->cb[] in tcp_make_synack()
  fib: fib_dump_info can no longer use __in_dev_get_rtnl
  stmmac: use of_property_read_u32 instead of read_u8
  net_sched: hold netns refcnt for each action
  net_sched: acquire RTNL in tc_action_net_exit()
  net: vrf: correct FRA_L3MDEV encode type
  tcp_nv: fix division by zero in tcpnv_acked()
  netfilter: nf_reject_ipv4: Fix use-after-free in send_reset
  netfilter: nft_set_hash: disable fast_ops for 2-len keys
2017-11-03 09:09:21 -07:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Kees Cook e4dca7b7aa treewide: Fix function prototypes for module_param_call()
Several function prototypes for the set/get functions defined by
module_param_call() have a slightly wrong argument types. This fixes
those in an effort to clean up the calls when running under type-enforced
compiler instrumentation for CFI. This is the result of running the
following semantic patch:

@match_module_param_call_function@
declarer name module_param_call;
identifier _name, _set_func, _get_func;
expression _arg, _mode;
@@

 module_param_call(_name, _set_func, _get_func, _arg, _mode);

@fix_set_prototype
 depends on match_module_param_call_function@
identifier match_module_param_call_function._set_func;
identifier _val, _param;
type _val_type, _param_type;
@@

 int _set_func(
-_val_type _val
+const char * _val
 ,
-_param_type _param
+const struct kernel_param * _param
 ) { ... }

@fix_get_prototype
 depends on match_module_param_call_function@
identifier match_module_param_call_function._get_func;
identifier _val, _param;
type _val_type, _param_type;
@@

 int _get_func(
-_val_type _val
+char * _val
 ,
-_param_type _param
+const struct kernel_param * _param
 ) { ... }

Two additional by-hand changes are included for places where the above
Coccinelle script didn't notice them:

	drivers/platform/x86/thinkpad_acpi.c
	fs/lockd/svc.c

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-10-31 15:30:37 +01:00
Anatole Denis 0414c78f14 netfilter: nft_set_hash: disable fast_ops for 2-len keys
jhash_1word of a u16 is a different value from jhash of the same u16 with
length 2.
Since elements are always inserted in sets using jhash over the actual
klen, this would lead to incorrect lookups on fixed-size sets with a key
length of 2, as they would be inserted with hash value jhash(key, 2) and
looked up with hash value jhash_1word(key), which is different.

Example reproducer(v4.13+), using anonymous sets which always have a
fixed size:

  table inet t {
      chain c {
                  type filter hook output priority 0; policy accept;
                  tcp dport { 10001, 10003, 10005, 10007, 10009 } counter packets 4 bytes 240 reject
                  tcp dport 10001 counter packets 4 bytes 240 reject
                  tcp dport 10003 counter packets 4 bytes 240 reject
                  tcp dport 10005 counter packets 4 bytes 240 reject
                  tcp dport 10007 counter packets 0 bytes 0 reject
                  tcp dport 10009 counter packets 4 bytes 240 reject
          }
  }

then use nc -z localhost <port> to probe; incorrectly hashed ports will
pass through the set lookup and increment the counter of an individual
rule.

jhash being seeded with a random value, it is not deterministic which
ports will incorrectly hash, but in testing with 5 ports in the set I
always had 4 or 5 with an incorrect hash value.

Signed-off-by: Anatole Denis <anatole@rezel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-25 14:46:44 +02:00
Mark Rutland 14cd5d4a01 locking/atomics, net/netlink/netfilter: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts netlink and netfilter code and comments to use
{READ,WRITE}_ONCE() consistently.

----
virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:00:59 +02:00
Eric Sesterhenn e8daf27c2f netfilter: nf_ct_h323: Out Of Bound Read in Netfilter Conntrack
Add missing counter decrement to prevent out of bounds memory read.

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:29:15 +02:00
Florian Westphal ce49480dba netfilter: xt_connlimit: don't store address in the conn nodes
Only stored, never read.  This is a leftover from commit 7d08487777
("netfilter: connlimit: use rbtree for per-host conntrack obj storage"),
which added the rbtree node struct that stores the address instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:50 +02:00
Harsha Sharma 67704c2a05 netfilter: nf_conntrack_h323: Remove typedef struct
Remove typedef from struct as linux-kernel coding style tends to
avoid using typedefs.
Done using following coccinelle semantic patch

@r1@
type T;
@@

typedef struct { ... } T;

@script:python c1@
T2;
T << r1.T;
@@
if T[-2:] =="_t" or T[-2:] == "_T":
        coccinelle.T2 = T[:-2];
else:
        coccinelle.T2 = T;

print T, coccinelle.T2

@r2@
type r1.T;
identifier c1.T2;
@@
-typedef
struct
+ T2
{ ... }
-T
;

@r3@
type r1.T;
identifier c1.T2;
@@
-T
+struct T2

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:50 +02:00
Florian Westphal 80055dab5d netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore
xt_replace_table relies on table replacement counter retrieval (which
uses xt_recseq to synchronize pcpu counters).

This is fine, however with large rule set get_counters() can take
a very long time -- it needs to synchronize all counters because
it has to assume concurrent modifications can occur.

Make xt_replace_table synchronize by itself by waiting until all cpus
had an even seqcount.

This allows a followup patch to copy the counters of the old ruleset
without any synchonization after xt_replace_table has completed.

Cc: Dan Williams <dcbw@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:50 +02:00
Florian Westphal eb6fad5a4a netfilter: conntrack: remove pf argument from l4 packet functions
not needed/used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:49 +02:00
Florian Westphal 3d0b527bc9 netfilter: conntrack: add and use nf_ct_l4proto_log_invalid
We currently pass down the l4 protocol to the conntrack ->packet()
function, but the only user of this is the debug info decision.

Same information can be derived from struct nf_conn.
Add a wrapper for the previous patch that extracs the information
from nf_conn and passes it to nf_l4proto_log_invalid().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:49 +02:00
Florian Westphal c4f3db1595 netfilter: conntrack: add and use nf_l4proto_log_invalid
We currently pass down the l4 protocol to the conntrack ->packet()
function, but the only user of this is the debug info decision.

Same information can be derived from struct nf_conn.
As a first step, add and use a new log function for this, similar to
nf_ct_helper_log().

Add __cold annotation -- invalid packets should be infrequent so
gcc can consider all call paths that lead to such a function as
unlikely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 18:01:49 +02:00
Florian Westphal 2420770b3f netfilter: nat: use test_and_clear_bit when deleting ct from bysource list
We can use a single statement for this.
While at it, fixup the comment -- we don't have pernet table/ops
anymore, the function is only called from module exit path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-24 17:54:47 +02:00
Chenbo Feng 6e71b04a82 bpf: Add file mode configuration into bpf maps
Introduce the map read/write flags to the eBPF syscalls that returns the
map fd. The flags is used to set up the file mode when construct a new
file descriptor for bpf maps. To not break the backward capability, the
f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise
it should be O_RDONLY or O_WRONLY. When the userspace want to modify or
read the map content, it will check the file mode to see if it is
allowed to make the change.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:32:59 +01:00
Kees Cook a92c5751b9 netfilter: ipset: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly. This introduces a pointer back to the
struct ip_set, which is used instead of the struct timer_list .data field.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: simran singhal <singhalsimran0@gmail.com>
Cc: Muhammad Falak R Wani <falakreyaz@gmail.com>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:39:54 +01:00
Shmulik Ladkani 98589a0998 netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
Commit 2c16d60332 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.

However this breaks subsequent iptables calls:

 # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
 # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
 iptables: Invalid argument. Run `dmesg' for more information.

That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.

However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.

One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.

However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.

This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.

It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.

Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.

References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
            [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2

Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-09 15:18:04 +02:00
Eric Dumazet e466af75c0 netfilter: x_tables: avoid stack-out-of-bounds read in xt_copy_counters_from_user
syzkaller reports an out of bound read in strlcpy(), triggered
by xt_copy_counters_from_user()

Fix this by using memcpy(), then forcing a zero byte at the last position
of the destination, as Florian did for the non COMPAT code.

Fixes: d7591f0c41 ("netfilter: x_tables: introduce and use xt_copy_counters_from_user")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-06 15:04:05 +02:00
Pablo Neira Ayuso 5f9bfe0ef6 netfilter: nf_tables: do not dump chain counters if not enabled
Chain counters are only enabled on demand since 9f08ea8481, skip them
when dumping them via netlink.

Fixes: 9f08ea8481 ("netfilter: nf_tables: keep chain counters away from hot path")
Reported-by: Johny Mattsson <johny.mattsson+kernel@gmail.com>
Tested-by: Johny Mattsson <johny.mattsson+kernel@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-06 14:49:19 +02:00
Aaron Conole 63c2af90e5 netfilter: ipset: deduplicate prefixlen maps
The prefixlen maps used here are identical, and have been since
introduction.  It seems to make sense to use a single large map,
that the preprocessor will fill appropriately.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-04 16:19:06 +02:00
Jozsef Kadlecsik 8851e799ff netfilter: ipset: Fix sparse warnings
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-04 16:19:06 +02:00
simran singhal 522e89d60b netfilter: ipset: Compress return logic
Simplify function returns by merging assignment and return into one
command line.

Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-04 16:19:06 +02:00
Arvind Yadav e63aaaa6be netfilter: nf_tables: Release memory obtained by kasprintf
Free memory region, if nf_tables_set_alloc_name is not successful.

Fixes: 387454901b ("netfilter: nf_tables: Allow set names of up to 255 chars")
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-10-03 15:21:19 +02:00
JingPiao Chen 0d18779be1 netfilter: nf_tables: fix update chain error
# nft add table filter
 # nft add chain filter c1
 # nft rename chain filter c1 c2

Error: Could not process rule: No such file or directory
rename chain filter c1 c2
^^^^^^^^^^^^^^^^^^^^^^^^^^

 # nft add chain filter c2
 # nft rename chain filter c1 c2
 # nft list table filter

table ip filter {
	chain c2 {
	}

	chain c2 {
	}
}

Fixes: 664b0f8cd8 ("netfilter: nf_tables: add generation mask to chains")
Signed-off-by: JingPiao Chen <chenjingpiao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-29 13:34:36 +02:00
Ross Lagerwall e5173418ac netfilter: ipset: Fix race between dump and swap
Fix a race between ip_set_dump_start() and ip_set_swap().
The race is as follows:
* Without holding the ref lock, ip_set_swap() checks ref_netlink of the
  set and it is 0.
* ip_set_dump_start() takes a reference on the set.
* ip_set_swap() does the swap (even though it now has a non-zero
  reference count).
* ip_set_dump_start() gets the set from ip_set_list again which is now a
  different set since it has been swapped.
* ip_set_dump_start() calls __ip_set_put_netlink() and hits a BUG_ON due
  to the reference count being 0.

Fix this race by extending the critical region in which the ref lock is
held to include checking the ref counts.

The race can be reproduced with the following script:
  while :; do
    ipset destroy hash_ip1
    ipset destroy hash_ip2
    ipset create hash_ip1 hash:ip family inet hashsize 1024 \
        maxelem 500000
    ipset create hash_ip2 hash:ip family inet hashsize 300000 \
        maxelem 500000
    ipset create hash_ip3 hash:ip family inet hashsize 1024 \
        maxelem 500000
    ipset save &
    ipset swap hash_ip3 hash_ip2
    ipset destroy hash_ip3
    wait
  done

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-29 12:15:14 +02:00
Florian Westphal e23ed762db netfilter: ipset: pernet ops must be unregistered last
Removing the ipset module leaves a small window where one cpu performs
module removal while another runs a command like 'ipset flush'.

ipset uses net_generic(), unregistering the pernet ops frees this
storage area.

Fix it by first removing the user-visible api handlers and the pernet
ops last.

Fixes: 1785e8f473 ("netfiler: ipset: Add net namespace for ipset")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:15:17 +02:00
Jozsef Kadlecsik 48596a8ddc netfilter: ipset: Fix adding an IPv4 range containing more than 2^31 addresses
Wrong comparison prevented the hash types to add a range with more than
2^31 addresses but reported as a success.

Fixes Netfilter's bugzilla id #1005, reported by Oleg Serditov and
Oliver Ford.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:15:04 +02:00
Subash Abhinov Kasiviswanathan 89fcbb564f netfilter: xt_socket: Restore mark from full sockets only
An out of bounds error was detected on an ARM64 target with
Android based kernel 4.9. This occurs while trying to
restore mark on a skb from an inet request socket.

BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 net/netfilter/xt_socket.c:248
Read of size 4 at addr ffffffc06a8d824c by task syz-fuzzer/1532
CPU: 7 PID: 1532 Comm: syz-fuzzer Tainted: G        W  O    4.9.41+ #1
Call trace:
[<ffffff900808d2f8>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:76
[<ffffff900808d760>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226
[<ffffff90085f7dc8>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffff90085f7dc8>] dump_stack+0xe4/0x134 lib/dump_stack.c:51
[<ffffff900830f358>] print_address_description+0x68/0x258 mm/kasan/report.c:248
[<ffffff900830f770>] kasan_report_error mm/kasan/report.c:347 [inline]
[<ffffff900830f770>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371
[<ffffff900830fdec>] kasan_report+0x5c/0x70 mm/kasan/report.c:372
[<ffffff900830de98>] check_memory_region_inline mm/kasan/kasan.c:308 [inline]
[<ffffff900830de98>] __asan_load4+0x88/0xa0 mm/kasan/kasan.c:740
[<ffffff90097498f8>] socket_match.isra.2+0xc8/0x1f0 net/netfilter/xt_socket.c:248
[<ffffff9009749a5c>] socket_mt4_v1_v2_v3+0x3c/0x48 net/netfilter/xt_socket.c:272
[<ffffff90097f7e4c>] ipt_do_table+0x54c/0xad8 net/ipv4/netfilter/ip_tables.c:311
[<ffffff90097fcf14>] iptable_mangle_hook+0x6c/0x220 net/ipv4/netfilter/iptable_mangle.c:90
...
Allocated by task 1532:
 save_stack_trace_tsk+0x0/0x2a0 arch/arm64/kernel/stacktrace.c:131
 save_stack_trace+0x28/0x38 arch/arm64/kernel/stacktrace.c:215
 save_stack mm/kasan/kasan.c:495 [inline]
 set_track mm/kasan/kasan.c:507 [inline]
 kasan_kmalloc+0xd8/0x188 mm/kasan/kasan.c:599
 kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:537
 slab_post_alloc_hook mm/slab.h:417 [inline]
 slab_alloc_node mm/slub.c:2728 [inline]
 slab_alloc mm/slub.c:2736 [inline]
 kmem_cache_alloc+0x14c/0x2e8 mm/slub.c:2741
 reqsk_alloc include/net/request_sock.h:87 [inline]
 inet_reqsk_alloc+0x4c/0x238 net/ipv4/tcp_input.c:6236
 tcp_conn_request+0x2b0/0xea8 net/ipv4/tcp_input.c:6341
 tcp_v4_conn_request+0xe0/0x100 net/ipv4/tcp_ipv4.c:1256
 tcp_rcv_state_process+0x384/0x18a8 net/ipv4/tcp_input.c:5926
 tcp_v4_do_rcv+0x2f0/0x3e0 net/ipv4/tcp_ipv4.c:1430
 tcp_v4_rcv+0x1278/0x1350 net/ipv4/tcp_ipv4.c:1709
 ip_local_deliver_finish+0x174/0x3e0 net/ipv4/ip_input.c:216

v1->v2: Change socket_mt6_v1_v2_v3() as well as mentioned by Eric
v2->v3: Put the correct fixes tag

Fixes: 01555e74bd ("netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flag")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 20:04:34 +02:00
Vadim Fedorenko b621129f4f netfilter: ipvs: full-functionality option for ECN encapsulation in tunnel
IPVS tunnel mode works as simple tunnel (see RFC 3168) copying ECN field
to outer header. That's result in packet drops on egress tunnels in case
the egress tunnel operates as ECN-capable with Full-functionality option
(like ip_tunnel and ip6_tunnel kernel modules), according to RFC 3168
section 9.1.1 recommendation.

This patch implements ECN full-functionality option into ipvs xmit code.

Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-26 14:06:33 +02:00
Vishwanath Pai 7f4f7dd441 netfilter: ipset: ipset list may return wrong member count for set with timeout
Simple testcase:

$ ipset create test hash:ip timeout 5
$ ipset add test 1.2.3.4
$ ipset add test 1.2.2.2
$ sleep 5

$ ipset l
Name: test
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536 timeout 5
Size in memory: 296
References: 0
Number of entries: 2
Members:

We return "Number of entries: 2" but no members are listed. That is
because mtype_list runs "ip_set_timeout_expired" and does not list the
expired entries, but set->elements is never upated (until mtype_gc
cleans it up later).

Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-18 17:35:32 +02:00
Geert Uytterhoeven b0ade85165 netfilter: nat: Do not use ARRAY_SIZE() on spinlocks to fix zero div
If no spinlock debugging options (CONFIG_GENERIC_LOCKBREAK,
CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_LOCK_ALLOC) are enabled on a UP
platform (e.g. m68k defconfig), arch_spinlock_t is an empty struct,
hence using ARRAY_SIZE(nf_nat_locks) causes a division by zero:

    net/netfilter/nf_nat_core.c: In function ‘nf_nat_setup_info’:
    net/netfilter/nf_nat_core.c:432: warning: division by zero
    net/netfilter/nf_nat_core.c: In function ‘__nf_nat_cleanup_conntrack’:
    net/netfilter/nf_nat_core.c:535: warning: division by zero
    net/netfilter/nf_nat_core.c:537: warning: division by zero
    net/netfilter/nf_nat_core.c: In function ‘nf_nat_init’:
    net/netfilter/nf_nat_core.c:810: warning: division by zero
    net/netfilter/nf_nat_core.c:811: warning: division by zero
    net/netfilter/nf_nat_core.c:824: warning: division by zero

Fix this by using the CONNTRACK_LOCKS definition instead.

Suggested-by: Florian Westphal <fw@strlen.de>
Fixes: 8073e960a0 ("netfilter: nat: use keyed locks")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-18 17:33:23 +02:00
Vishwanath Pai 90c4ae4e2c netfilter: xt_hashlimit: fix build error caused by 64bit division
64bit division causes build/link errors on 32bit architectures. It
prints out error messages like:

ERROR: "__aeabi_uldivmod" [net/netfilter/xt_hashlimit.ko] undefined!

The value of avg passed through by userspace in BYTE mode cannot exceed
U32_MAX. Which means 64bit division in user2rate_bytes is unnecessary.
To fix this I have changed the type of param 'user' to u32.

Since anything greater than U32_MAX is an invalid input we error out in
hashlimit_mt_check_common() when this is the case.

Changes in v2:
	Making return type as u32 would cause an overflow for small
	values of 'user' (for example 2, 3 etc). To avoid this I bumped up
	'r' to u64 again as well as the return type. This is OK since the
	variable that stores the result is u64. We still avoid 64bit
	division here since 'user' is u32.

Fixes: bea74641e3 ("netfilter: xt_hashlimit: add rate match mode")
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:53 +02:00
Zhizhou Tian 05d0eae7c1 netfilter: xt_hashlimit: alloc hashtable with right size
struct xt_byteslimit_htable used hlist_head, but memory allocation is
done through sizeof(struct list_head).

Signed-off-by: Zhizhou Tian <zhizhou.tian@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:53 +02:00
Florian Westphal 74585d4f84 netfilter: core: remove erroneous warn_on
kernel test robot reported:

WARNING: CPU: 0 PID: 1244 at net/netfilter/core.c:218 __nf_hook_entries_try_shrink+0x49/0xcd
[..]

After allowing batching in nf_unregister_net_hooks its possible that an earlier
call to __nf_hook_entries_try_shrink already compacted the list.
If this happens we don't need to do anything.

Fixes: d3ad2c17b4 ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:52 +02:00
Florian Westphal 8073e960a0 netfilter: nat: use keyed locks
no need to serialize on a single lock, we can partition the table and
add/delete in parallel to different slots.
This restores one of the advantages that got lost with the rhlist
revert.

Cc: Ivan Babrou <ibobrik@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:52 +02:00
Florian Westphal e1bf168774 netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
This reverts commit 870190a9ec.

It was not a good idea. The custom hash table was a much better
fit for this purpose.

A fast lookup is not essential, in fact for most cases there is no lookup
at all because original tuple is not taken and can be used as-is.
What needs to be fast is insertion and deletion.

rhlist removal however requires a rhlist walk.
We can have thousands of entries in such a list if source port/addresses
are reused for multiple flows, if this happens removal requests are so
expensive that deletions of a few thousand flows can take several
seconds(!).

The advantages that we got from rhashtable are:
1) table auto-sizing
2) multiple locks

1) would be nice to have, but it is not essential as we have at
most one lookup per new flow, so even a million flows in the bysource
table are not a problem compared to current deletion cost.
2) is easy to add to custom hash table.

I tried to add hlist_node to rhlist to speed up rhltable_remove but this
isn't doable without changing semantics.  rhltable_remove_fast will
check that the to-be-deleted object is part of the table and that
requires a list walk that we want to avoid.

Furthermore, using hlist_node increases size of struct rhlist_head, which
in turn increases nf_conn size.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196821
Reported-by: Ivan Babrou <ibobrik@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:50 +02:00
Florian Westphal 75c2631468 netfilter: nf_nat: don't bug when mapping already exists
It seems preferrable to limp along if we have a conflicting mapping,
its certainly better than a BUG().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 18:55:26 +02:00
Xin Long 68913a018f netfilter: ipvs: do not create conn for ABORT packet in sctp_conn_schedule
There's no reason for ipvs to create a conn for an ABORT packet
even if sysctl_sloppy_sctp is set.

This patch is to accept it without creating a conn, just as ipvs
does for tcp's RST packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 13:40:23 +02:00
Xin Long 1cc4a01866 netfilter: ipvs: fix the issue that sctp_conn_schedule drops non-INIT packet
Commit 5e26b1b3ab ("ipvs: support scheduling inverse and icmp SCTP
packets") changed to check packet type early. It introduced a side
effect: if it's not a INIT packet, ports will be set as  NULL, and
the packet will be dropped later.

It caused that sctp couldn't create connection when ipvs module is
loaded and any scheduler is registered on server.

Li Shuang reproduced it by running the cmds on sctp server:
  # ipvsadm -A -t 1.1.1.1:80 -s rr
  # ipvsadm -D -t 1.1.1.1:80
then the server could't work any more.

This patch is to return 1 when it's not an INIT packet. It means ipvs
will accept it without creating a conn for it, just like what it does
for tcp.

Fixes: 5e26b1b3ab ("ipvs: support scheduling inverse and icmp SCTP packets")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08 13:40:02 +02:00
Linus Torvalds aae3dbb477 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
2017-09-06 14:45:08 -07:00
Pablo Neira Ayuso 9dee147412 netfilter: nf_tables: support for recursive chain deletion
This patch sorts out an asymmetry in deletions. Currently, table and set
deletion commands come with an implicit content flush on deletion.
However, chain deletion results in -EBUSY if there is content in this
chain, so no implicit flush happens. So you have to send a flush command
in first place to delete chains, this is inconsistent and it can be
annoying in terms of user experience.

This patch uses the new NLM_F_NONREC flag to request non-recursive chain
deletion, ie. if the chain to be removed contains rules, then this
returns EBUSY. This problem was discussed during the NFWS'17 in Faro,
Portugal. In iptables, you hit -EBUSY if you try to delete a chain that
contains rules, so you have to flush first before you can remove
anything. Since iptables-compat uses the nf_tables netlink interface, it
has to use the NLM_F_NONREC flag from userspace to retain the original
iptables semantics, ie.  bail out on removing chains that contain rules.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 17:34:55 +02:00
Pablo Neira Ayuso a827840080 netfilter: nf_tables: use NLM_F_NONREC for deletion requests
Bail out if user requests non-recursive deletion for tables and sets.
This new flags tells nf_tables netlink interface to reject deletions if
tables and sets have content.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 17:34:55 +02:00
Pablo Neira Ayuso 4035285fe0 netfilter: nf_tables: add nf_tables_addchain()
Wrap the chain addition path in a function to make it more maintainable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 17:34:54 +02:00
Pablo Neira Ayuso 2c4a488a48 netfilter: nf_tables: add nf_tables_updchain()
nf_tables_newchain() is too large, wrap the chain update path in a
function to make it more maintainable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 17:34:54 +02:00
Linus Torvalds 0081a0ce80 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnad:
 "The main RCU related changes in this cycle were:

   - Removal of spin_unlock_wait()
   - SRCU updates
   - RCU torture-test updates
   - RCU Documentation updates
   - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
   - Miscellaneous RCU fixes
   - CPU-hotplug fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
  arch: Remove spin_unlock_wait() arch-specific definitions
  locking: Remove spin_unlock_wait() generic definitions
  drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
  ipc: Replace spin_unlock_wait() with lock/unlock pair
  exit: Replace spin_unlock_wait() with lock/unlock pair
  completion: Replace spin_unlock_wait() with lock/unlock pair
  doc: Set down RCU's scheduling-clock-interrupt needs
  doc: No longer allowed to use rcu_dereference on non-pointers
  doc: Add RCU files to docbook-generation files
  doc: Update memory-barriers.txt for read-to-write dependencies
  doc: Update RCU documentation
  membarrier: Provide expedited private command
  rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
  rcu: Add warning to rcu_idle_enter() for irqs enabled
  rcu: Make rcu_idle_enter() rely on callers disabling irqs
  rcu: Add assertions verifying blocked-tasks list
  rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
  rcu: Add TPS() protection for _rcu_barrier_trace strings
  rcu: Use idle versions of swait to make idle-hack clear
  swait: Add idle variants which don't contribute to load average
  ...
2017-09-04 08:13:52 -07:00
Varsha Rao 44d6e2f273 net: Replace NF_CT_ASSERT() with WARN_ON().
This patch removes NF_CT_ASSERT() and instead uses WARN_ON().

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
2017-09-04 13:25:19 +02:00
Florian Westphal d1c1e39de8 netfilter: remove unused hooknum arg from packet functions
tested with allmodconfig build.

Signed-off-by: Florian Westphal <fw@strlen.de>
2017-09-04 13:25:18 +02:00
Pablo M. Bermudo Garay a691205571 netfilter: nft_limit: add stateful object type
Register a new limit stateful object type into the stateful object
infrastructure.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 13:25:16 +02:00
Pablo M. Bermudo Garay 6e32388756 netfilter: nft_limit: replace pkt_bytes with bytes
Just a small refactor patch in order to improve the code readability.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 13:25:15 +02:00
Pablo M. Bermudo Garay dfc46034b5 netfilter: nf_tables: add select_ops for stateful objects
This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.

This change is needed for upcoming additions to the stateful objects
infrastructure.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 13:25:09 +02:00
Vishwanath Pai bea74641e3 netfilter: xt_hashlimit: add rate match mode
This patch adds a new feature to hashlimit that allows matching on the
current packet/byte rate without rate limiting. This can be enabled
with a new flag --hashlimit-rate-match. The match returns true if the
current rate of packets is above/below the user specified value.

The main difference between the existing algorithm and the new one is
that the existing algorithm rate-limits the flow whereas the new
algorithm does not. Instead it *classifies* the flow based on whether
it is above or below a certain rate. I will demonstrate this with an
example below. Let us assume this rule:

iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain

If the packet rate is 15/s, the existing algorithm would ACCEPT 10
packets every second and send 5 packets to "new_chain".

But with the new algorithm, as long as the rate of 15/s is sustained,
all packets will continue to match and every packet is sent to new_chain.

This new functionality will let us classify different flows based on
their current rate, so that further decisions can be made on them based on
what the current rate is.

This is how the new algorithm works:
We divide time into intervals of 1 (sec/min/hour) as specified by
the user. We keep track of the number of packets/bytes processed in the
current interval. After each interval we reset the counter to 0.

When we receive a packet for match, we look at the packet rate
during the current interval and the previous interval to make a
decision:

if [ prev_rate < user and cur_rate < user ]
        return Below
else
        return Above

Where cur_rate is the number of packets/bytes seen in the current
interval, prev is the number of packets/bytes seen in the previous
interval and 'user' is the rate specified by the user.

We also provide flexibility to the user for choosing the time
interval using the option --hashilmit-interval. For example the user can
keep a low rate like x/hour but still keep the interval as small as 1
second.

To preserve backwards compatibility we have to add this feature in a new
revision, so I've created revision 3 for hashlimit. The two new options
we add are:

--hashlimit-rate-match
--hashlimit-rate-interval

I have updated the help text to add these new options. Also added a few
tests for the new options.

Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04 12:54:42 +02:00
David S. Miller b63f6044d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. Basically, updates to the conntrack core, enhancements for
nf_tables, conversion of netfilter hooks from linked list to array to
improve memory locality and asorted improvements for the Netfilter
codebase. More specifically, they are:

1) Add expection to hashes after timer initialization to prevent
   access from another CPU that walks on the hashes and calls
   del_timer(), from Florian Westphal.

2) Don't update nf_tables chain counters from hot path, this is only
   used by the x_tables compatibility layer.

3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
   Hooks are always guaranteed to run from rcu read side, so remove
   nested rcu_read_lock() where possible. Patch from Taehee Yoo.

4) nf_tables new ruleset generation notifications include PID and name
   of the process that has updated the ruleset, from Phil Sutter.

5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
   the nf_family netdev family. Patch from Pablo M. Bermudo.

6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

7) Use deferrable workqueue for conntrack garbage collection, to reduce
   power consumption, from Patch from Subash Abhinov Kasiviswanathan.

8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
   Westphal.

9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

10) Drop references on conntrack removal path when skbuffs has escaped via
    nfqueue, from Florian.

11) Don't queue packets to nfqueue with dying conntrack, from Florian.

12) Constify nf_hook_ops structure, from Florian.

13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter.

14) Add nla_strdup(), from Phil Sutter.

15) Rise nf_tables objects name size up to 255 chars, people want to use
    DNS names, so increase this according to what RFC 1035 specifies.
    Patch series from Phil Sutter.

16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
    registration on demand, suggested by Eric Dumazet, patch from Florian.

17) Remove unused variables in compat_copy_entry_from_user both in
    ip_tables and arp_tables code. Patch from Taehee Yoo.

18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

19) Constify nf_loginfo structure, also from Julia.

20) Use a single rb root in connlimit, from Taehee Yoo.

21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

22) Use audit_log() instead of open-coding it, from Geliang Tang.

23) Allow to mangle tcp options via nft_exthdr, from Florian.

24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
    a fix for a miscalculation of the minimal length.

25) Simplify branch logic in h323 helper, from Nick Desaulniers.

26) Calculate netlink attribute size for conntrack tuple at compile
    time, from Florian.

27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
    From Florian.

28) Remove holes in nf_conntrack_l4proto structure, so it becomes
    smaller. From Florian.

29) Get rid of print_tuple() indirection for /proc conntrack listing.
    Place all the code in net/netfilter/nf_conntrack_standalone.c.
    Patch from Florian.

30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
    off. From Florian.

31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
    Florian.

32) Fix broken indentation in ebtables extensions, from Colin Ian King.

33) Fix several harmless sparse warning, from Florian.

34) Convert netfilter hook infrastructure to use array for better memory
    locality, joint work done by Florian and Aaron Conole. Moreover, add
    some instrumentation to debug this.

35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
    per batch, from Florian.

36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

38) Remove unused code in the generic protocol tracker, from Davide
    Caratti.

I think I will have material for a second Netfilter batch in my queue if
time allow to make it fit in this merge window.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03 17:08:42 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Florian Westphal 1aff64715e netfilter: rt: account for tcp header size too
This needs to accout for the ipv4/ipv6 header size and the tcp
header without options.

Fixes: 6b5dc98e8f ("netfilter: rt: add support to fetch path mss")
Reported-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 18:14:30 +02:00
Davide Caratti d11dd6cdc3 netfilter: conntrack: remove unused code in nf_conntrack_proto_generic.c
L4 protocol helpers for DCCP, SCTP and UDPlite can't be built as kernel
modules anymore, so we can remove code enclosed in
 #ifdef CONFIG_NF_CT_PROTO_{DCCP,SCTP,UDPLITE}_MODULE

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 18:14:29 +02:00
Varsha Rao b38cf90133 netfilter: Remove NFDEBUG()
Remove NFDEBUG and use pr_debug() instead of it.

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 18:05:50 +02:00
Florian Westphal d3ad2c17b4 netfilter: core: batch nf_unregister_net_hooks synchronize_net calls
re-add batching in nf_unregister_net_hooks().

Similar as before, just store an array with to-be-free'd rule arrays
on stack, then call synchronize_net once per batch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 17:44:02 +02:00
Florian Westphal 2420b79f8c netfilter: debug: check for sorted array
Make sure our grow/shrink routine places them in the correct order.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 17:44:01 +02:00
Aaron Conole 960632ece6 netfilter: convert hook list to an array
This converts the storage and layout of netfilter hook entries from a
linked list to an array.  After this commit, hook entries will be
stored adjacent in memory.  The next pointer is no longer required.

The ops pointers are stored at the end of the array as they are only
used in the register/unregister path and in the legacy br_netfilter code.

nf_unregister_net_hooks() is slower than needed as it just calls
nf_unregister_net_hook in a loop (i.e. at least n synchronize_net()
calls), this will be addressed in followup patch.

Test setup:
 - ixgbe 10gbit
 - netperf UDP_STREAM, 64 byte packets
 - 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter):
empty mangle and raw prerouting, mangle and filter input hooks:
353.9
this patch:
364.2

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 17:44:00 +02:00
Florian Westphal 5fd02ebe65 netfilter: fix a few (harmless) sparse warnings
net/netfilter/nft_payload.c:187:18: warning: incorrect type in return expression (expected bool got restricted __sum16 [usertype] check)
net/netfilter/nft_exthdr.c:222:14: warning: cast to restricted __be32
net/netfilter/nft_rt.c:49:23: warning: incorrect type in assignment (different base types expected unsigned int got restricted __be32)
net/netfilter/nft_rt.c:70:25: warning: symbol 'nft_rt_policy' was not declared. Should it be static?

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-28 17:42:56 +02:00
David S. Miller af57d2b720 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix use after free of struct proc_dir_entry in ipt_CLUSTERIP, patch
   from Sabrina Dubroca.

2) Fix spurious EINVAL errors from iptables over nft compatibility layer.

3) Reload pointer to ip header only if there is non-terminal verdict,
   ie. XT_CONTINUE, otherwise invalid memory access may happen, patch
   from Taehee Yoo.

4) Fix interaction between SYNPROXY and NAT, SYNPROXY adds sequence
   adjustment already, however from nf_nat_setup() assumes there's not.
   Patch from Xin Long.

5) Fix burst arithmetics in nft_limit as Joe Stringer mentioned during
   NFWS in Faro. Patch from Andy Zhou.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 11:49:19 -07:00
Florian Westphal b3480fe059 netfilter: conntrack: make protocol tracker pointers const
Doesn't change generated code, but will make it easier to eventually
make the actual trackers themselvers const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:33 +02:00
Florian Westphal ea48cc83cf netfilter: conntrack: print_conntrack only needed if CONFIG_NF_CONNTRACK_PROCFS
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:33 +02:00
Florian Westphal 91950833dd netfilter: conntrack: place print_tuple in procfs part
CONFIG_NF_CONNTRACK_PROCFS is deprecated, no need to use a function
pointer in the trackers for this. Place the printf formatting in
the one place that uses it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal 09ec82f5af netfilter: conntrack: remove protocol name from l4proto struct
no need to waste storage for something that is only needed
in one place and can be deduced from protocol number.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal a3134d537f netfilter: conntrack: remove protocol name from l3proto struct
no need to waste storage for something that is only needed
in one place and can be deduced from protocol number.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
Florian Westphal 0d03510038 netfilter: conntrack: compute l3proto nla size at compile time
avoids a pointer and allows struct to be const later on.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 18:52:32 +02:00
andy zhou c26844eda9 netfilter: nf_tables: Fix nft limit burst handling
Current implementation treats the burst configuration the same as
rate configuration. This can cause the per packet cost to be lower
than configured. In effect, this bug causes the token bucket to be
refilled at a higher rate than what user has specified.

This patch changes the implementation so that the token bucket size
is controlled by "rate + burst", while maintain the token bucket
refill rate the same as user specified.

Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 16:23:17 +02:00
Xin Long ab6dd1beac netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info
Commit 4440a2ab3b ("netfilter: synproxy: Check oom when adding synproxy
and seqadj ct extensions") wanted to drop the packet when it fails to add
seqadj ext due to no memory by checking if nfct_seqadj_ext_add returns
NULL.

But that nfct_seqadj_ext_add returns NULL can also happen when seqadj ext
already exists in a nf_conn. It will cause that userspace protocol doesn't
work when both dnat and snat are configured.

Li Shuang found this issue in the case:

Topo:
   ftp client                   router                  ftp server
  10.167.131.2  <-> 10.167.131.254  10.167.141.254 <-> 10.167.141.1

Rules:
  # iptables -t nat -A PREROUTING -i eth1 -p tcp -m tcp --dport 21 -j \
    DNAT --to-destination 10.167.141.1
  # iptables -t nat -A POSTROUTING -o eth2 -p tcp -m tcp --dport 21 -j \
    SNAT --to-source 10.167.141.254

In router, when both dnat and snat are added, nf_nat_setup_info will be
called twice. The packet can be dropped at the 2nd time for DNAT due to
seqadj ext is already added at the 1st time for SNAT.

This patch is to fix it by checking for seqadj ext existence before adding
it, so that the packet will not be dropped if seqadj ext already exists.

Note that as Florian mentioned, as a long term, we should review ext_add()
behaviour, it's better to return a pointer to the existing ext instead.

Fixes: 4440a2ab3b ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions")
Reported-by: Li Shuang <shuali@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24 16:09:03 +02:00
Florian Westphal 6b5dc98e8f netfilter: rt: add support to fetch path mss
to be used in combination with tcp option set support to mimic
iptables TCPMSS --clamp-mss-to-pmtu.

v2: Eric Dumazet points out dst must be initialized.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:15:10 +02:00
Florian Westphal 99d1712bc4 netfilter: exthdr: tcp option set support
This allows setting 2 and 4 byte quantities in the tcp option space.
Main purpose is to allow native replacement for xt_TCPMSS to
work around pmtu blackholes.

Writes to kind and len are now allowed at the moment, it does not seem
useful to do this as it causes corruption of the tcp option space.

We can always lift this restriction later if a use-case appears.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:15:10 +02:00
Florian Westphal 5e7d695a48 netfilter: exthdr: split netlink dump function
so eval and uncoming eval_set versions can reuse a common helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:15:10 +02:00
Florian Westphal a18177008b netfilter: exthdr: factor out tcp option access
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:15:10 +02:00
Geliang Tang 46b20c38f3 netfilter: use audit_log()
Use audit_log() instead of open-coding it.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:09:31 +02:00
Taehee Yoo 166327d79d netfilter: remove prototype of netfilter_queue_init
The netfilter_queue_init() has been removed.
so we can remove the prototype of that.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:08:25 +02:00
Taehee Yoo a2acc54340 netfilter: connlimit: merge root4 and root6.
The root4 variable is used only when connlimit extension module has been
stored by the iptables command. and the roo6 variable is used only when
connlimit extension module has been stored by the ip6tables command.
So the root4 and roo6 variable does not be used at the same time.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-19 13:07:53 +02:00
David Ahern 4297a0ef08 net: ipv6: add second dif to inet6 socket lookups
Add a second device index, sdif, to inet6 socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
tcp_v4_sdif is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:22 -07:00
David Ahern 3fa6f616a7 net: ipv4: add second dif to inet socket lookups
Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in  tcp_v4_rcv the tcp_v4_sdif helper is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:21 -07:00
Julia Lawall 549d2d41c1 netfilter: constify nf_loginfo structures
The nf_loginfo structures are only passed as the seventh argument to
nf_log_trace, which is declared as const or stored in a local const
variable.  Thus the nf_loginfo structures themselves can be const.

Done with the help of Coccinelle.

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct nf_loginfo i@p = { ... };

@ok1@
identifier r.i;
expression list[6] es;
position p;
@@
 nf_log_trace(es,&i@p,...)

@ok2@
identifier r.i;
const struct nf_loginfo *e;
position p;
@@
 e = &i@p

@bad@
position p != {r.p,ok1.p,ok2.p};
identifier r.i;
struct nf_loginfo e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct nf_loginfo i = { ... };
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-02 14:25:59 +02:00
Julia Lawall 2a04aabf5c netfilter: constify nf_conntrack_l3/4proto parameters
When a nf_conntrack_l3/4proto parameter is not on the left hand side
of an assignment, its address is not taken, and it is not passed to a
function that may modify its fields, then it can be declared as const.

This change is useful from a documentation point of view, and can
possibly facilitate making some nf_conntrack_l3/4proto structures const
subsequently.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-02 14:25:57 +02:00
Florian Westphal 4d3a57f23d netfilter: conntrack: do not enable connection tracking unless needed
Discussion during NFWS 2017 in Faro has shown that the current
conntrack behaviour is unreasonable.

Even if conntrack module is loaded on behalf of a single net namespace,
its turned on for all namespaces, which is expensive.  Commit
481fa37347 ("netfilter: conntrack: add nf_conntrack_default_on sysctl")
attempted to provide an alternative to the 'default on' behaviour by
adding a sysctl to change it.

However, as Eric points out, the sysctl only becomes available
once the module is loaded, and then its too late.

So we either have to move the sysctl to the core, or, alternatively,
change conntrack to become active only once the rule set requires this.

This does the latter, conntrack is only enabled when a rule needs it.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:42:00 +02:00
Florian Westphal 9b7e26aee7 netfilter: nft_set_rbtree: use seqcount to avoid lock in most cases
switch to lockless lockup. write side now also increments sequence
counter.  On lookup, sample counter value and only take the lock
if we did not find a match and the counter has changed.

This avoids need to write to private area in normal (lookup) cases.

In case we detect a writer (seqretry is true) we fall back to taking
the readlock.

The readlock is also used during dumps to ensure we get a consistent
tree walk.

Similar technique (rbtree+seqlock) was used by David Howells in rxrpc.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:41:59 +02:00
Phil Sutter 6150957521 netfilter: nf_tables: Allow object names of up to 255 chars
Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:41:59 +02:00
Phil Sutter 387454901b netfilter: nf_tables: Allow set names of up to 255 chars
Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:41:58 +02:00
Phil Sutter b7263e071a netfilter: nf_tables: Allow chain name of up to 255 chars
Same conversion as for table names, use NFT_NAME_MAXLEN as upper
boundary as well.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:41:57 +02:00
Phil Sutter e46abbcc05 netfilter: nf_tables: Allow table names of up to 255 chars
Allocate all table names dynamically to allow for arbitrary lengths but
introduce NFT_NAME_MAXLEN as an upper sanity boundary. It's value was
chosen to allow using a domain name as per RFC 1035.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 20:41:57 +02:00
Phil Sutter 6e692678d7 netfilter: nf_tables: No need to check chain existence when tracing
nft_trace_notify() is called only from __nft_trace_packet(), which
assigns its parameter 'chain' to info->chain. __nft_trace_packet() in
turn later dereferences 'chain' unconditionally, which indicates that
it's never NULL. Same does nft_do_chain(), the only user of the tracing
infrastructure. Hence it is safe to assume the check removed here is not
needed.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:14:05 +02:00
Florian Westphal 591bb2789b netfilter: nf_hook_ops structs can be const
We no longer place these on a list so they can be const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:10:44 +02:00
Florian Westphal 5da773a3e8 netfilter: nfnetlink_queue: don't queue dying conntracks to userspace
When skb is queued to userspace it leaves softirq/rcu protection.
skb->nfct (via conntrack extensions such as helper) could then reference
modules that no longer exist if the conntrack was not yet confirmed.

nf_ct_iterate_destroy() will set the DYING bit for unconfirmed
conntracks, we therefore solve this race as follows:

1. take the queue spinlock.
2. check if the conntrack is unconfirmed and has dying bit set.
   In this case, we must discard skb while we're still inside
   rcu read-side section.
3. If nf_ct_iterate_destroy() is called right after the packet is queued
   to userspace, it will be removed from the queue via
   nf_ct_iterate_destroy -> nf_queue_nf_hook_drop.

When userspace sends the verdict (nfnetlink takes rcu read lock), there
are two cases to consider:

1. nf_ct_iterate_destroy() was called while packet was out.
   In this case, skb will have been removed from the queue already
   and no reinject takes place as we won't find a matching entry for the
   packet id.

2. nf_ct_iterate_destroy() gets called right after verdict callback
   found and removed the skb from queue list.

   In this case, skb->nfct is marked as dying but it is still valid.
   The skb will be dropped either in nf_conntrack_confirm (we don't
   insert DYING conntracks into hash table) or when we try to queue
   the skb again, but either events don't occur before the rcu read lock
   is dropped.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:09:39 +02:00
Florian Westphal e2a750070a netfilter: conntrack: destroy functions need to free queued packets
queued skbs might be using conntrack extensions that are being removed,
such as timeout.  This happens for skbs that have a skb->nfct in
unconfirmed state (i.e., not in hash table yet).

This is destructive, but there are only two use cases:
 - module removal (rare)
 - netns cleanup (most likely no conntracks exist, and if they do,
   they are removed anyway later on).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:09:39 +02:00
Florian Westphal 84657984c2 netfilter: add and use nf_ct_unconfirmed_destroy
This also removes __nf_ct_unconfirmed_destroy() call from
nf_ct_iterate_cleanup_net, so that function can be used only
when missing conntracks from unconfirmed list isn't a problem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:09:39 +02:00
Florian Westphal ac7b848390 netfilter: expect: add and use nf_ct_expect_iterate helpers
We have several spots that open-code a expect walk, add a helper
that is similar to nf_ct_iterate_destroy/nf_ct_iterate_cleanup.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:09:38 +02:00
subashab@codeaurora.org a232cd0e0c netfilter: conntrack: Change to deferable work queue
Delayed workqueue causes wakeups to idle CPUs. This was
causing a power impact for devices. Use deferable work
queue instead so that gc_worker runs when CPU is active only.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:03:50 +02:00
Pablo M. Bermudo Garay 6392c22603 netfilter: nf_tables: add fib expression to the netdev family
Add fib expression support for netdev family. Like inet family, netdev
delegates the actual decision to the corresponding backend, either ipv4
or ipv6.

This allows to perform very early reverse path filtering, among other
things.

You can find more information about fib expression in the f6d0cbcf09
("<netfilter: nf_tables: add fib expression>") commit message.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:01:40 +02:00
stephen hemminger 3754b87a4e netfilter: remove unused variable
warning: ‘recent_old_fops’ defined but not used

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-25 12:31:37 -07:00
Manfred Spraul 3ef0c7a730 net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()
As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
    pair.

Fixes: b16c29191d ("netfilter: nf_conntrack: use safer way to lock all buckets")
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-07-25 10:08:58 -07:00
Phil Sutter 784b4e612d netfilter: nf_tables: Attach process info to NFT_MSG_NEWGEN notifications
This is helpful for 'nft monitor' to track which process caused a given
change to the ruleset.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-24 13:25:07 +02:00
Taehee Yoo 0b35f6031a netfilter: Remove duplicated rcu_read_lock.
This patch removes duplicate rcu_read_lock().

1. IPVS part:

According to Julian Anastasov's mention, contexts of ipvs are described
at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary:

 - packet RX/TX: does not need locks because packets come from hooks.
 - sync msg RX: backup server uses RCU locks while registering new
   connections.
 - ip_vs_ctl.c: configuration get/set, RCU locks needed.
 - xt_ipvs.c: It is a netfilter match, running from hook context.

As result, rcu_read_lock and rcu_read_unlock can be removed from:

 - ip_vs_core.c: all
 - ip_vs_ctl.c:
   - only from ip_vs_has_real_service
 - ip_vs_ftp.c: all
 - ip_vs_proto_sctp.c: all
 - ip_vs_proto_tcp.c: all
 - ip_vs_proto_udp.c: all
 - ip_vs_xmit.c: all (contains only packet processing)

2. Netfilter part:

There are three types of functions that are guaranteed the rcu_read_lock().
First, as result, functions are only called by nf_hook():

 - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
 - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
 - match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
 - xt_osf_match_packet().

Second, functions that caller already held the rcu_read_lock().
 - destroy_conntrack(), ctnetlink_conntrack_event().
 - ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

Third, functions that are mixed with type1 and type2.

These functions are called by nf_hook() also these are called by
ordinary functions that already held the rcu_read_lock():

 - __ctnetlink_glue_build(), ctnetlink_expect_event().
 - ctnetlink_proto_size().

Applied files are below:

- nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
- nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
- nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
- xt_connlimit.c, xt_hashlimit.c, xt_osf.c

Detailed calltrace can be found at:
http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-24 13:24:46 +02:00
Pablo Neira Ayuso 9f08ea8481 netfilter: nf_tables: keep chain counters away from hot path
These chain counters are only used by the iptables-compat tool, that
allow users to use the x_tables extensions from the existing nf_tables
framework. This patch makes nf_tables by ~5% for the general usecase,
ie. native nft users, where no chain counters are used at all.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-24 12:23:16 +02:00
Florian Westphal 56a97e701c netfilter: expect: add to hash table after expect init
assuming we have lockless readers we should make sure they can only
see expectations that have already been initialized.

hlist_add_head_rcu acts as memory barrier, move it after timer setup.

Theoretically we could crash due to a del_timer() on other cpu
seeing garbage data.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-24 12:20:10 +02:00
Linus Torvalds 96080f6977 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) BPF verifier signed/unsigned value tracking fix, from Daniel
    Borkmann, Edward Cree, and Josef Bacik.

 2) Fix memory allocation length when setting up calls to
    ->ndo_set_mac_address, from Cong Wang.

 3) Add a new cxgb4 device ID, from Ganesh Goudar.

 4) Fix FIB refcount handling, we have to set it's initial value before
    the configure callback (which can bump it). From David Ahern.

 5) Fix double-free in qcom/emac driver, from Timur Tabi.

 6) A bunch of gcc-7 string format overflow warning fixes from Arnd
    Bergmann.

 7) Fix link level headroom tests in ip_do_fragment(), from Vasily
    Averin.

 8) Fix chunk walking in SCTP when iterating over error and parameter
    headers. From Alexander Potapenko.

 9) TCP BBR congestion control fixes from Neal Cardwell.

10) Fix SKB fragment handling in bcmgenet driver, from Doug Berger.

11) BPF_CGROUP_RUN_PROG_SOCK_OPS needs to check for null __sk, from Cong
    Wang.

12) xmit_recursion in ppp driver needs to be per-device not per-cpu,
    from Gao Feng.

13) Cannot release skb->dst in UDP if IP options processing needs it.
    From Paolo Abeni.

14) Some netdev ioctl ifr_name[] NULL termination fixes. From Alexander
    Levin and myself.

15) Revert some rtnetlink notification changes that are causing
    regressions, from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
  net: bonding: Fix transmit load balancing in balance-alb mode
  rds: Make sure updates to cp_send_gen can be observed
  net: ethernet: ti: cpsw: Push the request_irq function to the end of probe
  ipv4: initialize fib_trie prior to register_netdev_notifier call.
  rtnetlink: allocate more memory for dev_set_mac_address()
  net: dsa: b53: Add missing ARL entries for BCM53125
  bpf: more tests for mixed signed and unsigned bounds checks
  bpf: add test for mixed signed and unsigned bounds checks
  bpf: fix up test cases with mixed signed/unsigned bounds
  bpf: allow to specify log level and reduce it for test_verifier
  bpf: fix mixed signed/unsigned derived min/max value bounds
  ipv6: avoid overflow of offset in ip6_find_1stfragopt
  net: tehuti: don't process data if it has not been copied from userspace
  Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"
  net: dsa: mv88e6xxx: Enable CMODE config support for 6390X
  dt-binding: ptp: Add SoC compatibility strings for dte ptp clock
  NET: dwmac: Make dwmac reset unconditional
  net: Zero terminate ifr_name in dev_ifname().
  wireless: wext: terminate ifr name coming from userspace
  netfilter: fix netfilter_net_init() return
  ...
2017-07-20 16:33:39 -07:00
Pablo Neira Ayuso f7fb77fc12 netfilter: nft_compat: check extension hook mask only if set
If the x_tables extension comes with no hook mask, skip this validation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-19 11:53:30 +02:00
Dan Carpenter 073dd5ad34 netfilter: fix netfilter_net_init() return
We accidentally return an uninitialized variable.

Fixes: cf56c2f892 ("netfilter: remove old pre-netns era hook api")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-18 14:50:28 -07:00
Florian Westphal 36ac344e16 netfilter: expect: fix crash when putting uninited expectation
We crash in __nf_ct_expect_check, it calls nf_ct_remove_expect on the
uninitialised expectation instead of existing one, so del_timer chokes
on random memory address.

Fixes: ec0e3f0111 ("netfilter: nf_ct_expect: Add nf_ct_remove_expect()")
Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
Tested-by: Sergey Kvachonok <ravenexp@gmail.com>
Cc: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:03:12 +02:00
Florian Westphal 97772bcd56 netfilter: nat: fix src map lookup
When doing initial conversion to rhashtable I replaced the bucket
walk with a single rhashtable_lookup_fast().

When moving to rhlist I failed to properly walk the list of identical
tuples, but that is what is needed for this to work correctly.
The table contains the original tuples, so the reply tuples are all
distinct.

We currently decide that mapping is (not) in range only based on the
first entry, but in case its not we need to try the reply tuple of the
next entry until we either find an in-range mapping or we checked
all the entries.

This bug makes nat core attempt collision resolution while it might be
able to use the mapping as-is.

Fixes: 870190a9ec ("netfilter: nat: convert nat bysrc hash to rhashtable")
Reported-by: Jaco Kroon <jaco@uls.co.za>
Tested-by: Jaco Kroon <jaco@uls.co.za>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:02:19 +02:00
Florian Westphal cf56c2f892 netfilter: remove old pre-netns era hook api
no more users in the tree, remove this.

The old api is racy wrt. module removal, all users have been converted
to the netns-aware api.

The old api pretended we still have global hooks but that has not been
true for a long time.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:01:10 +02:00
Mateusz Jurczyk f55ce7b024 netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv
Verify that the length of the socket buffer is sufficient to cover the
nlmsghdr structure before accessing the nlh->nlmsg_len field for further
input sanitization. If the client only supplies 1-3 bytes of data in
sk_buff, then nlh->nlmsg_len remains partially uninitialized and
contains leftover memory from the corresponding kernel allocation.
Operating on such data may result in indeterminate evaluation of the
nlmsg_len < NLMSG_HDRLEN expression.

The bug was discovered by a runtime instrumentation designed to detect
use of uninitialized memory in the kernel. The patch prevents this and
other similar tools (e.g. KMSAN) from flagging this behavior in the future.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 13:27:46 +02:00
Michal Hocko eacd86ca3b net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()
xt_alloc_table_info() basically opencodes kvmalloc() so use the library
function instead.

Link: http://lkml.kernel.org/r/20170531155145.17111-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:02 -07:00
David S. Miller c644bd79c0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains two Netfilter fixes for your net tree,
they are:

1) Fix memleak from netns release path of conntrack protocol trackers,
   patch from Liping Zhang.

2) Uninitialized flags field in ebt_log, that results in unpredictable
   logging format in ebtables, also from Liping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-06 14:02:22 +01:00
Xin Long 4ae70c0845 sctp: remove the typedef sctp_inithdr_t
This patch is to remove the typedef sctp_inithdr_t, and replace
with struct sctp_inithdr in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:42 -07:00
Xin Long 922dbc5be2 sctp: remove the typedef sctp_chunkhdr_t
This patch is to remove the typedef sctp_chunkhdr_t, and replace
with struct sctp_chunkhdr in the places where it's using this
typedef.

It is also to fix some indents and use sizeof(variable) instead
of sizeof(type)., especially in sctp_new.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Xin Long ae146d9b76 sctp: remove the typedef sctp_sctphdr_t
This patch is to remove the typedef sctp_sctphdr_t, and replace
with struct sctphdr in the places where it's using this typedef.

It is also to fix some indents and use sizeof(variable) instead
of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 09:08:41 -07:00
Reshetova, Elena 41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
David S. Miller 52a623bd61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:

X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
   dying bit to let the CPU release them.

X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
   conntrack from all namespace.

X) Restart iteration on hashtable resizing, since both may occur at
   the same time.

X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
   mapping on module removal.

X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
   module removal, from Liping Zhang.

X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
   if user requests this, also from Liping.

X) Add net_ns_barrier() and use it from FTP helper, so make sure
   no concurrent namespace removal happens at the same time while
   the helper module is being removed.

X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
   module size. Same thing in nf_tables.

Updates for the nf_tables infrastructure:

X) Prepare usage of the extended ACK reporting infrastructure for
   nf_tables.

X) Remove unnecessary forward declaration in nf_tables hash set.

X) Skip set size estimation if number of element is not specified.

X) Changes to accomodate a (faster) unresizable hash set implementation,
   for anonymous sets and dynamic size fixed sets with no timeouts.

X) Faster lookup function for unresizable hash table for 2 and 4
   bytes key.

And, finally, a bunch of asorted small updates and cleanups:

X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
   to device events and look up for index from the packet path, this
   is fixing an issue that is present since the very beginning, patch
   from Xin Long.

X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.

X) Use ebt_invalid_target() whenever possible in the ebtables tree,
   from Gao Feng.

X) Calm down compilation warning in nf_dup infrastructure, patch from
   stephen hemminger.

X) Statify functions in nftables rt expression, also from stephen.

X) Update Makefile to use canonical method to specify nf_tables-objs.
   From Jike Song.

X) Use nf_conntrack_helpers_register() in amanda and H323.

X) Space cleanup for ctnetlink, from linzhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 06:27:09 -07:00
Liping Zhang deaa0a976b netfilter: nf_ct_dccp/sctp: fix memory leak after netns cleanup
After running the following commands for a while, kmemleak reported that
"1879 new suspected memory leaks" happened:
  # while : ; do
  ip netns add test
  ip netns delete test
  done

  unreferenced object 0xffff88006342fa38 (size 1024):
  comm "ip", pid 15477, jiffies 4295982857 (age 957.836s)
  hex dump (first 32 bytes):
    b8 b0 4d a0 ff ff ff ff c0 34 c3 59 00 88 ff ff  ..M......4.Y....
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300
    [<ffffffff812302d0>] kmemdup+0x20/0x50
    [<ffffffffa04d598a>] dccp_init_net+0x8a/0x160 [nf_conntrack]
    [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90
  ...
  unreferenced object 0xffff88006342da58 (size 1024):
  comm "ip", pid 15477, jiffies 4295982857 (age 957.836s)
  hex dump (first 32 bytes):
    10 b3 4d a0 ff ff ff ff 04 35 c3 59 00 88 ff ff  ..M......5.Y....
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8190510a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81284130>] __kmalloc_track_caller+0x150/0x300
    [<ffffffff812302d0>] kmemdup+0x20/0x50
    [<ffffffffa04d6a9d>] sctp_init_net+0x5d/0x130 [nf_conntrack]
    [<ffffffffa04cf9f5>] nf_ct_l4proto_pernet_register_one+0x25/0x90
  ...

This is because we forgot to implement the get_net_proto for sctp and
dccp, so we won't invoke the nf_ct_unregister_sysctl to free the
ctl_table when do netns cleanup. Also note, we will fail to register
the sysctl for dccp/sctp either due to the lack of get_net_proto.

Fixes: c51d39010a ("netfilter: conntrack: built-in support for DCCP")
Fixes: a85406afeb ("netfilter: conntrack: built-in support for SCTP")
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-29 18:47:01 +02:00
Pablo Neira Ayuso 04ba724b65 netfilter: nfnetlink: extended ACK reporting
Pass down struct netlink_ext_ack as parameter to all of our nfnetlink
subsystem callbacks, so we can work on follow up patches to provide
finer grain error reporting using the new infrastructure that
2d4bc93368 ("netlink: extended ACK reporting") provides.

No functional change, just pass down this new object to callbacks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:38:24 +02:00
Florian Westphal d8297d4f3e netfilter: nf_tables: reduce chain type table size
text  data  bss     dec    hex filename
old: 151590  2240 1152  154982  25d66 net/netfilter/nf_tables_api.o
new: 151666  2240  416  154322  25ad2 net/netfilter/nf_tables_api.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:20:59 +02:00
Florian Westphal b7b5fda468 netfilter: conntrack: use NFPROTO_MAX to size array
We don't support anything larger than NFPROTO_MAX, so we can shrink this a bit:

     text data  dec  hex filename
old: 8259 1096 9355 248b net/netfilter/nf_conntrack_proto.o
new: 8259  624 8883 22b3 net/netfilter/nf_conntrack_proto.o

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:20:49 +02:00
Liping Zhang d53e3fc390 netfilter: use nf_conntrack_helpers_register when possible
amanda_helper, nf_conntrack_helper_ras and nf_conntrack_helper_q931 are
all arrays, so we can use nf_conntrack_helpers_register to register
the ct helper, this will help us to eliminate some "goto errX"
statements.

Also introduce h323_helper_init/exit helper function to register the ct
helpers, this is prepared for the followup patch, which will add net
namespace support for ct helper.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:13:21 +02:00
Jike Song 2becbbc547 netfilter, kbuild: use canonical method to specify objs.
Should use ":=" instead of "+=".

Signed-off-by: Jike Song <jike.song@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:20 +02:00
Florian Westphal 7866cc57b5 netns: add and use net_ns_barrier
Quoting Joe Stringer:
  If a user loads nf_conntrack_ftp, sends FTP traffic through a network
  namespace, destroys that namespace then unloads the FTP helper module,
  then the kernel will crash.

Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()

but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.

6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:19 +02:00
Florian Westphal 2c41f33c1b netfilter: move table iteration out of netns exit paths
We only need to iterate & remove in case of module removal;
for netns destruction all conntracks will be removed anyway.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-06-19 19:09:19 +02:00
Johannes Berg 4df864c1d9 networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
David S. Miller 216fe8f021 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just some simple overlapping changes in marvell PHY driver
and the DSA core code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 22:20:08 -04:00
Liping Zhang 34158151d2 netfilter: cttimeout: use nf_ct_iterate_cleanup_net to unlink timeout objs
Similar to nf_conntrack_helper, we can use nf_ct_iterare_cleanup_net to
remove these copy & paste code.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:24 +02:00
Liping Zhang ff1acc4964 netfilter: nf_ct_helper: use nf_ct_iterate_destroy to unlink helper objs
When we unlink the helper objects, we will iterate the nf_conntrack_hash,
iterate the unconfirmed list, handle the hash resize situation, etc.

Actually this logic is same as the nf_ct_iterate_destroy, so we can use
it to remove these copy & paste code.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:23 +02:00
Pablo Neira Ayuso 446a8268b7 netfilter: nft_set_hash: add lookup variant for fixed size hashtable
This patch provides a faster variant of the lookup function for 2 and 4
byte keys. Optimizing the one byte case is not worth, as the set backend
selection will always select the bitmap set type for such case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:22 +02:00
Pablo Neira Ayuso 6c03ae210c netfilter: nft_set_hash: add non-resizable hashtable implementation
This patch adds a simple non-resizable hashtable implementation. If the
user specifies the set size, then this new faster hashtable flavour is
selected.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:21 +02:00
Pablo Neira Ayuso 1ff75a3e9a netfilter: nf_tables: allow large allocations for new sets
The new fixed size hashtable backend implementation may result in a
large array of buckets that would spew splats from mm. Update this code
to fall back on vmalloc in case the memory allocation order is too
costly.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:20 +02:00
Pablo Neira Ayuso 2111515abc netfilter: nft_set_hash: add nft_hash_buckets()
Add nft_hash_buckets() helper function to calculate the number of
hashtable buckets based on the elements. This function can be reused
from the follow up patch to add non-resizable hashtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:19 +02:00
Pablo Neira Ayuso 347b408d59 netfilter: nf_tables: pass set description to ->privsize
The new non-resizable hashtable variant needs this to calculate the
size of the bucket array.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:18 +02:00
Pablo Neira Ayuso 2b664957c2 netfilter: nf_tables: select set backend flavour depending on description
This patch adds the infrastructure to support several implementations of
the same set type. This selection will be based on the set description
and the features available for this set. This allow us to select set
backend implementation that will result in better performance numbers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:17 +02:00
Pablo Neira Ayuso 5fc6ced958 netfilter: nft_set_hash: use nft_rhash prefix for resizable set backend
This patch prepares the introduction of a non-resizable hashtable
implementation that is significantly faster.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:16 +02:00
Pablo Neira Ayuso 080ed636a5 netfilter: nf_tables: no size estimation if number of set elements is unknown
This size estimation is ignored by the existing set backend selection
logic, since this estimation structure is stack allocated, set this to
~0 to make it easier to catch bugs in future changes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:15 +02:00
Pablo Neira Ayuso 187388bc3d netfilter: nft_set_hash: unnecessary forward declaration
Replace struct rhashtable_params forward declaration by the structure
definition itself.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:14 +02:00
Florian Westphal 8f23f35f1e netfilter: nat: destroy nat mappings on module exit path only
We don't need pernetns cleanup anymore.  If the netns is being
destroyed, conntrack netns exit will kill all entries in this namespace,
and neither conntrack hash table nor bysource hash are per namespace.

For the rmmod case, we have to make sure we remove all entries from the
nat bysource table, so call the new nf_ct_iterate_destroy in module exit
path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:13 +02:00
Florian Westphal 0d02d5646e netfilter: conntrack: restart iteration on resize
We could some conntracks when a resize occurs in parallel.

Avoid this by sampling generation seqcnt and doing a restart if needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:11 +02:00
Florian Westphal 2843fb6998 netfilter: conntrack: add nf_ct_iterate_destroy
sledgehammer to be used on module unload (to remove affected conntracks
from all namespaces).

It will also flag all unconfirmed conntracks as dying, i.e. they will
not be committed to main table.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:10 +02:00
Florian Westphal b0feacaad1 netfilter: conntrack: don't call iter for non-confirmed conntracks
nf_ct_iterate_cleanup_net currently calls iter() callback also for
conntracks on the unconfirmed list, but this is unsafe.

Acesses to nf_conn are fine, but some users access the extension area
in the iter() callback, but that does only work reliably for confirmed
conntracks (ct->ext can be reallocated at any time for unconfirmed
conntrack).

The seond issue is that there is a short window where a conntrack entry
is neither on the list nor in the table: To confirm an entry, it is first
removed from the unconfirmed list, then insert into the table.

Fix this by iterating the unconfirmed list first and marking all entries
as dying, then wait for rcu grace period.

This makes sure all entries that were about to be confirmed either are
in the main table, or will be dropped soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:09 +02:00
Florian Westphal 9fd6452d67 netfilter: conntrack: rename nf_ct_iterate_cleanup
There are several places where we needlesly call nf_ct_iterate_cleanup,
we should instead iterate the full table at module unload time.

This is a leftover from back when the conntrack table got duplicated
per net namespace.

So rename nf_ct_iterate_cleanup to nf_ct_iterate_cleanup_net.
A later patch will then add a non-net variant.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:46:08 +02:00
stephen hemminger cad4394453 netfilter: nft_rt: make local functions static
Resolves warnings:
net/netfilter/nft_rt.c:26:6: warning: no previous prototype for ‘nft_rt_get_eval’ [-Wmissing-prototypes]
net/netfilter/nft_rt.c:75:5: warning: no previous prototype for ‘nft_rt_get_init’ [-Wmissing-prototypes]
net/netfilter/nft_rt.c:106:5: warning: no previous prototype for ‘nft_rt_get_dump’ [-Wmissing-prototypes]

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 12:45:59 +02:00
stephen hemminger a32770b1e7 netfilter: dup: resolve warnings about missing prototypes
Missing include file causes:

net/netfilter/nf_dup_netdev.c:26:6: warning: no previous prototype for ‘nf_fwd_netdev_egress’ [-Wmissing-prototypes]
net/netfilter/nf_dup_netdev.c:40:6: warning: no previous prototype for ‘nf_dup_netdev_egress’ [-Wmissing-prototypes]

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 11:32:36 +02:00
linzhang 04b80ceadc netfilter: ctnetlink: delete extra spaces
This patch cleans up extra spaces.

Signed-off-by: linzhang <xiaolou4617@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-29 11:32:29 +02:00
Liping Zhang fefa92679d netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize
If nf_conntrack_htable_size was adjusted by the user during the ct
dump operation, we may invoke nf_ct_put twice for the same ct, i.e.
the "last" ct. This will cause the ct will be freed but still linked
in hash buckets.

It's very easy to reproduce the problem by the following commands:
  # while : ; do
  echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
  done
  # while : ; do
  conntrack -L
  done
  # iperf -s 127.0.0.1 &
  # iperf -c 127.0.0.1 -P 60 -t 36000

After a while, the system will hang like this:
  NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bash:20184]
  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [iperf:20382]
  ...

So at last if we find cb->args[1] is equal to "last", this means hash
resize happened, then we can set cb->args[1] to 0 to fix the above
issue.

Fixes: d205dc4079 ("[NETFILTER]: ctnetlink: fix deadlock in table dumping")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-24 11:26:01 +02:00
Liping Zhang 124dffea9e netfilter: nat: use atomic bit op to clear the _SRC_NAT_DONE_BIT
We need to clear the IPS_SRC_NAT_DONE_BIT to indicate that the ct has
been removed from nat_bysource table. But unfortunately, we use the
non-atomic bit operation: "ct->status &= ~IPS_NAT_DONE_MASK". So
there's a race condition that we may clear the _DYING_BIT set by
another CPU unexpectedly.

Since we don't care about the IPS_DST_NAT_DONE_BIT, so just using
clear_bit to clear the IPS_SRC_NAT_DONE_BIT is enough.

Also note, this is the last user which use the non-atomic bit operation
to update the confirmed ct->status.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-23 22:54:51 +02:00
Pablo Neira Ayuso d2df92e98a netfilter: nft_set_rbtree: handle element re-addition after deletion
The existing code selects no next branch to be inspected when
re-inserting an inactive element into the rb-tree, looping endlessly.
This patch restricts the check for active elements to the EEXIST case
only.

Fixes: e701001e7c ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates")
Reported-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Tested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-23 22:54:14 +02:00
Davide Caratti f3c0eb05e2 netfilter: conntrack: fix false CRC32c mismatch using paged skb
sctp_compute_cksum() implementation assumes that at least the SCTP header
is in the linear part of skb: modify conntrack error callback to avoid
false CRC32c mismatch, if the transport header is partially/entirely paged.

Fixes: cf6e007eef ("netfilter: conntrack: validate SCTP crc32c in PREROUTING")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-23 22:54:14 +02:00
David S. Miller 218b6a5b23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-22 23:32:48 -04:00
David S. Miller 23416e2304 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) When using IPVS in direct-routing mode, normal traffic from the LVS
   host to a back-end server is sometimes incorrectly NATed on the way
   back into the LVS host. Patch to fix this from Julian Anastasov.

2) Calm down clang compilation warning in ctnetlink due to type
   mismatch, from Matthias Kaehlcke.

3) Do not re-setup NAT for conntracks that are already confirmed, this
   is fixing a problem that was introduced in the previous nf-next batch.
   Patch from Liping Zhang.

4) Do not allow conntrack helper removal from userspace cthelper
   infrastructure if already in used. This comes with an initial patch
   to introduce nf_conntrack_helper_put() that is required by this fix.
   From Liping Zhang.

5) Zero the pad when copying data to userspace, otherwise iptables fails
   to remove rules. This is a follow up on the patchset that sorts out
   the internal match/target structure pointer leak to userspace. Patch
   from the same author, Willem de Bruijn. This also comes with a build
   failure when CONFIG_COMPAT is not on, coming in the last patch of
   this series.

6) SYNPROXY crashes with conntrack entries that are created via
   ctnetlink, more specifically via conntrackd state sync. Patch from
   Eric Leblond.

7) RCU safe iteration on set element dumping in nf_tables, from
   Liping Zhang.

8) Missing sanitization of immediate date for the bitwise and cmp
   expressions in nf_tables.

9) Refcounting logic for chain and objects from set elements does not
   integrate into the nf_tables 2-phase commit protocol.

10) Missing sanitization of target verdict in ebtables arpreply target,
    from Gao Feng.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:00:02 -04:00
Willem de Bruijn 751a9c7638 netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT
The patch in the Fixes references COMPAT_XT_ALIGN in the definition
of XT_DATA_TO_USER, outside an #ifdef CONFIG_COMPAT block.

Split XT_DATA_TO_USER into separate compat and non compat variants and
define the first inside an CONFIG_COMPAT block.

This simplifies both variants by removing branches inside the macro.

Fixes: 324318f024 ("netfilter: xtables: zero padding in data_to_user")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-18 13:10:03 +02:00
Eric Dumazet 9a568de481 tcp: switch TCP TS option (RFC 7323) to 1ms clock
TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Pablo Neira Ayuso 591054469b netfilter: nf_tables: revisit chain/object refcounting from elements
Andreas reports that the following incremental update using our commit
protocol doesn't work.

 # nft -f incremental-update.nft
 delete element ip filter client_to_any { 10.180.86.22 : goto CIn_1 }
 delete chain ip filter CIn_1
 ... Error: Could not process rule: Device or resource busy

The existing code is not well-integrated into the commit phase protocol,
since element deletions do not result in refcount decrement from the
preparation phase. This results in bogus EBUSY errors like the one
above.

Two new functions come with this patch:

* nft_set_elem_activate() function is used from the abort path, to
  restore the set element refcounting on objects that occurred from
  the preparation phase.

* nft_set_elem_deactivate() that is called from nft_del_setelem() to
  decrement set element refcounting on objects from the preparation
  phase in the commit protocol.

The nft_data_uninit() has been renamed to nft_data_release() since this
function does not uninitialize any data store in the data register,
instead just releases the references to objects. Moreover, a new
function nft_data_hold() has been introduced to be used from
nft_set_elem_activate().

Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:51:41 +02:00
Pablo Neira Ayuso 71df14b0ce netfilter: nf_tables: missing sanitization in data from userspace
Do not assume userspace always sends us NFT_DATA_VALUE for bitwise and
cmp expressions. Although NFT_DATA_VERDICT does not make any sense, it
is still possible to handcraft a netlink message using this incorrect
data type.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:51:40 +02:00
Liping Zhang fa803605ee netfilter: nf_tables: can't assume lock is acquired when dumping set elems
When dumping the elements related to a specified set, we may invoke the
nf_tables_dump_set with the NFNL_SUBSYS_NFTABLES lock not acquired. So
we should use the proper rcu operation to avoid race condition, just
like other nft dump operations.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:51:39 +02:00
Eric Leblond 87e94dbc21 netfilter: synproxy: fix conntrackd interaction
This patch fixes the creation of connection tracking entry from
netlink when synproxy is used. It was missing the addition of
the synproxy extension.

This was causing kernel crashes when a conntrack entry created by
conntrackd was used after the switch of traffic from active node
to the passive node.

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:51:39 +02:00
Willem de Bruijn 324318f024 netfilter: xtables: zero padding in data_to_user
When looking up an iptables rule, the iptables binary compares the
aligned match and target data (XT_ALIGN). In some cases this can
exceed the actual data size to include padding bytes.

Before commit f77bc5b23f ("iptables: use match, target and data
copy_to_user helpers") the malloc()ed bytes were overwritten by the
kernel with kzalloced contents, zeroing the padding and making the
comparison succeed. After this patch, the kernel copies and clears
only data, leaving the padding bytes undefined.

Extend the clear operation from data size to aligned data size to
include the padding bytes, if any.

Padding bytes can be observed in both match and target, and the bug
triggered, by issuing a rule with match icmp and target ACCEPT:

  iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
  iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT

Fixes: f77bc5b23f ("iptables: use match, target and data copy_to_user helpers")
Reported-by: Paul Moore <pmoore@redhat.com>
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:51:38 +02:00
Pablo Neira Ayuso ff1e4300cf Merge tag 'ipvs-fixes-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:

====================
IPVS Fixes for v4.12

please consider this fix to IPVS for v4.12.

* It is a fix from Julian Anastasov to only SNAT SNAT packet replies only for
  NATed connections

My understanding is that this fix is appropriate for 4.9.25, 4.10.13, 4.11
as well as the nf tree. Julian has separately posted backports for other
-stable kernels; please see:

* [PATCH 3.2.88,3.4.113 -stable 1/3] ipvs: SNAT packet replies only for
        NATed connections
* [PATCH 3.10.105,3.12.73,3.16.43,4.1.39 -stable 2/3] ipvs: SNAT packet
        replies only for NATed connections
* [PATCH 4.4.65 -stable 3/3] ipvs: SNAT packet replies only for NATed
        connections
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:50:12 +02:00
Liping Zhang 9338d7b441 netfilter: nfnl_cthelper: reject del request if helper obj is in use
We can still delete the ct helper even if it is in use, this will cause
a use-after-free error. In more detail, I mean:
  # nfct helper add ssdp inet udp
  # iptables -t raw -A OUTPUT -p udp -j CT --helper ssdp
  # nfct helper delete ssdp //--> oops, succeed!
  BUG: unable to handle kernel paging request at 000026ca
  IP: 0x26ca
  [...]
  Call Trace:
   ? ipv4_helper+0x62/0x80 [nf_conntrack_ipv4]
   nf_hook_slow+0x21/0xb0
   ip_output+0xe9/0x100
   ? ip_fragment.constprop.54+0xc0/0xc0
   ip_local_out+0x33/0x40
   ip_send_skb+0x16/0x80
   udp_send_skb+0x84/0x240
   udp_sendmsg+0x35d/0xa50

So add reference count to fix this issue, if ct helper is used by
others, reject the delete request.

Apply this patch:
  # nfct helper delete ssdp
  nfct v1.4.3: netlink error: Device or resource busy

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:42:29 +02:00
Liping Zhang d91fc59cd7 netfilter: introduce nf_conntrack_helper_put helper function
And convert module_put invocation to nf_conntrack_helper_put, this is
prepared for the followup patch, which will add a refcnt for cthelper,
so we can reject the deleting request when cthelper is in use.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:42:29 +02:00
Liping Zhang d110a3942a netfilter: don't setup nat info for confirmed ct
We cannot setup nat info if the ct has been confirmed already, else,
different cpu may race to handle the same ct. In extreme situation,
we may hit the "BUG_ON(nf_nat_initialized(ct, maniptype))" in the
nf_nat_setup_info.

Also running the following commands will easily hit NF_CT_ASSERT in
nf_conntrack_alter_reply:
  # nft flush ruleset
  # ping -c 2 -W 1 1.1.1.111 &
  # nft add table t
  # nft add chain t c {type nat hook postrouting priority 0 \;}
  # nft add rule t c snat to 4.5.6.7
  WARNING: CPU: 1 PID: 10065 at net/netfilter/nf_conntrack_core.c:1472
  nf_conntrack_alter_reply+0x9a/0x1a0 [nf_conntrack]
  [...]
  Call Trace:
   nf_nat_setup_info+0xad/0x840 [nf_nat]
   ? deactivate_slab+0x65d/0x6c0
   nft_nat_eval+0xcd/0x100 [nft_nat]
   nft_do_chain+0xff/0x5d0 [nf_tables]
   ? mark_held_locks+0x6f/0xa0
   ? __local_bh_enable_ip+0x70/0xa0
   ? trace_hardirqs_on_caller+0x11f/0x190
   ? ipt_do_table+0x310/0x610
   ? trace_hardirqs_on+0xd/0x10
   ? __local_bh_enable_ip+0x70/0xa0
   ? ipt_do_table+0x32b/0x610
   ? __lock_acquire+0x2ac/0x1580
   ? ipt_do_table+0x32b/0x610
   nft_nat_do_chain+0x65/0x80 [nft_chain_nat_ipv4]
   nf_nat_ipv4_fn+0x1ae/0x240 [nf_nat_ipv4]
   nf_nat_ipv4_out+0x4a/0xf0 [nf_nat_ipv4]
   nft_nat_ipv4_out+0x15/0x20 [nft_chain_nat_ipv4]
   nf_hook_slow+0x2c/0xf0
   ip_output+0x154/0x270

So for the confirmed ct, just ignore it and return NF_ACCEPT.

Fixes: 9a08ecfe74 ("netfilter: don't attach a nat extension by default")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:42:28 +02:00
Matthias Kaehlcke a2b7cbdd25 netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch
Not all parameters passed to ctnetlink_parse_tuple() and
ctnetlink_exp_dump_tuple() match the enum type in the signatures of these
functions. Since this is intended change the argument type of to be an
unsigned integer value.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-15 12:10:27 +02:00
Linus Torvalds de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Michal Hocko 19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko 752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fallback to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Julian Anastasov 3c5ab3f395 ipvs: SNAT packet replies only for NATed connections
We do not check if packet from real server is for NAT
connection before performing SNAT. This causes problems
for setups that use DR/TUN and allow local clients to
access the real server directly, for example:

- local client in director creates IPVS-DR/TUN connection
CIP->VIP and the request packets are routed to RIP.
Talks are finished but IPVS connection is not expired yet.

- second local client creates non-IPVS connection CIP->RIP
with same reply tuple RIP->CIP and when replies are received
on LOCAL_IN we wrongly assign them for the first client
connection because RIP->CIP matches the reply direction.
As result, IPVS SNATs replies for non-IPVS connections.

The problem is more visible to local UDP clients but in rare
cases it can happen also for TCP or remote clients when the
real server sends the reply traffic via the director.

So, better to be more precise for the reply traffic.
As replies are not expected for DR/TUN connections, better
to not touch them.

Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk>
Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-05-08 11:38:35 +02:00
Linus Torvalds 4ac4d58488 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) The wireless rate info fix from Johannes Berg.

 2) When a RAW socket is in hdrincl mode, we need to make sure that the
    user provided at least a minimally sized ipv4/ipv6 header. Fix from
    Alexander Potapenko.

 3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using
    nla_put_string() so that it is NULL terminated.

 4) Fix a bug in TCP fastopen handling, wherein child sockets
    erroneously inherit the fastopen_req from the parent, and later can
    end up derefencing freed memory or doing a double free. From Eric
    Dumazet.

 5) Don't clear out netdev stats at close time in tg3 driver, from
    YueHaibing.

 6) Fix refcount leak in xt_CT, from Gao Feng.

 7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang.

 8) Fix deadlock due to taking the expectation lock twice, also from
    Liping Zhang.

 9) Make xt_socket work again with ipv6, from Peter Tirsek.

10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo
    Abeni.

11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry
    layout. From Jesper Dangaard Brouer.

12) Fix ethtool reported device name in aquantia driver, from Pavel
    Belous.

13) Fix build failures due to the compile time size test not working in
    netfilter conntrack. From Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  cfg80211: make RATE_INFO_BW_20 the default
  ipv6: initialize route null entry in addrconf_init()
  qede: Fix possible misconfiguration of advertised autoneg value.
  qed: Fix overriding of supported autoneg value.
  qed*: Fix possible overflow for status block id field.
  rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
  netvsc: make sure napi enabled before vmbus_open
  aquantia: Fix driver name reported by ethtool
  ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
  net/sched: remove redundant null check on head
  tcp: do not inherit fastopen_req from parent
  forcedeth: remove unnecessary carrier status check
  ibmvnic: Move queue restarting in ibmvnic_tx_complete
  ibmvnic: Record SKB RX queue during poll
  ibmvnic: Continue skb processing after skb completion error
  ibmvnic: Check for driver reset first in ibmvnic_xmit
  ibmvnic: Wait for any pending scrqs entries at driver close
  ibmvnic: Clean up tx pools when closing
  ibmvnic: Whitespace correction in release_rx_pools
  ibmvnic: Delete napi's when releasing driver resources
  ...
2017-05-04 12:26:43 -07:00
Linus Torvalds 46f0537b1e Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "Fourteen audit patches for v4.12 that span the full range of fixes,
  new features, and internal cleanups.

  We have a patches to move to 64-bit timestamps, convert refcounts from
  atomic_t to refcount_t, track PIDs using the pid struct instead of
  pid_t, convert our own private audit buffer cache to a standard
  kmem_cache, log kernel module names when they are unloaded, and
  normalize the NETFILTER_PKT to make the userspace folks happier.

  From a fixes perspective, the most important is likely the auditd
  connection tracking RCU fix; it was a rather brain dead bug that I'll
  take the blame for, but thankfully it didn't seem to affect many
  people (only one report).

  I think the patch subject lines and commit descriptions do a pretty
  good job of explaining the details and why the changes are important
  so I'll point you there instead of duplicating it here; as usual, if
  you have any questions you know where to find us.

  We also manage to take out more code than we put in this time, that
  always makes me happy :)"

* 'stable-4.12' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix the RCU locking for the auditd_connection structure
  audit: use kmem_cache to manage the audit_buffer cache
  audit: Use timespec64 to represent audit timestamps
  audit: store the auditd PID as a pid struct instead of pid_t
  audit: kernel generated netlink traffic should have a portid of 0
  audit: combine audit_receive() and audit_receive_skb()
  audit: convert audit_watch.count from atomic_t to refcount_t
  audit: convert audit_tree.count from atomic_t to refcount_t
  audit: normalize NETFILTER_PKT
  netfilter: use consistent ipv4 network offset in xt_AUDIT
  audit: log module name on delete_module
  audit: remove unnecessary semicolon in audit_watch_handle_event()
  audit: remove unnecessary semicolon in audit_mark_handle_event()
  audit: remove unnecessary semicolon in audit_field_valid()
2017-05-03 09:21:59 -07:00
David S. Miller 4d89ac2dd5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS fixes for net

The following patchset contains a rather large batch of Netfilter, IPVS
and OVS fixes for your net tree. This includes fixes for ctnetlink, the
userspace conntrack helper infrastructure, conntrack OVS support,
ebtables DNAT target, several leaks in error path among other. More
specifically, they are:

1) Fix reference count leak in the CT target error path, from Gao Feng.

2) Remove conntrack entry clashing with a matching expectation, patch
   from Jarno Rajahalme.

3) Fix bogus EEXIST when registering two different userspace helpers,
   from Liping Zhang.

4) Don't leak dummy elements in the new bitmap set type in nf_tables,
   from Liping Zhang.

5) Get rid of module autoload from conntrack update path in ctnetlink,
   we don't need autoload at this late stage and it is happening with
   rcu read lock held which is not good. From Liping Zhang.

6) Fix deadlock due to double-acquire of the expect_lock from conntrack
   update path, this fixes a bug that was introduced when the central
   spinlock got removed. Again from Liping Zhang.

7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock
   protection that was selected when the central spinlock was removed was
   not really protecting anything at all.

8) Protect sequence adjustment under ct->lock.

9) Missing socket match with IPv6, from Peter Tirsek.

10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
    Linus Luessing.

11) Don't give up on evaluating the expression on new entries added via
    dynset expression in nf_tables, from Liping Zhang.

12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
    with non-linear skbuffs.

13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
    from Paolo Abeni.

14) Missing mutex release in error path of xt_find_table_lock(), from
    Dan Carpenter.

15) Update maintainers files, Netfilter section. Add Florian to the
    file, refer to nftables.org and change project status from Supported
    to Maintained.

16) Bail out on mismatching extensions in element updates in nf_tables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03 10:11:26 -04:00
Geert Uytterhoeven ab71632c45 netfilter: conntrack: Force inlining of build check to prevent build failure
If gcc (e.g. 4.1.2) decides not to inline total_extension_size(), the
build will fail with:

    net/built-in.o: In function `nf_conntrack_init_start':
    (.text+0x9baf6): undefined reference to `__compiletime_assert_1893'

or

    ERROR: "__compiletime_assert_1893" [net/netfilter/nf_conntrack.ko] undefined!

Fix this by forcing inlining of total_extension_size().

Fixes: b3a5db109e ("netfilter: conntrack: use u8 for extension sizes again")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03 09:51:26 -04:00
Pablo Neira Ayuso 9744a6fcef netfilter: nf_tables: check if same extensions are set when adding elements
If no NLM_F_EXCL is set and the element already exists in the set, make
sure that both elements have the same extensions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-03 10:58:00 +02:00
Richard Guy Briggs 2173c519d5 audit: normalize NETFILTER_PKT
Eliminate flipping in and out of message fields, dropping fields in the
process.

Sample raw message format IPv4 UDP:
type=NETFILTER_PKT msg=audit(1487874761.386:228):  mark=0xae8a2732 saddr=127.0.0.1 daddr=127.0.0.1 proto=17^]
Sample raw message format IPv6 ICMP6:
type=NETFILTER_PKT msg=audit(1487874761.381:227):  mark=0x223894b7 saddr=::1 daddr=::1 proto=58^]

Issue: https://github.com/linux-audit/audit-kernel/issues/11
Test case: https://github.com/linux-audit/audit-testsuite/issues/43

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-02 10:16:04 -04:00
Richard Guy Briggs 0cb88b6ff0 netfilter: use consistent ipv4 network offset in xt_AUDIT
Even though the skb->data pointer has been moved from the link layer
header to the network layer header, use the same method to calculate the
offset in ipv4 and ipv6 routines.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: munged subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-05-02 10:16:04 -04:00
David S. Miller a01aa920b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:

1) Check for ct->status bit instead of using nfct_nat() from IPVS and
   Netfilter codebase, patch from Florian Westphal.

2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

4) Introduce nft_is_base_chain() helper function.

5) Enforce expectation limit from userspace conntrack helper,
   from Gao Feng.

6) Add nf_ct_remove_expect() helper function, from Gao Feng.

7) NAT mangle helper function return boolean, from Gao Feng.

8) ctnetlink_alloc_expect() should only work for conntrack with
   helpers, from Gao Feng.

9) Add nfnl_msg_type() helper function to nfnetlink to build the
   netlink message type.

10) Get rid of unnecessary cast on void, from simran singhal.

11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

12) Use list_prev_entry() from nf_tables, from simran signhal.

13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks are already
    guaranteed to run under RCU read side.

17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.

20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.

23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

24) Add event mask support to nft_ct, also from Florian.

25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

29) Shrink size of nf_conntrack_ecache structure, from Florian.

30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

31) Register SYNPROXY hooks on demand, from Florian Westphal.

32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime time calculation of this. From Florian.

36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

41) Silence stack size warning gcc in 32-bit arch in snmp helper,
    from Florian.

42) Inconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01 10:47:53 -04:00
Liping Zhang 8eeef23504 netfilter: nf_ct_ext: invoke destroy even when ext is not attached
For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table,
then remove it from the nat_bysource_table via nat_extend->destroy.

But now, the nat extension is attached on demand, so if the nat extension
is not attached, we will not be notified when the ct is destroyed, i.e.
we may fail to remove ct from the nat_bysource_table.

So just keep it simple, even if the extension is not attached, we will
still invoke the related ext->destroy. And this will also preserve the
flexibility for the future extension.

Fixes: 9a08ecfe74 ("netfilter: don't attach a nat extension by default")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:48:49 +02:00
Pablo Neira Ayuso d1908ca8dc Merge tag 'ipvs3-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Third Round of IPVS Updates for v4.12

please consider these enhancements to IPVS for v4.12.
If it is too late for v4.12 then please consider them for v4.13.

* Remove unused function
* Correct comparison of unsigned value
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:46:50 +02:00
Florian Westphal 039b40ee58 netfilter: nf_queue: only call synchronize_net twice if nf_queue is active
nf_unregister_net_hook(s) can avoid a second call to synchronize_net,
provided there is no nfqueue active in that net namespace (which is
the common case).

This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally
this gets called during netns cleanup so no packets should be queued.

For the rare case of base chain being unregistered or module removal
while nfqueue is in use the extra hiccup due to the packet drops isn't
a big deal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:12 +02:00
Florian Westphal c83fa19603 netfilter: nf_log: don't call synchronize_rcu in nf_log_unset
nf_log_unregister() (which is what gets called in the logger backends
module exit paths) does a (required, module is removed) synchronize_rcu().

But nf_log_unset() is only called from pernet exit handlers. It doesn't
free any memory so there appears to be no need to call synchronize_rcu.

v2: Liping Zhang points out that nf_log_unregister() needs to be called
after pernet unregister, else rmmod would become unsafe.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:19:07 +02:00
Florian Westphal 933bd83ed6 netfilter: batch synchronize_net calls during hook unregister
synchronize_net is expensive and slows down netns cleanup a lot.

We have two APIs to unregister a hook:
nf_unregister_net_hook (which calls synchronize_net())
and
nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop)

Make nf_unregister_net_hook a wapper around new helper
__nf_unregister_net_hook, which unlinks the hook but does not free it.

Then, we can call that helper in nf_unregister_net_hooks and then
call synchronize_net() only once.

Andrey Konovalov reports this change improves syzkaller fuzzing speed at
least twice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-05-01 11:18:54 +02:00
Pablo Neira Ayuso 1a41dbce0d Merge tag 'ipvs-fixes-for-v4.11' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:

====================
IPVS Fixes for v4.11

I would also like it considered for stable.

* Explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
  to avoid oops caused by IPVS accesing IPv6 routing code in such
  circumstances.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-28 16:22:11 +02:00
Dan Carpenter 7dde07e9c5 netfilter: x_tables: unlock on error in xt_find_table_lock()
According to my static checker we should unlock here before the return.
That seems reasonable to me as well.

Fixes" b9e69e1273 ("netfilter: xtables: don't hook tables by default")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-28 15:49:48 +02:00
Paolo Abeni 1442f6f7c1 ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
When creating a new ipvs service, ipv6 addresses are always accepted
if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
is not explicitly checked.

This allows the user-space to configure ipvs services even if the
system is booted with ipv6.disable=1. On specific configuration, ipvs
can try to call ipv6 routing code at setup time, causing the kernel to
oops due to fib6_rules_ops being NULL.

This change addresses the issue adding a check for the ipv6
module being enabled while validating ipv6 service operations and
adding the same validation for dest operations.

According to git history, this issue is apparently present since
the introduction of ipv6 support, and the oops can be triggered
since commit 09571c7ae3 ("IPVS: Add function to determine
if IPv6 address is local")

Fixes: 09571c7ae3 ("IPVS: Add function to determine if IPv6 address is local")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:04:35 +02:00
Aaron Conole fb90e8dedb ipvs: change comparison on sync_refresh_period
The sync_refresh_period variable is unsigned, so it can never be < 0.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:00:10 +02:00
Aaron Conole 65ba101ebc ipvs: remove unused function ip_vs_set_state_timeout
There are no in-tree callers of this function and it isn't exported.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28 12:00:10 +02:00
Florian Westphal 9a08ecfe74 netfilter: don't attach a nat extension by default
nowadays the NAT extension only stores the interface index
(used to purge connections that got masqueraded when interface goes down)
and pptp nat information.

Previous patches moved nf_ct_nat_ext_add to those places that need it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 2fe7c321ab netfilter: pptp: attach nat extension when needed
make sure nat extension gets added if the master conntrack is subject to
NAT.  This will be required once the nat core stops adding it by default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 22d4536d2c netfilter: conntrack: handle initial extension alloc via krealloc
krealloc(NULL, ..) is same as kmalloc(), so we can avoid special-casing
the initial allocation after the prealloc removal (we had to use
->alloc_len as the initial allocation size).

This also means we do not zero the preallocated memory anymore; only
offsets[].  Existing code makes sure the new (used) extension space gets
zeroed out.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 23f671a1b5 netfilter: conntrack: mark extension structs as const
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal 54044b1f02 netfilter: conntrack: remove prealloc support
It was used by the nat extension, but since commit
7c96643519 ("netfilter: move nat hlist_head to nf_conn") its only needed
for connections that use MASQUERADE target or a nat helper.

Also it seems a lot easier to preallocate a fixed size instead.

With default settings, conntrack first adds ecache extension (sysctl
defaults to 1), so we get 40(ct extension header) + 24 (ecache) == 64 byte
on x86_64 for initial allocation.

Followup patches can constify the extension structs and avoid
the initial zeroing of the entire extension area.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:22 +02:00
Florian Westphal efe4160618 ipvs: convert to use pernet nf_hook api
nf_(un)register_hooks has to maintain an internal hook list to add/remove
those hooks from net namespaces as they are added/deleted.

ipvs already uses pernet_ops, so we can switch to the (more recent)
pernet hook api instead.

Compile tested only.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-26 09:30:21 +02:00
Liping Zhang 277a292835 netfilter: nft_dynset: continue to next expr if _OP_ADD succeeded
Currently, after adding the following nft rules:
  # nft add set x target1 { type ipv4_addr \; flags timeout \;}
  # nft add rule x y set add ip daddr timeout 1d @target1 counter

the counters will always be zero despite of the elements are added
to the dynamic set "target1" or not, as we will break the nft expr
traversal unconditionally:
  # nft list ruleset
  ...
  set target1 {
      ...
      elements = { 8.8.8.8 expires 23h59m53s}
  }
  chain output {
      ...
      set add ip daddr timeout 1d @target1 counter packets 0 bytes 0
                                                           ^       ^
      ...
  }

Since we add the elements to the set successfully, we should continue
to the next expression.

Additionally, if elements are added to "flow table" successfully, we
will _always_ continue to the next expr, even if the operation is
_OP_ADD. So it's better to keep them to be consistent.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-25 11:10:37 +02:00
Peter Tirsek 6bd3d19292 netfilter: xt_socket: Fix broken IPv6 handling
Commit 834184b1f3 ("netfilter: defrag: only register defrag
functionality if needed") used the outdated XT_SOCKET_HAVE_IPV6 macro
which was removed earlier in commit 8db4c5be88 ("netfilter: move
socket lookup infrastructure to nf_socket_ipv{4,6}.c"). With that macro
never being defined, the xt_socket match emits an "Unknown family 10"
warning when used with IPv6:

WARNING: CPU: 0 PID: 1377 at net/netfilter/xt_socket.c:160 socket_mt_enable_defrag+0x47/0x50 [xt_socket]
Unknown family 10
Modules linked in: xt_socket nf_socket_ipv4 nf_socket_ipv6 nf_defrag_ipv4 [...]
CPU: 0 PID: 1377 Comm: ip6tables-resto Not tainted 4.10.10 #1
Hardware name: [...]
Call Trace:
? __warn+0xe7/0x100
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? warn_slowpath_fmt+0x39/0x40
? socket_mt_enable_defrag+0x47/0x50 [xt_socket]
? socket_mt_v2_check+0x12/0x40 [xt_socket]
? xt_check_match+0x6b/0x1a0 [x_tables]
? xt_find_match+0x93/0xd0 [x_tables]
? xt_request_find_match+0x20/0x80 [x_tables]
? translate_table+0x48e/0x870 [ip6_tables]
? translate_table+0x577/0x870 [ip6_tables]
? walk_component+0x3a/0x200
? kmalloc_order+0x1d/0x50
? do_ip6t_set_ctl+0x181/0x490 [ip6_tables]
? filename_lookup+0xa5/0x120
? nf_setsockopt+0x3a/0x60
? ipv6_setsockopt+0xb0/0xc0
? sock_common_setsockopt+0x23/0x30
? SyS_socketcall+0x41d/0x630
? vfs_read+0xfa/0x120
? do_fast_syscall_32+0x7a/0x110
? entry_SYSENTER_32+0x47/0x71

This patch brings the conditional back in line with how the rest of the
file handles IPv6.

Fixes: 834184b1f3 ("netfilter: defrag: only register defrag functionality if needed")
Signed-off-by: Peter Tirsek <peter@tirsek.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:29 +02:00
Liping Zhang 64f3967c7a netfilter: ctnetlink: acquire ct->lock before operating nf_ct_seqadj
We should acquire the ct->lock before accessing or modifying the
nf_ct_seqadj, as another CPU may modify the nf_ct_seqadj at the same
time during its packet proccessing.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:29 +02:00
Liping Zhang 53b56da83d netfilter: ctnetlink: make it safer when updating ct->status
After converting to use rcu for conntrack hash, one CPU may update
the ct->status via ctnetlink, while another CPU may process the
packets and update the ct->status.

So the non-atomic operation "ct->status |= status;" via ctnetlink
becomes unsafe, and this may clear the IPS_DYING_BIT bit set by
another CPU unexpectedly. For example:
         CPU0                            CPU1
  ctnetlink_change_status        __nf_conntrack_find_get
      old = ct->status              nf_ct_gc_expired
          -                         nf_ct_kill
          -                      test_and_set_bit(IPS_DYING_BIT
      new = old | status;                 -
  ct->status = new; <-- oops, _DYING_ is cleared!

Now using a series of atomic bit operation to solve the above issue.

Also note, user shouldn't set IPS_TEMPLATE, IPS_SEQ_ADJUST directly,
so make these two bits be unchangable too.

If we set the IPS_TEMPLATE_BIT, ct will be freed by nf_ct_tmpl_free,
but actually it is alloced by nf_conntrack_alloc.
If we set the IPS_SEQ_ADJUST_BIT, this may cause the NULL pointer
deference, as the nfct_seqadj(ct) maybe NULL.

Last, add some comments to describe the logic change due to the
commit a963d710f3 ("netfilter: ctnetlink: Fix regression in CTA_STATUS
processing"), which makes me feel a little confusing.

Fixes: 76507f69c4 ("[NETFILTER]: nf_conntrack: use RCU for conntrack hash")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 88be4c09d9 netfilter: ctnetlink: fix deadlock due to acquire _expect_lock twice
Currently, ctnetlink_change_conntrack is always protected by _expect_lock,
but this will cause a deadlock when deleting the helper from a conntrack,
as the _expect_lock will be acquired again by nf_ct_remove_expectations:

         CPU0
        ----
  lock(nf_conntrack_expect_lock);
  lock(nf_conntrack_expect_lock);

  *** DEADLOCK ***
  May be due to missing lock nesting notation

  2 locks held by lt-conntrack_gr/12853:
  #0:  (&table[i].mutex){+.+.+.}, at: [<ffffffffa05e2009>]
       nfnetlink_rcv_msg+0x399/0x6a9 [nfnetlink]
  #1:  (nf_conntrack_expect_lock){+.....}, at: [<ffffffffa05f2c1f>]
       ctnetlink_new_conntrack+0x17f/0x408 [nf_conntrack_netlink]

  Call Trace:
   dump_stack+0x85/0xc2
   __lock_acquire+0x1608/0x1680
   ? ctnetlink_parse_tuple_proto+0x10f/0x1c0 [nf_conntrack_netlink]
   lock_acquire+0x100/0x1f0
   ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   _raw_spin_lock_bh+0x3f/0x50
   ? nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   nf_ct_remove_expectations+0x32/0x90 [nf_conntrack]
   ctnetlink_change_helper+0xc6/0x190 [nf_conntrack_netlink]
   ctnetlink_new_conntrack+0x1b2/0x408 [nf_conntrack_netlink]
   nfnetlink_rcv_msg+0x60a/0x6a9 [nfnetlink]
   ? nfnetlink_rcv_msg+0x1b9/0x6a9 [nfnetlink]
   ? nfnetlink_bind+0x1a0/0x1a0 [nfnetlink]
   netlink_rcv_skb+0xa4/0xc0
   nfnetlink_rcv+0x87/0x770 [nfnetlink]

Since the operations are unrelated to nf_ct_expect, so we can drop the
_expect_lock. Also note, after removing the _expect_lock protection,
another CPU may invoke nf_conntrack_helper_unregister, so we should
use rcu_read_lock to protect __nf_conntrack_helper_find invoked by
ctnetlink_change_helper.

Fixes: ca7433df3a ("netfilter: conntrack: seperate expect locking from nf_conntrack_lock")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 14e5676156 netfilter: ctnetlink: drop the incorrect cthelper module request
First, when creating a new ct, we will invoke request_module to try to
load the related inkernel cthelper. So there's no need to call the
request_module again when updating the ct helpinfo.

Second, ctnetlink_change_helper may be called with rcu_read_lock held,
i.e. rcu_read_lock -> nfqnl_recv_verdict -> nfqnl_ct_parse ->
ctnetlink_glue_parse -> ctnetlink_glue_parse_ct ->
ctnetlink_change_helper. But the request_module invocation may sleep,
so we can't call it with the rcu_read_lock held.

Remove it now.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:06:28 +02:00
Liping Zhang 54a5f9d9ab netfilter: nft_set_bitmap: free dummy elements when destroy the set
We forget to free dummy elements when deleting the set. So when I was
running nft-test.py, I saw many kmemleak warnings:
  kmemleak: 1344 new suspected memory leaks ...
  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff8800631345c8 (size 32):
  comm "nft", pid 9075, jiffies 4295743309 (age 1354.815s)
  hex dump (first 32 bytes):
    f8 63 13 63 00 88 ff ff 88 79 13 63 00 88 ff ff  .c.c.....y.c....
    04 0c 00 00 00 00 00 00 00 00 00 00 08 03 00 00  ................
  backtrace:
    [<ffffffff819059da>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81288174>] __kmalloc+0x164/0x310
    [<ffffffffa061269d>] nft_set_elem_init+0x3d/0x1b0 [nf_tables]
    [<ffffffffa06130da>] nft_add_set_elem+0x45a/0x8c0 [nf_tables]
    [<ffffffffa0613645>] nf_tables_newsetelem+0x105/0x1d0 [nf_tables]
    [<ffffffffa05fe6d4>] nfnetlink_rcv+0x414/0x770 [nfnetlink]
    [<ffffffff817f0ca6>] netlink_unicast+0x1f6/0x310
    [<ffffffff817f10c6>] netlink_sendmsg+0x306/0x3b0
  ...

Fixes: e920dde516 ("netfilter: nft_set_bitmap: keep a list of dummy elements")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:05:25 +02:00
Liping Zhang 66e5a6b18b netfilter: nf_ct_helper: permit cthelpers with different names via nfnetlink
cthelpers added via nfnetlink may have the same tuple, i.e. except for
the l3proto and l4proto, other fields are all zero. So even with the
different names, we will also fail to add them:
  # nfct helper add ssdp inet udp
  # nfct helper add tftp inet udp
  nfct v1.4.3: netlink error: File exists

So in order to avoid unpredictable behaviour, we should:
1. cthelpers can be selected by nft ct helper obj or xt_CT target, so
report error if duplicated { name, l3proto, l4proto } tuple exist.
2. cthelpers can be selected by nf_ct_tuple_src_mask_cmp when
nf_ct_auto_assign_helper is enabled, so also report error if duplicated
{ l3proto, l4proto, src-port } tuple exist.

Also note, if the cthelper is added from userspace, then the src-port will
always be zero, it's invalid for nf_ct_auto_assign_helper, so there's no
need to check the second point listed above.

Fixes: 893e093c78 ("netfilter: nf_ct_helper: bail out on duplicated helpers")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:05:05 +02:00
Gao Feng 470acf55a0 netfilter: xt_CT: fix refcnt leak on error path
There are two cases which causes refcnt leak.

1. When nf_ct_timeout_ext_add failed in xt_ct_set_timeout, it should
free the timeout refcnt.
Now goto the err_put_timeout error handler instead of going ahead.

2. When the time policy is not found, we should call module_put.
Otherwise, the related cthelper module cannot be removed anymore.
It is easy to reproduce by typing the following command:
  # iptables -t raw -A OUTPUT -p tcp -j CT --helper ftp --timeout xxx

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-24 20:03:01 +02:00
Ingo Molnar 58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Gao Feng 122868b378 netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14
The window scale may be enlarged from 14 to 15 according to the itef
draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03.

Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in
the future.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal be7be6e161 netfilter: ipvs: fix incorrect conflict resolution
The commit ab8bc7ed86
("netfilter: remove nf_ct_is_untracked")
changed the line
   if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) {
	   to
   if (ct && nfct_nat(ct)) {

meanwhile, the commit 41390895e5
("netfilter: ipvs: don't check for presence of nat extension")
from ipvs-next had changed the same line to

  if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) {

When ipvs-next got merged into nf-next, the merge resolution took
the first version, dropping the conversion of nfct_nat().

While this doesn't cause a problem at the moment, it will once we stop
adding the nat extension by default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 01026edef9 nefilter: eache: reduce struct size from 32 to 24 byte
Only "cache" needs to use ulong (its used with set_bit()), missed can use
u16.  Also add build-time assertion to ensure event bits fit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal c6dd940b1f netfilter: allow early drop of assured conntracks
If insertion of a new conntrack fails because the table is full, the kernel
searches the next buckets of the hash slot where the new connection
was supposed to be inserted at for an entry that hasn't seen traffic
in reply direction (non-assured), if it finds one, that entry is
is dropped and the new connection entry is allocated.

Allow the conntrack gc worker to also remove *assured* conntracks if
resources are low.

Do this by querying the l4 tracker, e.g. tcp connections are now dropped
if they are no longer established (e.g. in finwait).

This could be refined further, e.g. by adding 'soft' established timeout
(i.e., a timeout that is only used once we get close to resource
exhaustion).

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal b3a5db109e netfilter: conntrack: use u8 for extension sizes again
commit 223b02d923
("netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len")
had to increase size of the extension offsets because total size of the
extensions had increased to a point where u8 did overflow.

3 years later we've managed to diet extensions a bit and we no longer
need u16.  Furthermore we can now add a compile-time assertion for this
problem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal faec865db9 netfilter: remove last traces of variable-sized extensions
get rid of the (now unused) nf_ct_ext_add_length define and also
rename the function to plain nf_ct_ext_add().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 9f0f3ebeda netfilter: helpers: remove data_len usage for inkernel helpers
No need to track this for inkernel helpers anymore as
NF_CT_HELPER_BUILD_BUG_ON checks do this now.

All inkernel helpers know what kind of structure they
stored in helper->data.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal 157ffffeb5 netfilter: nfnetlink_cthelper: reject too large userspace allocation requests
Userspace should not abuse the kernel to store large amounts of data,
reject requests larger than the private area can accommodate.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:17 +02:00
Florian Westphal dcf67740f2 netfilter: helper: add build-time asserts for helper data size
add a 32 byte scratch area in the helper struct instead of relying
on variable sized helpers plus compile-time asserts to let us know
if 32 bytes aren't enough anymore.

Not having variable sized helpers will later allow to add BUILD_BUG_ON
for the total size of conntrack extensions -- the helper extension is
the only one that doesn't have a fixed size.

The (useless!) NF_CT_HELPER_BUILD_BUG_ON(0); are added so that in case
someone adds a new helper and copy-pastes from one that doesn't store
private data at least some indication that this macro should be used
somehow is there...

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:16 +02:00
Florian Westphal 694a0055f0 netfilter: nft_ct: allow to set ctnetlink event types of a connection
By default the kernel emits all ctnetlink events for a connection.
This allows to select the types of events to generate.

This can be used to e.g. only send DESTROY events but no NEW/UPDATE ones
and will work even if sysctl net.netfilter.nf_conntrack_events is set to 0.

This was already possible via iptables' CT target, but the nft version has
the advantage that it can also be used with already-established conntracks.

The added nf_ct_is_template() check isn't a bug fix as we only support
mark and labels (and unlike ecache the conntrack core doesn't copy those).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19 17:55:16 +02:00
Paul E. McKenney 5f0d5a3ae7 mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section.  Of course, that is not the
case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety".  This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
  Dumazet, in order to help people familiar with the old name find
  the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
2017-04-18 11:42:36 -07:00
David S. Miller 6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00
Florian Westphal ab8bc7ed86 netfilter: remove nf_ct_is_untracked
This function is now obsolete and always returns false.
This change has no effect on generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:51:33 +02:00
Florian Westphal cc41c84b7e netfilter: kill the fake untracked conntrack objects
resurrect an old patch from Pablo Neira to remove the untracked objects.

Currently, there are four possible states of an skb wrt. conntrack.

1. No conntrack attached, ct is NULL.
2. Normal (kmem cache allocated) ct attached.
3. a template (kmalloc'd), not in any hash tables at any point in time
4. the 'untracked' conntrack, a percpu nf_conn object, tagged via
   IPS_UNTRACKED_BIT in ct->status.

Untracked is supposed to be identical to case 1.  It exists only
so users can check

-m conntrack --ctstate UNTRACKED vs.
-m conntrack --ctstate INVALID

e.g. attempts to set connmark on INVALID or UNTRACKED conntracks is
supposed to be a no-op.

Thus currently we need to check
 ct == NULL || nf_ct_is_untracked(ct)

in a lot of places in order to avoid altering untracked objects.

The other consequence of the percpu untracked object is that all
-j NOTRACK (and, later, kfree_skb of such skbs) result in an atomic op
(inc/dec the untracked conntracks refcount).

This adds a new kernel-private ctinfo state, IP_CT_UNTRACKED, to
make the distinction instead.

The (few) places that care about packet invalid (ct is NULL) vs.
packet untracked now need to test ct == NULL vs. ctinfo == IP_CT_UNTRACKED,
but all other places can omit the nf_ct_is_untracked() check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:47:57 +02:00
Gao Feng 6e354a5e56 netfilter: ecache: Refine the nf_ct_deliver_cached_events
1. Remove single !events condition check to deliver the missed event
even though there is no new event happened.

Consider this case:
1) nf_ct_deliver_cached_events is invoked at the first time, the
event is failed to deliver, then the missed is set.
2) nf_ct_deliver_cached_events is invoked again, but there is no
any new event happened.
The missed event is lost really.

It would try to send the missed event again after remove this check.
And it is ok if there is no missed event because the latter check
!((events | missed) & e->ctmask) could avoid it.

2. Correct the return value check of notify->fcn.
When send the event successfully, it returns 0, not postive value.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:43:49 +02:00
Gao Feng 7025bac47f netfilter: nf_nat: Fix return NF_DROP in nfnetlink_parse_nat_setup
The __nf_nat_alloc_null_binding invokes nf_nat_setup_info which may
return NF_DROP when memory is exhausted, so convert NF_DROP to -ENOMEM
to make ctnetlink happy. Or ctnetlink_setup_nat treats it as a success
when one error NF_DROP happens actully.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 11:04:14 +02:00
Pablo Neira Ayuso a702ece3b1 Merge tag 'ipvs2-for-v4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.12

please consider these clean-ups and enhancements to IPVS for v4.12.

* Removal unused variable
* Use kzalloc where appropriate
* More efficient detection of presence of NAT extension
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Conflicts:
	net/netfilter/ipvs/ip_vs_ftp.c
2017-04-15 10:54:40 +02:00
Aaron Conole db268d4dfd ipset: remove unused function __ip_set_get_netlink
There are no in-tree callers.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15 10:24:41 +02:00
Aaron Conole 809c2d9a3b netfilter: nf_conntrack: remove double assignment
The protonet pointer will unconditionally be rewritten, so just do the
needed assignment first.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-14 01:54:23 +02:00
Aaron Conole 7925056827 netfilter: nf_tables: remove double return statement
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-14 01:54:19 +02:00
Liping Zhang 79e09ef96b netfilter: nft_hash: do not dump the auto generated seed
This can prevent the nft utility from printing out the auto generated
seed to the user, which is unnecessary and confusing.

Fixes: cb1b69b0b1 ("netfilter: nf_tables: add hash expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-13 23:20:13 +02:00