1
0
Fork 0
Commit Graph

797243 Commits (a0badcc6652f9871a9908d67297f910cba657b0f)

Author SHA1 Message Date
Yafang Shao a0badcc665 netfilter: conntrack: register sysctl table for gre
This patch adds two sysctl knobs for GRE:

	net.netfilter.nf_conntrack_gre_timeout = 30
	net.netfilter.nf_conntrack_gre_timeout_stream = 180

Update the Documentation as well.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:25 +01:00
Florian Westphal 294304e4c5 netfilter: conntrack: udp: set stream timeout to 2 minutes
We have no explicit signal when a UDP stream has terminated, peers just
stop sending.

For suspected stream connections a timeout of two minutes is sane to keep
NAT mapping alive a while longer.

It matches tcp conntracks 'timewait' default timeout value.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:48:46 +01:00
Florian Westphal d535c8a69c netfilter: conntrack: udp: only extend timeout to stream mode after 2s
Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.

Add a two second grace period where we continue to use the ordinary
30-second default timeout.  Its enough for DNS request/response traffic,
even if two request/reply packets are involved.

ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:48:38 +01:00
Taehee Yoo 06aa151ad1 netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set
If same destination IP address config is already existing, that config is
just used. MAC address also should be same.
However, there is no MAC address checking routine.
So that MAC address checking routine is added.

test commands:
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	   -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1

After this patch, above commands are disallowed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 01:18:38 +01:00
Taehee Yoo 2a61d8b883 netfilter: ipt_CLUSTERIP: fix sleep-in-atomic bug in clusterip_config_entry_put()
A proc_remove() can sleep. so that it can't be inside of spin_lock.
Hence proc_remove() is moved to outside of spin_lock. and it also
adds mutex to sync create and remove of proc entry(config->pde).

test commands:
SHELL#1
   %while :; do iptables -A INPUT -p udp -i enp2s0 -d 192.168.1.100 \
	   --dport 9000  -j CLUSTERIP --new --hashmode sourceip \
	   --clustermac 01:00:5e:00:00:21 --total-nodes 3 --local-node 3; \
	   iptables -F; done

SHELL#2
   %while :; do echo +1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; \
	   echo -1 > /proc/net/ipt_CLUSTERIP/192.168.1.100; done

[ 2949.569864] BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
[ 2949.579944] in_atomic(): 1, irqs_disabled(): 0, pid: 5472, name: iptables
[ 2949.587920] 1 lock held by iptables/5472:
[ 2949.592711]  #0: 000000008f0ebcf2 (&(&cn->lock)->rlock){+...}, at: refcount_dec_and_lock+0x24/0x50
[ 2949.603307] CPU: 1 PID: 5472 Comm: iptables Tainted: G        W         4.19.0-rc5+ #16
[ 2949.604212] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 2949.604212] Call Trace:
[ 2949.604212]  dump_stack+0xc9/0x16b
[ 2949.604212]  ? show_regs_print_info+0x5/0x5
[ 2949.604212]  ___might_sleep+0x2eb/0x420
[ 2949.604212]  ? set_rq_offline.part.87+0x140/0x140
[ 2949.604212]  ? _rcu_barrier_trace+0x400/0x400
[ 2949.604212]  wait_for_completion+0x94/0x710
[ 2949.604212]  ? wait_for_completion_interruptible+0x780/0x780
[ 2949.604212]  ? __kernel_text_address+0xe/0x30
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __lockdep_init_map+0x10e/0x5c0
[ 2949.604212]  ? __init_waitqueue_head+0x86/0x130
[ 2949.604212]  ? init_wait_entry+0x1a0/0x1a0
[ 2949.604212]  proc_entry_rundown+0x208/0x270
[ 2949.604212]  ? proc_reg_get_unmapped_area+0x370/0x370
[ 2949.604212]  ? __lock_acquire+0x4500/0x4500
[ 2949.604212]  ? complete+0x18/0x70
[ 2949.604212]  remove_proc_subtree+0x143/0x2a0
[ 2949.708655]  ? remove_proc_entry+0x390/0x390
[ 2949.708655]  clusterip_tg_destroy+0x27a/0x630 [ipt_CLUSTERIP]
[ ... ]

Fixes: b3e456fce9 ("netfilter: ipt_CLUSTERIP: fix a race condition of proc file creation")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 01:18:24 +01:00
Taehee Yoo b12f7bad5a netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine
When network namespace is destroyed, both clusterip_tg_destroy() and
clusterip_net_exit() are called. and clusterip_net_exit() is called
before clusterip_tg_destroy().
Hence cleanup check code in clusterip_net_exit() doesn't make sense.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.184508] WARNING: CPU: 1 PID: 87 at net/ipv4/netfilter/ipt_CLUSTERIP.c:840 clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.184850] Modules linked in: ipt_CLUSTERIP nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp iptable_filter bpfilter ip_tables x_tables
[  341.184850] CPU: 1 PID: 87 Comm: kworker/u4:2 Not tainted 4.19.0-rc5+ #16
[  341.227509] Workqueue: netns cleanup_net
[  341.227509] RIP: 0010:clusterip_net_exit+0x319/0x380 [ipt_CLUSTERIP]
[  341.227509] Code: 0f 85 7f fe ff ff 48 c7 c2 80 64 2c c0 be a8 02 00 00 48 c7 c7 a0 63 2c c0 c6 05 18 6e 00 00 01 e8 bc 38 ff f5 e9 5b fe ff ff <0f> 0b e9 33 ff ff ff e8 4b 90 50 f6 e9 2d fe ff ff 48 89 df e8 de
[  341.227509] RSP: 0018:ffff88011086f408 EFLAGS: 00010202
[  341.227509] RAX: dffffc0000000000 RBX: 1ffff1002210de85 RCX: 0000000000000000
[  341.227509] RDX: 1ffff1002210de85 RSI: ffff880110813be8 RDI: ffffed002210de58
[  341.227509] RBP: ffff88011086f4d0 R08: 0000000000000000 R09: 0000000000000000
[  341.227509] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1002210de81
[  341.227509] R13: ffff880110625a48 R14: ffff880114cec8c8 R15: 0000000000000014
[  341.227509] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
[  341.227509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  341.227509] CR2: 00007f11fd38e000 CR3: 000000013ca16000 CR4: 00000000001006e0
[  341.227509] Call Trace:
[  341.227509]  ? __clusterip_config_find+0x460/0x460 [ipt_CLUSTERIP]
[  341.227509]  ? default_device_exit+0x1ca/0x270
[  341.227509]  ? remove_proc_entry+0x1cd/0x390
[  341.227509]  ? dev_change_net_namespace+0xd00/0xd00
[  341.227509]  ? __init_waitqueue_head+0x130/0x130
[  341.227509]  ops_exit_list.isra.10+0x94/0x140
[  341.227509]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 613d0776d3 ("netfilter: exit_net cleanup check added")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 01:18:09 +01:00
Taehee Yoo 5a86d68bcf netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine
When network namespace is destroyed, cleanup_net() is called.
cleanup_net() holds pernet_ops_rwsem then calls each ->exit callback.
So that clusterip_tg_destroy() is called by cleanup_net().
And clusterip_tg_destroy() calls unregister_netdevice_notifier().

But both cleanup_net() and clusterip_tg_destroy() hold same
lock(pernet_ops_rwsem). hence deadlock occurrs.

After this patch, only 1 notifier is registered when module is inserted.
And all of configs are added to per-net list.

test commands:
   %ip netns add vm1
   %ip netns exec vm1 bash
   %ip link set lo up
   %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
	-j CLUSTERIP --new --hashmode sourceip \
	--clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
   %exit
   %ip netns del vm1

splat looks like:
[  341.809674] ============================================
[  341.809674] WARNING: possible recursive locking detected
[  341.809674] 4.19.0-rc5+ #16 Tainted: G        W
[  341.809674] --------------------------------------------
[  341.809674] kworker/u4:2/87 is trying to acquire lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x8c/0x460
[  341.809674]
[  341.809674] but task is already holding lock:
[  341.809674] 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] other info that might help us debug this:
[  341.809674]  Possible unsafe locking scenario:
[  341.809674]
[  341.809674]        CPU0
[  341.809674]        ----
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]   lock(pernet_ops_rwsem);
[  341.809674]
[  341.809674]  *** DEADLOCK ***
[  341.809674]
[  341.809674]  May be due to missing lock nesting notation
[  341.809674]
[  341.809674] 3 locks held by kworker/u4:2/87:
[  341.809674]  #0: 00000000d9df6c92 ((wq_completion)"%s""netns"){+.+.}, at: process_one_work+0xafe/0x1de0
[  341.809674]  #1: 00000000c2cbcee2 (net_cleanup_work){+.+.}, at: process_one_work+0xb60/0x1de0
[  341.809674]  #2: 000000005da2d519 (pernet_ops_rwsem){++++}, at: cleanup_net+0x119/0x900
[  341.809674]
[  341.809674] stack backtrace:
[  341.809674] CPU: 1 PID: 87 Comm: kworker/u4:2 Tainted: G        W         4.19.0-rc5+ #16
[  341.809674] Workqueue: netns cleanup_net
[  341.809674] Call Trace:
[ ... ]
[  342.070196]  down_write+0x93/0x160
[  342.070196]  ? unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? down_read+0x1e0/0x1e0
[  342.070196]  ? sched_clock_cpu+0x126/0x170
[  342.070196]  ? find_held_lock+0x39/0x1c0
[  342.070196]  unregister_netdevice_notifier+0x8c/0x460
[  342.070196]  ? register_netdevice_notifier+0x790/0x790
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? __local_bh_enable_ip+0xe9/0x1b0
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.070196]  ? trace_hardirqs_on+0x93/0x210
[  342.070196]  ? __bpf_trace_preemptirq_template+0x10/0x10
[  342.070196]  ? clusterip_tg_destroy+0x372/0x650 [ipt_CLUSTERIP]
[  342.123094]  clusterip_tg_destroy+0x3ad/0x650 [ipt_CLUSTERIP]
[  342.123094]  ? clusterip_net_init+0x3d0/0x3d0 [ipt_CLUSTERIP]
[  342.123094]  ? cleanup_match+0x17d/0x200 [ip_tables]
[  342.123094]  ? xt_unregister_table+0x215/0x300 [x_tables]
[  342.123094]  ? kfree+0xe2/0x2a0
[  342.123094]  cleanup_entry+0x1d5/0x2f0 [ip_tables]
[  342.123094]  ? cleanup_match+0x200/0x200 [ip_tables]
[  342.123094]  __ipt_unregister_table+0x9b/0x1a0 [ip_tables]
[  342.123094]  iptable_filter_net_exit+0x43/0x80 [iptable_filter]
[  342.123094]  ops_exit_list.isra.10+0x94/0x140
[  342.123094]  cleanup_net+0x45b/0x900
[ ... ]

Fixes: 202f59afd4 ("netfilter: ipt_CLUSTERIP: do not hold dev")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 01:17:59 +01:00
Phil Sutter 241faeceb8 netfilter: nf_tables: Speed up selective rule dumps
If just a table name was given, nf_tables_dump_rules() continued over
the list of tables even after a match was found. The simple fix is to
exit the loop if it reached the bottom and ctx->table was not NULL.

When iterating over the table's chains, the same problem as above
existed. But worse than that, if a chain name was given the hash table
wasn't used to find the corresponding chain. Fix this by introducing a
helper function iterating over a chain's rules (and taking care of the
cb->args handling), then introduce a shortcut to it if a chain name was
given.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-18 00:08:53 +01:00
Alin Nastac 8294059931 netfilter: nf_nat_sip: fix RTP/RTCP source port translations
Each media stream negotiation between 2 SIP peers will trigger creation
of 4 different expectations (2 RTP and 2 RTCP):
 - INVITE will create expectations for the media packets sent by the
   called peer
 - reply to the INVITE will create expectations for media packets sent
   by the caller

The dport used by these expectations usually match the ones selected
by the SIP peers, but they might get translated due to conflicts with
another expectation. When such event occur, it is important to do
this translation in both directions, dport translation on the receiving
path and sport translation on the sending path.

This commit fixes the sport translation when the peer requiring it is
also the one that starts the media stream. In this scenario, first media
stream packet is forwarded from LAN to WAN and will rely on
nf_nat_sip_expected() to do the necessary sport translation. However, the
expectation matched by this packet does not contain the necessary information
for doing SNAT, this data being stored in the paired expectation created by
the sender's SIP message (INVITE or reply to it).

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:43:58 +01:00
Florian Westphal 5cbabeec1e netfilter: nat: remove nf_nat_l4proto struct
This removes the (now empty) nf_nat_l4proto struct, all its instances
and all the no longer needed runtime (un)register functionality.

nf_nat_need_gre() can be axed as well: the module that calls it (to
load the no-longer-existing nat_gre module) also calls other nat core
functions. GRE nat is now always available if kernel is built with it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:31 +01:00
Florian Westphal faec18dbb0 netfilter: nat: remove l4proto->manip_pkt
This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.

nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.

Next patch will remove the empty files and the nf_nat_l4proto
struct.

nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:29 +01:00
Florian Westphal 76b90019e0 netfilter: nat: remove l4proto->nlattr_to_range
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so
just call it directly.

The important difference is that we'll now also call it for
protocols that we don't support (i.e., nf_nat_proto_unknown did
not provide .nlattr_to_range).

However, there should be no harm, even icmp provided this callback.
If we don't implement a specific l4nat for this, nothing would make
use of this information, so adding a big switch/case construct listing
all supported l4protocols seems a bit pointless.

This change leaves a single function pointer in the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:23 +01:00
Florian Westphal fe2d002099 netfilter: nat: remove l4proto->in_range
With exception of icmp, all of the l4 nat protocols set this to
nf_nat_l4proto_in_range.

Get rid of this and just check the l4proto in the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:14 +01:00
Florian Westphal 40e786bd29 netfilter: nat: fold in_range indirection into caller
No need for indirections here, we only support ipv4 and ipv6
and the called functions are very small.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:09 +01:00
Florian Westphal 203f2e7820 netfilter: nat: remove l4proto->unique_tuple
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple.
The static-save of old incarnation of resolved key in gre and icmp is
removed as well, just use the prandom based offset like the others.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:04 +01:00
Florian Westphal 716b23c19e netfilter: nat: un-export nf_nat_l4proto_unique_tuple
almost all l4proto->unique_tuple implementations just call this helper,
so make ->unique_tuple() optional and call its helper directly if the
l4proto doesn't override it.

This is an intermediate step to get rid of ->unique_tuple completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:57 +01:00
Florian Westphal 912da924a2 netfilter: remove NF_NAT_RANGE_PROTO_RANDOM support
Historically this was net_random() based, and was then converted to
a hash based algorithm (private boot seed + hash of endpoint addresses)
due to concerns of leaking net_random() bits.

RANDOM_FULLY mode was added later to avoid problems with hash
based mode (see commit 34ce324019,
"netfilter: nf_nat: add full port randomization support" for details).

Just make prandom_u32() the default search starting point and get rid of
->secure_port() altogether.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:36 +01:00
Yafang Shao df7043bed4 netfilter: remove unused parameters in nf_ct_l4proto_[un]register_sysctl()
These parameters aren't used now.
So remove them.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:30 +01:00
Florian Westphal a504b703bb netfilter: nat: limit port clash resolution attempts
In case almost or all available ports are taken, clash resolution can
take a very long time, resulting in soft lockup.

This can happen when many to-be-natted hosts connect to same
destination:port (e.g. a proxy) and all connections pass the same SNAT.

Pick a random offset in the acceptable range, then try ever smaller
number of adjacent port numbers, until either the limit is reached or a
useable port was found.  This results in at most 248 attempts
(128 + 64 + 32 + 16 + 8, i.e. 4 restarts with new search offset)
instead of 64000+,

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:08 +01:00
Xiaozhou Liu b635cbf68f netfilter: nat: remove unnecessary 'else if' branch
Since a pseudo-random starting point is used in finding a port in
the default case, that 'else if' branch above is no longer a necessity.
So remove it to simplify code.

Signed-off-by: Xiaozhou Liu <liuxiaozhou@bytedance.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:31:03 +01:00
Qian Cai 00ec3ab060 netfilter: ipset: replace a strncpy() with strscpy()
To make overflows as obvious as possible and to prevent code from blithely
proceeding with a truncated string. This also has a side-effect to fix a
compilation warning when using GCC 8.2.1.

net/netfilter/ipset/ip_set_core.c: In function 'ip_set_sockfn_get':
net/netfilter/ipset/ip_set_core.c:2027:3: warning: 'strncpy' writing 32 bytes into a region of size 2 overflows the destination [-Wstringop-overflow=]

Signed-off-by: Qian Cai <cai@gmx.us>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-14 00:04:08 +01:00
Florent Fourcot 8e350ce1f7 netfilter: ipset: fix ip_set_byindex function
New function added by "Introduction of new commands and protocol
version 7" is not working, since we return skb2 to user

Signed-off-by: Victorien Molle <victorien.molle@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-14 00:02:26 +01:00
Florian Westphal 6ed5943f87 netfilter: nat: remove l4 protocol port rovers
This is a leftover from days where single-cpu systems were common:
Store last port used to resolve a clash to use it as a starting point when
the next conflict needs to be resolved.

When we have parallel attempt to connect to same address:port pair,
its likely that both cores end up computing the same "available" port,
as both use same starting port, and newly used ports won't become
visible to other cores until the conntrack gets confirmed later.

One of the cores then has to drop the packet at insertion time because
the chosen new tuple turns out to be in use after all.

Lets simplify this: remove port rover and use a pseudo-random starting
point.

Note that this doesn't make netfilter default to 'fully random' mode;
the 'rover' was only used if NAT could not reuse source port as-is.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:42 +01:00
Pablo Neira Ayuso c3e9305983 netfilter: remove NFC_* cache bits
These are very very (for long time unused) caching infrastructure
definition, remove then. They have nothing to do with the NFC subsystem.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:32 +01:00
Paul E. McKenney c8d1da4000 netfilter: Replace call_rcu_bh(), rcu_barrier_bh(), and synchronize_rcu_bh()
Now that call_rcu()'s callback is not invoked until after bh-disable
regions of code have completed (in addition to explicitly marked
RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_bh().  Similarly, rcu_barrier() can be used in place of
rcu_barrier_bh() and synchronize_rcu() in place of synchronize_rcu_bh().
This commit therefore makes these changes.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:23 +01:00
Taehee Yoo b966098769 netfilter: nf_flow_table: simplify nf_flow_offload_gc_step()
nf_flow_offload_gc_step() and nf_flow_table_iterate() are very similar.
so that many duplicate code can be removed.
After this patch, nf_flow_offload_gc_step() is simple callback function of
nf_flow_table_iterate() like nf_flow_table_do_cleanup().

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 16:41:09 +01:00
Taehee Yoo 49de9c090f netfilter: nf_flow_table: make nf_flow_table_iterate() static
nf_flow_table_iterate() is local function, make it static.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 16:40:55 +01:00
Andreas Jaggi 58fc419be2 netfilter: ctnetlink: always honor CTA_MARK_MASK
Useful to only set a particular range of the conntrack mark while
leaving existing parts of the value alone, e.g. when updating
conntrack marks via netlink from userspace.

For NFQUEUE it was already implemented in commit 534473c608
("netfilter: ctnetlink: honor CTA_MARK_MASK when setting ctmark").

This now adds the same functionality also for the other netlink
conntrack mark changes.

Signed-off-by: Andreas Jaggi <andreas.jaggi@waterwave.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 10:25:00 +01:00
Pablo Neira Ayuso 1226cfe379 Merge branch 'master' of git://blackhole.kfki.hu/nf-next
Jozsef Kadlecsik says:

====================
- Introduction of new commands and thus protocol version 7. The
  new commands makes possible to eliminate the getsockopt interface
  of ipset and use solely netlink to communicate with the kernel.
  Due to the strict attribute checking both in user/kernel space,
  a new protocol number was introduced. Both the kernel/userspace is
  fully backward compatible.
- Make invalid MAC address checks consisten, from Stefano Brivio.
  The patch depends on the next one.
- Allow matching on destination MAC address for mac and ipmac sets,
  also from Stefano Brivio.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-12 10:13:59 +01:00
YueHaibing 30beabb3c3 net: phy: marvell: remove set but not used variable 'pause'
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/phy/marvell.c: In function 'm88e1510_config_init':
drivers/net/phy/marvell.c:850:7: warning:
 variable 'pause' set but not used [-Wunused-but-set-variable]

It not used any more after commit 3c1bcc8614 ("net: ethernet: Convert phydev
advertize and supported from u32 to link mode")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 18:19:50 -08:00
David S. Miller 2b9b7502df Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-11 17:57:54 -08:00
Linus Torvalds ccda4af0f4 Linux 4.20-rc2 2018-11-11 17:12:31 -06:00
Linus Torvalds 7a3765ed66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "One last pull request before heading to Vancouver for LPC, here we have:

   1) Don't forget to free VSI contexts during ice driver unload, from
      Victor Raj.

   2) Don't forget napi delete calls during device remove in ice driver,
      from Dave Ertman.

   3) Don't request VLAN tag insertion of ibmvnic device when SKB
      doesn't have VLAN tags at all.

   4) IPV4 frag handling code has to accomodate the situation where two
      threads try to insert the same fragment into the hash table at the
      same time. From Eric Dumazet.

   5) Relatedly, don't flow separate on protocol ports for fragmented
      frames, also from Eric Dumazet.

   6) Memory leaks in qed driver, from Denis Bolotin.

   7) Correct valid MTU range in smsc95xx driver, from Stefan Wahren.

   8) Validate cls_flower nested policies properly, from Jakub Kicinski.

   9) Clearing of stats counters in mc88e6xxx driver doesn't retain
      important bits in the G1_STATS_OP register causing the chip to
      hang. Fix from Andrew Lunn"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  act_mirred: clear skb->tstamp on redirect
  net: dsa: mv88e6xxx: Fix clearing of stats counters
  tipc: fix link re-establish failure
  net: sched: cls_flower: validate nested enc_opts_policy to avoid warning
  net: mvneta: correct typo
  flow_dissector: do not dissect l4 ports for fragments
  net: qualcomm: rmnet: Fix incorrect assignment of real_dev
  net: aquantia: allow rx checksum offload configuration
  net: aquantia: invalid checksumm offload implementation
  net: aquantia: fixed enable unicast on 32 macvlan
  net: aquantia: fix potential IOMMU fault after driver unbind
  net: aquantia: synchronized flow control between mac/phy
  net: smsc95xx: Fix MTU range
  net: stmmac: Fix RX packet size > 8191
  qed: Fix potential memory corruption
  qed: Fix SPQ entries not returned to pool in error flows
  qed: Fix blocking/unlimited SPQ entries leak
  qed: Fix memory/entry leak in qed_init_sp_request()
  inet: frags: better deal with smp races
  net: hns3: bugfix for not checking return value
  ...
2018-11-11 17:09:48 -06:00
Linus Torvalds e12e00e388 Kbuild fixes for v4.20
- fix build errors in binrpm-pkg and bindeb-pkg targets
 
  - fix false positive matches in merge_config.sh
 
  - fix build version mismatch in deb-pkg target
 
  - fix dtbs_install handling in (bin)deb-pkg target
 
  - revert a commit that allows setlocalversion to write to source tree
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJb6F89AAoJED2LAQed4NsGbO0P/0MIDTnZ3jG4TkoyZNc0TG0k
 WXsk/UxHqhbWbYOrnn7D2C6kqYu9qVLzZmixP1/Z3l6hKYs3NTDKaFOyanmoFNC2
 UsexSmXFOCPF9tzzRROGVi5pZwEA8wfUIuSy6S7UosVps2gjh10cefwiXGYSJljy
 crZusbxJcTJFTXzdAMMRI5R2EzUOPylgeoCAFaCZT6NLo6caYkgBDhCsKFjGTZ1e
 ooTEWEVwWv34S5mxuCCXmdpCikLcCOL2SmxCPBlj0SXhERb/HW1FlukiVY/mdth+
 B5K4E6huIIgrPMowReGE4kPD4sR6d1+Zsjn9gjfchrSuHkOSg3K0GnTlyPTz+hkc
 eij1XTl6qwlbHdYx9aPqo7kF34OIU6rf9OwsGp+/9fHLPRtHonyWM/kurrKW6wSV
 zBvFNTRfe9nTfffSPowW6KAvlr43wf40HC3zZN6KsMKznQ1F1Lvnd5VGNAg64ykt
 ex++W217qRrRvYpyJML0R1pAyEl6GFIqZ+J99iB6kmP/wyJeOQyflA/DKH79xR3B
 w1RXCqR5+4VwKFonkS99KY/H+zPwMH9NnClB8Y5iH18pmdhtwvk8bIfvmgQF/lBC
 g9nLXz856JiEwjO4MOUvhCqeHmgK4F1wNoXKCgbWobNLVVujaG0NQ71jyz3eYyri
 UrTgUM2axQ6gyeGIDSWy
 =Fj3R
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - fix build errors in binrpm-pkg and bindeb-pkg targets

 - fix false positive matches in merge_config.sh

 - fix build version mismatch in deb-pkg target

 - fix dtbs_install handling in (bin)deb-pkg target

 - revert a commit that allows setlocalversion to write to source tree

* tag 'kbuild-fixes-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  builddeb: Fix inclusion of dtbs in debian package
  Revert "scripts/setlocalversion: git: Make -dirty check more robust"
  kbuild: deb-pkg: fix too low build version number
  kconfig: merge_config: avoid false positive matches from comment lines
  kbuild: deb-pkg: fix bindeb-pkg breakage when O= is used
  kbuild: rpm-pkg: fix binrpm-pkg breakage when O= is used
2018-11-11 16:57:55 -06:00
Linus Torvalds 63a42e1a5c for-4.20-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvoGIUACgkQxWXV+ddt
 WDta6g//UJSLnVskCUwh8VyMdd47QArQnaLJowOH7wQn4Nqj+2hf04mCq/kv05ed
 OneTezzONZc/qW9fiJGS+Dp77ln4JIDA1hWHtb/A4t9pYlksSQllJ3oiDUVsCp3q
 2EbzrjuNz3iQO6TjKlaHX473CLCMQMXS2OXOUnCkF2maMJSdr86oi+j1UiSnud1/
 C7uMYM3hG8nkfEfjjb1COpkS2MmzYcPruF5RDcbT/WOUfylTsjjX1E7rK/ZEqS9P
 SUcp4uoZe9BNoyWMASLaM7oHE82day4X9MwQoCQFRcm0kq4CnRAZ8X4lBl+M70iW
 7Olii/wNZ2SRiJf3jac/rpxoBHvEskXTHyiHTEmdHp4n1L1pL9GzGYIePQcX7uV1
 Tb6ImdUUKCC//fPqyeB7cEk5yxqahmlFD3qZVs6GnQkzKrPE+ChLx+7PgcJC/XVh
 C5ogNmJm+NvFOuTrYk9zSXg85B8gWHescDJrvNKVizIjw3nKmqiC+dXZljhzw+p8
 HscK9EXsiS8jW9ClfJljXzIa4SeA/i7fQGe4tCKfIrCQ+OqUxWpFCEoxygchinfF
 Rw90fJ0jX083oXsnfFcVdQpQ+SLSKka/aIRMvi58WRgLU3trci5NNN4TFg8TYRKP
 xBDF/iF3sqXajc+xsjoqLhLioZL3Pa5VDNuhsFdois9M5JSRekU=
 =K14u
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Several fixes to recent release (4.19, fixes tagged for stable) and
  other fixes"

* tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix missing delayed iputs on unmount
  Btrfs: fix data corruption due to cloning of eof block
  Btrfs: fix infinite loop on inode eviction after deduplication of eof block
  Btrfs: fix deadlock on tree root leaf when finding free extent
  btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
  btrfs: tree-checker: Fix misleading group system information
  Btrfs: fix missing data checksums after a ranged fsync (msync)
  btrfs: fix pinned underflow after transaction aborted
  Btrfs: fix cur_offset in the error case for nocow
2018-11-11 16:54:38 -06:00
Linus Torvalds c140f8b072 A large number of ext4 bug fixes, mostly buffer and memory leaks on
error return cleanup paths.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvoFrEACgkQ8vlZVpUN
 gaMTSQf+Ogrvm7pfWtXf+RkmhhuyR26T+Hwxgl51m5bKetJBjEsh0qOaIfo7etwG
 aLc1x/pWng2VTCHk4z0Ij9KS8YwLK3sQCBYZoJFyT/R09yGgAhLm+xP5j38WLqrX
 h4GxVgekHSATkG95N/So7F7pQiz7gDowgbaYFW3PooXPoHJnCnTzcr7TGFAQBZAw
 iR+8+KtH5E8IcC7Jj40nemk7Wib45DgaeGpP5P9Ct/Jw7hW+Mwhf56NYOWkLdHyy
 4Kt7rm1Sbxam8k3nksNmIwx28bw+S0Ew1zZgkwgAcKcHaWdrv3TtGPkOA26AH+S3
 UVeORM7xH+zXslIOyFK+7sXUZr5LiQ==
 =BaBl
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A large number of ext4 bug fixes, mostly buffer and memory leaks on
  error return cleanup paths"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: missing !bh check in ext4_xattr_inode_write()
  ext4: fix buffer leak in __ext4_read_dirblock() on error path
  ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
  ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
  ext4: release bs.bh before re-using in ext4_xattr_block_find()
  ext4: fix buffer leak in ext4_xattr_get_block() on error path
  ext4: fix possible leak of s_journal_flag_rwsem in error path
  ext4: fix possible leak of sbi->s_group_desc_leak in error path
  ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
  ext4: avoid possible double brelse() in add_new_gdb() on error path
  ext4: avoid buffer leak in ext4_orphan_add() after prior errors
  ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
  ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
  ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
  ext4: add missing brelse() update_backups()'s error path
  ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
  ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
  ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
2018-11-11 16:53:02 -06:00
Linus Torvalds b6df7b6db1 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A set of x86 fixes:

   - Cure the LDT remapping to user space on 5 level paging which ended
     up in the KASLR space

   - Remove LDT mapping before freeing the LDT pages

   - Make NFIT MCE handling more robust

   - Unbreak the VSMP build by removing the dependency on paravirt ops

   - Support broken PIT emulation on Microsoft hyperV

   - Don't trace vmware_sched_clock() to avoid tracer recursion

   - Remove -pipe from KBUILD CFLAGS which breaks clang and is also
     slower on GCC

   - Trivial coding style and typo fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/vmware: Do not trace vmware_sched_clock()
  x86/vsmp: Remove dependency on pv_irq_ops
  x86/ldt: Remove unused variable in map_ldt_struct()
  x86/ldt: Unmap PTEs for the slot before freeing LDT pages
  x86/mm: Move LDT remap out of KASLR region on 5-level paging
  acpi/nfit, x86/mce: Validate a MCE's address before using it
  acpi/nfit, x86/mce: Handle only uncorrectable machine checks
  x86/build: Remove -pipe from KBUILD_CFLAGS
  x86/hyper-v: Fix indentation in hv_do_fast_hypercall16()
  Documentation/x86: Fix typo in zero-page.txt
  x86/hyper-v: Enable PIT shutdown quirk
  clockevents/drivers/i8253: Add support for PIT shutdown quirk
2018-11-11 16:41:50 -06:00
Linus Torvalds 655c6b9777 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A bunch of perf tooling fixes:

   - Make the Intel PT SQL viewer more robust

   - Make the Intel PT debug log more useful

   - Support weak groups in perf record so it's behaving the same way as
     perf stat

   - Display the LBR stats in callchain entries properly in perf top

   - Handle different PMu names with common prefix properlin in pert
     stat

   - Start syscall augmenting in perf trace. Preparation for
     architecture independent eBPF instrumentation of syscalls.

   - Fix build breakage in JVMTI perf lib

   - Fix arm64 tools build failure wrt smp_load_{acquire,release}"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Do not zero sample_id_all for group members
  perf tools: Fix undefined symbol scnprintf in libperf-jvmti.so
  perf beauty: Use SRCARCH, ARCH=x86_64 must map to "x86" to find the headers
  perf intel-pt: Add MTC and CYC timestamps to debug log
  perf intel-pt: Add more event information to debug log
  perf scripts python: exported-sql-viewer.py: Fix table find when table re-ordered
  perf scripts python: exported-sql-viewer.py: Add help window
  perf scripts python: exported-sql-viewer.py: Add Selected branches report
  perf scripts python: exported-sql-viewer.py: Fall back to /usr/local/lib/libxed.so
  perf top: Display the LBR stats in callchain entry
  perf stat: Handle different PMU names with common prefix
  perf record: Support weak groups
  perf evlist: Move perf_evsel__reset_weak_group into evlist
  perf augmented_syscalls: Start collecting pathnames in the BPF program
  perf trace: Fix setting of augmented payload when using eBPF + raw_syscalls
  perf trace: When augmenting raw_syscalls plug raw_syscalls:sys_exit too
  perf examples bpf: Start augmenting raw_syscalls:sys_{start,exit}
  tools headers barrier: Fix arm64 tools build failure wrt smp_load_{acquire,release}
2018-11-11 16:39:12 -06:00
Linus Torvalds 08b5278650 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "Just the removal of a redundant call into the sched deadline overrun
  check"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Remove useless call to check_dl_overrun()
2018-11-11 16:37:41 -06:00
Linus Torvalds 024d4d4c0c Merge branch 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two small scheduler fixes:

   - Take hotplug lock in sched_init_smp(). Technically not really
     required, but lockdep will complain other.

   - Trivial comment fix in sched/fair"

* 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix a comment in task_numa_fault()
  sched/core: Take the hotplug lock in sched_init_smp()
2018-11-11 16:33:00 -06:00
Linus Torvalds 1acf93ca6c Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking build fix from Thomas Gleixner:
 "A single fix for a build fail with CONFIG_PROFILE_ALL_BRANCHES=y in
  the qspinlock code"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/qspinlock: Fix compile error
2018-11-11 16:18:10 -06:00
Linus Torvalds 0b002cdd50 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A couple of fixlets for the core:

   - Kernel doc function documentation fixes

   - Missing prototypes for weak watchdog functions"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  resource/docs: Complete kernel-doc style function documentation
  watchdog/core: Add missing prototypes for weak functions
  resource/docs: Fix new kernel-doc warnings
2018-11-11 16:14:05 -06:00
Heiner Kallweit 9206eb0bc5 PCI: add USR vendor id and use it in r8169 and w6692 driver
The PCI vendor id of U.S. Robotics isn't defined in pci_ids.h so far,
only ISDN driver w6692 has a private definition. Move the definition
to pci_ids.h and use it in the r8169 driver too.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 14:00:05 -08:00
Eric Dumazet 48872c11b7 net_sched: sch_fq: add dctcp-like marking
Similar to 80ba92fa1a ("codel: add ce_threshold attribute")

After EDT adoption, it became easier to implement DCTCP-like CE marking.

In many cases, queues are not building in the network fabric but on
the hosts themselves.

If packets leaving fq missed their Earliest Departure Time by XXX usec,
we mark them with ECN CE. This gives a feedback (after one RTT) to
the sender to slow down and find better operating mode.

Example :

tc qd replace dev eth0 root fq ce_threshold 2.5ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:59:21 -08:00
Eric Dumazet c73e5807e4 tcp: tsq: no longer use limit_output_bytes for paced flows
FQ pacing guarantees that paced packets queued by one flow do not
add head-of-line blocking for other flows.

After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
since this maps to 16 skbs at most in qdisc or device queues.
(or slightly more if some drivers lower {gso_max_segs|size})

We still can queue at most 1 ms worth of traffic (this can be scaled
by wifi drivers if they need to)

Tested:

# ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
tx-usecs: 16
tx-frames: 16
# tc qdisc replace dev eth0 root fq
# for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done

Before patch:
27711
26118
27107
27377
27712
27388
27340
27117
27278
27509

After patch:
37434
36949
36658
36998
37711
37291
37605
36659
36544
37349

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:57:03 -08:00
David S. Miller 83afb36a70 Merge branch 'tcp-tso-defer-improvements'
Eric Dumazet says:

====================
tcp: tso defer improvements

This series makes tcp_tso_should_defer() a bit smarter :

1) MSG_EOR gives a hint to TCP to not defer some skbs

2) Second patch takes into account that head tstamp
   can be in the future.

3) Third patch uses existing high resolution state variables
   to have a more precise heuristic.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:54:54 -08:00
Eric Dumazet a682850a11 tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies
tcp_tso_should_defer() first heuristic is to not defer
if last send is "old enough".

Its current implementation uses jiffies and its low granularity.

TSO autodefer performance should not rely on kernel HZ :/

After EDT conversion, we have state variables in nanoseconds that
can allow us to properly implement the heuristic.

This patch increases TSO chunk sizes on medium rate flows,
especially when receivers do not use GRO or similar aggregation.

It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
behavior more uniform.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:54:53 -08:00
Eric Dumazet f1c6ea3827 tcp: refine tcp_tso_should_defer() after EDT adoption
tcp_tso_should_defer() last step tries to check if the probable
next ACK packet is coming in less than half rtt.

Problem is that the head->tstamp might be in the future,
so we need to use signed arithmetics to avoid overflows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:54:53 -08:00
Eric Dumazet 1c09f7d073 tcp: do not try to defer skbs with eor mark (MSG_EOR)
Applications using MSG_EOR are giving a strong hint to TCP stack :

Subsequent sendmsg() can not append more bytes to skbs having
the EOR mark.

Do not try to TSO defer suchs skbs, there is really no hope.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 13:54:53 -08:00
Yafang Shao 5e13a0d3f5 tcp: minor optimization in tcp ack fast path processing
Bitwise operation is a little faster.
So I replace after() with using the flag FLAG_SND_UNA_ADVANCED as it is
already set before.

In addtion, there's another similar improvement in tcp_cwnd_reduction().

Cc: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-11 10:24:18 -08:00