alistair23-linux/net
Martin KaFai Lau 40a1227ea8 tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()".  The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").

The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case.  However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.

When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie.  For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
   was updated.
2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
   and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
   There are other ordering that will trigger the similar situation
   below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
   "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
   all syncookies sent by sk1 will be handled (and rejected)
   by sk2 while sk1 is still alive.

The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).

The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map.  Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.

A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[]  OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group.  However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.

This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.

"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group.  Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.

A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed.  Before and
after the change is ~9Mpps in IPv4.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:45 +02:00
..
6lowpan 6lowpan: iphc: reset mac_header after decompress to fix panic 2018-07-06 12:32:12 +02:00
9p net:mod: remove unneeded variable 'ret' in init_p9 2018-08-08 09:40:44 -07:00
802 treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
8021q net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
appletalk Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
atm net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
ax25 ax25: remove blank line at EOF 2018-07-24 14:10:42 -07:00
batman-adv Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux 2018-07-20 21:17:12 -07:00
bluetooth Bluetooth: hidp: buffer overflow in hidp_process_report 2018-08-01 09:12:35 +02:00
bpf bpf/test_run: support cgroup local storage 2018-08-03 00:47:32 +02:00
bpfilter bpfilter: remove trailing newline 2018-07-24 14:10:42 -07:00
bridge net/bridge/br_multicast: remove redundant variable "err" 2018-08-06 10:33:44 -07:00
caif net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
can Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
ceph The main piece is a set of libceph changes that revamps how OSD 2018-06-15 07:24:58 +09:00
core tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket 2018-08-11 01:58:45 +02:00
dcb net: dcb: Add priority-to-DSCP map getters 2018-07-27 13:17:50 -07:00
dccp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
decnet decnet: whitespace fixes 2018-07-24 14:10:42 -07:00
dns_resolver net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
dsa Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
ethernet net: Convert GRO SKB handling to list_head. 2018-06-26 11:33:04 +09:00
hsr net: hsr: Convert timers to use timer_setup() 2017-10-25 13:00:27 +09:00
ieee802154 net: ieee802154: 6lowpan: remove redundant pointers 'fq' and 'net' 2018-08-06 11:21:15 +02:00
ife net: sched: ife: check on metadata length 2018-04-22 21:12:00 -04:00
ipv4 ipv4: frags: precedence bug in ip_expire() 2018-08-06 13:15:12 -07:00
ipv6 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
iucv net:af_iucv: get rid of the unneeded variable 'err' in afiucv_pm_freeze 2018-08-08 09:39:36 -07:00
kcm net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-07-27 09:33:37 -07:00
l2tp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
l3mdev
lapb treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts 2017-11-21 16:35:54 -08:00
llc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
mac80211 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
mac802154 net: mac802154: tx: expand tailroom if necessary 2018-08-06 11:21:37 +02:00
mpls mpls: remove trailing whitepace 2018-07-24 14:10:42 -07:00
ncsi net/ncsi: Use netdev_dbg for debug messages 2018-06-20 07:26:58 +09:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2018-08-05 16:25:22 -07:00
netlabel audit: use inline function to get audit context 2018-05-14 17:24:18 -04:00
netlink Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
netrom Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
nfc net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
nsh nsh: set mac len based on inner packet 2018-07-12 16:55:29 -07:00
openvswitch Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
packet Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
phonet Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
psample MAINTAINERS: Update Yotam's E-mail 2017-11-01 12:19:03 +09:00
qrtr net: qrtr: Reset the node and port ID of broadcast messages 2018-07-05 20:20:03 +09:00
rds RDS: IB: fix 'passing zero to ERR_PTR()' warning 2018-08-07 13:19:45 -07:00
rfkill rfkill: Create rfkill-none LED trigger 2018-05-23 11:26:45 +02:00
rose Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
rxrpc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
sched net: sched: cls_flower: set correct offload data in fl_reoffload 2018-08-07 12:35:17 -07:00
sctp sctp: whitespace fixes 2018-07-24 14:10:42 -07:00
smc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
strparser strparser: remove redundant variable 'rd_desc' 2018-08-01 10:00:06 -07:00
sunrpc net: Remove some unneeded semicolon 2018-08-04 13:05:39 -07:00
switchdev net: bridge: Add/del switchdev object on host join/leave 2017-11-10 13:41:40 +09:00
tipc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
tls net/tls: Mark the end in scatterlist table 2018-08-05 17:13:58 -07:00
unix af_unix: ensure POLLOUT on remote close() for connected dgram socket 2018-08-03 16:44:19 -07:00
vmw_vsock vsock: split dwork to avoid reinitializations 2018-08-07 12:39:13 -07:00
wimax wimax: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
wireless Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
x25 x25: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
xdp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
xfrm Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
compat.c net: avoid unnecessary sock_flag() check when enable timestamp 2018-08-06 10:42:48 -07:00
Kconfig net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
Makefile bpfilter: check compiler capability in Kconfig 2018-06-28 13:36:39 +09:00
socket.c Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
sysctl_net.c net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00