remarkable-linux/net
Nikolay Aleksandrov 24b9bf43e9 net: fix for a race condition in the inet frag code
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).

What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.

I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.

The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.

*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.

An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put.  All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.

An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.

The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4 ("net: frag, move LRU list maintenance outside of
rwlock")

CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:31:42 -05:00
..
9p 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers 2014-02-10 17:48:54 -08:00
802
8021q
appletalk
atm
ax25
batman-adv batman-adv: fix potential kernel paging error for unicast transmissions 2014-02-17 17:17:02 +01:00
bluetooth Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid 2014-02-18 16:29:46 -08:00
bridge bridge: Prevent possible race condition in br_fdb_change_mac_address 2014-02-10 14:34:34 -08:00
caif net: Include appropriate header file in caif/cfsrvl.c 2014-02-09 17:32:49 -08:00
can can: remove CAN FD compatibility for CAN 2.0 sockets 2014-03-03 14:29:52 +01:00
ceph libceph: do not dereference a NULL bio pointer 2014-02-07 11:37:07 -08:00
core neigh: recompute reachabletime before returning from neigh_periodic_work() 2014-02-27 18:21:17 -05:00
dcb
dccp dccp: re-enable debug macro 2014-02-16 23:45:00 -05:00
decnet net: Move prototype declaration to header file include/net/dn.h from net/decnet/af_decnet.c 2014-02-09 17:32:49 -08:00
dns_resolver
dsa
ethernet
hsr hsr: off by one sanity check in hsr_register_frame_in() 2014-03-03 15:29:42 -05:00
ieee802154 6lowpan: fix lockdep splats 2014-02-10 17:51:29 -08:00
ipv4 net: fix for a race condition in the inet frag code 2014-03-05 20:31:42 -05:00
ipv6 ipv6: ipv6_find_hdr restore prev functionality 2014-02-27 18:27:26 -05:00
ipx net: Move prototype declaration to header file include/net/net_namespace.h from net/ipx/af_ipx.c 2014-02-09 17:32:50 -08:00
irda
iucv
key
l2tp
lapb
llc
mac80211 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2014-03-03 14:34:45 -05:00
mac802154
mpls
netfilter netfilter: ctnetlink: force null nat binding on insert 2014-02-18 00:13:51 +01:00
netlabel
netlink net: Fix permission check in netlink_connect() 2014-02-25 18:35:14 -05:00
netrom
nfc NFC: NCI: Fix NULL pointer dereference 2014-02-23 23:14:45 +01:00
openvswitch openvswitch: Suppress error messages on megaflow updates 2014-02-04 22:32:38 -08:00
packet af_packet: remove a stray tab in packet_set_ring() 2014-02-18 18:02:25 -05:00
phonet
rds
rfkill
rose
rxrpc
sched sch_tbf: Fix potential memory leak in tbf_change(). 2014-02-27 12:53:50 -05:00
sctp net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable 2014-03-03 16:39:36 -05:00
sunrpc NFS client bugfixes for Linux 3.14 2014-02-19 12:13:02 -08:00
tipc tipc: make bearer set up in module insertion stage 2014-02-22 00:00:15 -05:00
unix
vmw_vsock
wimax
wireless cfg80211: regulatory: reset regdomain in case of error 2014-02-25 16:27:04 +01:00
x25
xfrm xfrm: Fix unlink race when policies are deleted. 2014-02-26 09:52:02 +01:00
compat.c x86, x32: Correct invalid use of user timespec in the kernel 2014-01-30 18:44:13 -08:00
Kconfig
Makefile
nonet.c
socket.c
sysctl_net.c