remarkable-linux/net
Tejun Heo 62f6341c85 cgroup, net_cls: iterate the fds of only the tasks which are being migrated
commit a05d4fd917 upstream.

The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is a per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51 ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is borne by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration target doesn't have many
fds to scan, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61 ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those.  The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().
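
The difference in cost can be illustrated with a small userspace model.
This is only a sketch, not the kernel code: every model_* name below is
hypothetical, plain arrays stand in for fd tables and css task lists,
and no locking is modelled.  It contrasts an attach path that rescans
every task in the target group with one that walks only the tasks being
migrated, and returns the number of fds each path visits.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_TASKS 8

/* Hypothetical stand-ins; none of these types exist in the kernel. */
struct model_sock { unsigned int classid; };

struct model_task {
	struct model_sock *socks;	/* stands in for the task's fd table */
	size_t nr_socks;
};

struct model_cgroup {
	unsigned int classid;
	struct model_task *tasks[MODEL_MAX_TASKS];
	size_t nr_tasks;
};

struct model_taskset {			/* the tasks being migrated */
	struct model_task *tasks[MODEL_MAX_TASKS];
	size_t nr_tasks;
};

/* Stamp classid on every socket of one task; return the fds visited. */
size_t model_update_task(struct model_task *t, unsigned int classid)
{
	size_t i;

	for (i = 0; i < t->nr_socks; i++)
		t->socks[i].classid = classid;
	return t->nr_socks;
}

/* Append the migrating tasks to the destination group's task list. */
void model_move_tasks(struct model_cgroup *dst,
		      const struct model_taskset *tset)
{
	size_t i;

	assert(dst->nr_tasks + tset->nr_tasks <= MODEL_MAX_TASKS);
	for (i = 0; i < tset->nr_tasks; i++)
		dst->tasks[dst->nr_tasks++] = tset->tasks[i];
}

/* Old behaviour: after the move, rescan every task in the target group. */
size_t model_attach_old(struct model_cgroup *dst,
			const struct model_taskset *tset)
{
	size_t i, visited = 0;

	model_move_tasks(dst, tset);
	for (i = 0; i < dst->nr_tasks; i++)
		visited += model_update_task(dst->tasks[i], dst->classid);
	return visited;
}

/* New behaviour: scan only the tasks that were actually migrated. */
size_t model_attach_new(struct model_cgroup *dst,
			const struct model_taskset *tset)
{
	size_t i, visited = 0;

	model_move_tasks(dst, tset);
	for (i = 0; i < tset->nr_tasks; i++)
		visited += model_update_task(tset->tasks[i], dst->classid);
	return visited;
}
```

In this model, moving a task with 2 sockets into a group whose single
resident task holds 1000 sockets makes model_attach_old() visit 1002
entries while model_attach_new() visits 2.  The resident sockets are
left untouched by the new path, which is fine for migration because
their classid does not change.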

Reported-by: David Goode <dgoode@fb.com>
Fixes: 3b13758f51 ("cgroups: Allow dynamically changing net_classid")
Cc: Nina Schiff <ninasc@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-30 09:41:27 +02:00
6lowpan
9p IB/core: add support to create a unsafe global rkey to ib_create_pd 2016-09-23 13:47:44 -04:00
802
8021q net: add recursion limit to GRO 2016-10-20 14:32:22 -04:00
appletalk
atm
ax25 ax25: Fix segfault after sock connection timeout 2017-02-04 09:47:09 +01:00
batman-adv batman-adv: Check for alloc errors when preparing TT local data 2016-12-02 10:46:59 +01:00
bluetooth Bluetooth: Fix using the correct source address type 2016-11-22 22:50:46 +01:00
bridge bridge: drop netfilter fake rtable unconditionally 2017-03-22 12:43:34 +01:00
caif net: caif: remove ineffective check 2016-12-05 14:48:48 -05:00
can can: Fix kernel panic at security_sock_rcv_skb 2017-02-18 15:11:40 +01:00
ceph ceph: update readpages osd request according to size of pages 2017-03-12 06:41:53 +01:00
core cgroup, net_cls: iterate the fds of only the tasks which are being migrated 2017-03-30 09:41:27 +02:00
dcb net: dcb: set error code on failures 2016-12-03 23:54:25 -05:00
dccp dccp: fix memory leak during tear-down of unsuccessful connection request 2017-03-22 12:43:35 +01:00
decnet
dns_resolver
dsa net: dsa: Do not destroy invalid network devices 2017-02-18 15:11:43 +01:00
ethernet net: introduce device min_header_len 2017-02-18 15:11:43 +01:00
hsr net/hsr: Remove unused but set variable 2016-10-18 10:28:18 -04:00
ieee802154
ipv4 tcp: initialize icsk_ack.lrcvtime at session start time 2017-03-30 09:41:22 +02:00
ipv6 ipv6: make sure to initialize sockc.tsflags before first use 2017-03-30 09:41:22 +02:00
ipx
irda irda: Fix lockdep annotations in hashbin_delete(). 2017-02-26 11:10:51 +01:00
iucv net/af_iucv: don't use paged skbs for TX on HiperSockets 2017-01-19 20:18:04 +01:00
kcm kcm: fix a null pointer dereference in kcm_sendmsg() 2017-02-26 11:10:50 +01:00
key
l2tp l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv 2017-03-22 12:43:32 +01:00
l3mdev
lapb
llc net/llc: avoid BUG_ON() in skb_orphan() 2017-02-26 11:10:50 +01:00
mac80211 mac80211: use driver-indicated transmitter STA only for data frames 2017-03-15 10:02:48 +08:00
mac802154
mpls mpls: Do not decrement alive counter for unregister events 2017-03-22 12:43:34 +01:00
ncsi net/ncsi: Improve HNCDSC AEN handler 2016-10-20 11:23:08 -04:00
netfilter netfilter: conntrack: refine gc worker heuristics, redux 2017-03-12 06:41:53 +01:00
netlabel
netlink netlink: Do not schedule work from sk_destruct 2016-12-05 19:43:42 -05:00
netrom
nfc
openvswitch openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD 2017-03-30 09:41:21 +02:00
packet net: don't call strlen() on the user buffer in packet_bind_spkt() 2017-03-22 12:43:32 +01:00
phonet
qrtr
rds RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net 2016-12-02 13:29:26 -05:00
rfkill
rose
rxrpc rxrpc: Fix checking of error from ip6_route_output() 2016-10-13 08:43:17 +01:00
sched act_connmark: avoid crashing on malformed nlattrs with null parms 2017-03-22 12:43:34 +01:00
sctp tcp: don't annotate mark on control socket from tcp_v6_send_response() 2017-02-18 15:11:44 +01:00
strparser strparser: destroy workqueue on module exit 2017-03-22 12:43:33 +01:00
sunrpc xprtrdma: Squelch kbuild sparse complaint 2017-03-26 13:05:57 +02:00
switchdev switchdev: Execute bridge ndos only for bridge ports 2016-10-19 10:58:04 -04:00
tipc tipc: check minimum bearer MTU 2016-12-02 14:03:20 -05:00
unix net: unix: properly re-increment inflight counter of GC discarded candidates 2017-03-30 09:41:21 +02:00
vmw_vsock vsock/virtio: fix src/dst cid format 2017-01-09 08:32:23 +01:00
wimax
wireless nl80211: Fix mesh HT operation check 2017-02-14 15:25:37 -08:00
x25
xfrm xfrm_user: fix return value from xfrm_user_rcv_msg 2016-11-30 10:58:53 +01:00
compat.c
Kconfig
Makefile
socket.c net: socket: fix recvmmsg not returning error from sock_error 2017-02-26 11:10:51 +01:00
sysctl_net.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2016-10-06 09:52:23 -07:00