remarkable-linux/net/ipv6
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-07-03 14:55:13 -07:00
addrconf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-07-03 14:55:13 -07:00
addrconf_core.c ipv6: add include file to suppress sparse warnings 2013-06-25 02:44:05 -07:00
addrlabel.c rtnetlink: Remove passing of attributes into rtnl_doit functions 2013-03-22 10:31:16 -04:00
af_inet6.c net: ipv6: Add IPv6 support to the ping socket. 2013-05-25 21:07:49 -07:00
ah6.c net: Add skb_unclone() helper function. 2013-02-15 15:10:37 -05:00
anycast.c net: proc: change proc_net_remove to remove_proc_entry 2013-02-18 14:53:08 -05:00
datagram.c net: ipv6: Unify {raw,udp}6_sock_seq_show. 2013-06-04 12:56:14 -07:00
esp6.c ah6/esp6: set transport header correctly for IPsec tunnel mode. 2013-01-08 12:41:30 +01:00
exthdrs.c ipv6: Store Router Alert option in IP6CB directly. 2013-01-13 20:17:14 -05:00
exthdrs_core.c ipv6: Correct comparisons and calculations using skb->tail and skb-transport_header 2013-05-28 23:49:07 -07:00
exthdrs_offload.c ipv6: Pull IPv6 GSO registration out of the module 2012-11-15 17:39:24 -05:00
fib6_rules.c ipv6: introduce ip6_rt_put() 2012-11-03 14:59:05 -04:00
icmp.c net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
inet6_connection_sock.c ipv6: use newly introduced __ipv6_addr_needs_scope_id and ipv6_iface_scope_id 2013-03-08 12:29:22 -05:00
inet6_hashtables.c soreuseport: TCP/IPv6 implementation 2013-01-23 13:44:01 -05:00
ip6_checksum.c ipv6: move csum_ipv6_magic() and udp6_csum_init() into static library 2013-01-08 17:56:10 -08:00
ip6_fib.c net: ipv6 eliminate parameter "int addrlen" in function fib6_add_1 2013-07-24 14:24:56 -07:00
ip6_flowlabel.c ipv6 flowlabel: add __rcu annotations 2013-03-07 16:33:10 -05:00
ip6_gre.c ipv6,gre: do not leak info to user-space 2013-05-11 17:40:14 -07:00
ip6_icmp.c ipv6: Kill ipv6 dependency of icmpv6_send(). 2013-04-29 13:54:36 -04:00
ip6_input.c ipv6: don't accept node local multicast traffic from the wire 2013-03-29 14:57:33 -04:00
ip6_offload.c MPLS: Add limited GSO support 2013-05-27 22:50:59 -07:00
ip6_offload.h ipv6: Pull IPv6 GSO registration out of the module 2012-11-15 17:39:24 -05:00
ip6_output.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-07-03 14:55:13 -07:00
ip6_tunnel.c GRE: Refactor GRE tunneling code. 2013-03-26 12:27:18 -04:00
ip6mr.c ip6mr: change the prototype of ip6_mr_forward(). 2013-07-23 17:20:37 -07:00
ipcomp6.c ipv6: Add redirect support to all protocol icmp error handlers. 2012-07-12 00:25:15 -07:00
ipv6_sockglue.c ipv6: rename datagram_send_ctl and datagram_recv_ctl 2013-01-31 13:53:08 -05:00
Kconfig Tunneling: use IP Tunnel stats APIs. 2013-03-26 12:27:19 -04:00
Makefile net: ipv6: Add IPv6 support to the ping socket. 2013-05-25 21:07:49 -07:00
mcast.c ipv6,mcast: always hold idev->lock before mca_lock 2013-07-01 23:39:21 -07:00
mip6.c ipv6: Correct comparisons and calculations using skb->tail and skb-transport_header 2013-05-28 23:49:07 -07:00
ndisc.c ndisc: bool initializations should use true and false 2013-07-16 12:13:31 -07:00
netfilter.c netfilter: add nf_ipv6_ops hook to fix xt_addrtype with IPv6 2013-05-23 11:58:55 +02:00
output_core.c ipv6: Correct comparisons and calculations using skb->tail and skb-transport_header 2013-05-28 23:49:07 -07:00
ping.c net: ipv6: fix wrong ping_v6_sendmsg return value 2013-07-03 17:42:05 -07:00
proc.c snmp6: remove IPSTATS_MIB_CSUMERRORS 2013-05-31 16:26:49 -07:00
protocol.c ipv6: Pull IPv6 GSO registration out of the module 2012-11-15 17:39:24 -05:00
raw.c net: ipv6: Unify {raw,udp}6_sock_seq_show. 2013-06-04 12:56:14 -07:00
reassembly.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-04-22 20:32:51 -04:00
route.c net ipv6: Remove rebundant rt6i_nsiblings initialization 2013-07-24 14:24:56 -07:00
sit.c sit: fix tunnel update via netlink 2013-07-04 14:55:47 -07:00
syncookies.c tcp: Remove TCPCT 2013-03-17 14:35:13 -04:00
sysctl_net_ipv6.c net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
tcp_ipv6.c tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
tcpv6_offload.c net: Remove code duplication between offload structures 2012-11-15 17:39:51 -05:00
tunnel6.c net: ipv6: Standardize prefixes for message logging 2012-05-16 01:01:03 -04:00
udp.c net: rename ll methods to busy-poll 2013-07-10 17:08:27 -07:00
udp_impl.h ipv6: do not clear pinet6 field 2013-05-11 16:26:38 -07:00
udp_offload.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-06-05 16:37:30 -07:00
udplite.c ipv6: do not clear pinet6 field 2013-05-11 16:26:38 -07:00
xfrm6_input.c netfilter: ipv6: use NFPROTO values for NF_HOOK invocation 2010-03-25 16:00:49 +01:00
xfrm6_mode_beet.c ipsec: be careful of non existing mac headers 2012-02-23 16:50:45 -05:00
xfrm6_mode_ro.c [IPSEC]: Make x->lastused an unsigned long 2008-01-28 14:53:52 -08:00
xfrm6_mode_transport.c [IPSEC]: Use IPv6 calling convention as the convention for x->mode->output 2007-10-10 16:55:54 -07:00
xfrm6_mode_tunnel.c xfrm: allow to avoid copying DSCP during encapsulation 2013-03-06 07:02:45 +01:00
xfrm6_output.c xfrm6: remove unneeded NULL check in __xfrm6_output() 2012-02-01 02:52:48 -05:00
xfrm6_policy.c xfrm6: release dev before returning error 2013-05-11 17:40:15 -07:00
xfrm6_state.c ipv6: use IS_ENABLED() 2012-11-01 12:41:35 -04:00
xfrm6_tunnel.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00