redonkable/alistair23-linux

Author	SHA1	Message	Date
Jiri Benc	67b61f6c13	netlink: implement nla_get_in_addr and nla_get_in6_addr Those are counterparts to nla_put_in_addr and nla_put_in6_addr. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:58:35 -04:00
Jiri Benc	930345ea63	netlink: implement nla_put_in_addr and nla_put_in6_addr IP addresses are often stored in netlink attributes. Add generic functions to do that. For nla_put_in_addr, it would be nicer to pass struct in_addr but this is not used universally throughout the kernel, in way too many places __be32 is used to store IPv4 address. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:58:35 -04:00
Jiri Benc	8f55db4860	tcp: simplify inetpeer_addr_base use In many places, the a6 field is typecasted to struct in6_addr. As the fields are in union anyway, just add in6_addr type to the union and get rid of the typecasting. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:58:35 -04:00
Alexander Duyck	6e47d6caff	fib_trie: Cleanup ip_fib_net_exit code path While fixing a recent issue I noticed that we are doing some unnecessary work inside the loop for ip_fib_net_exit. As such I am pulling out the initialization to NULL for the locally stored fib_local, fib_main, and fib_default. In addition I am restoring the original code for flushing the table as there is no need to split up the fib_table_flush and hlist_del work since the code for packing the tnodes with multiple key vectors was dropped. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:18:56 -04:00
Alexander Duyck	ad88d05136	fib_trie: Fix warning on fib4_rules_exit This fixes the following warning: BUG: sleeping function called from invalid context at mm/slub.c:1268 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0 INFO: lockdep is turned off. CPU: 3 PID: 6 Comm: kworker/u8:0 Tainted: G W 4.0.0-rc5+ #895 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: netns cleanup_net 0000000000000006 ffff88011953fa68 ffffffff81a203b6 000000002c3a2c39 ffff88011952a680 ffff88011953fa98 ffffffff8109daf0 ffff8801186c6aa8 ffffffff81fbc9e5 00000000000004f4 0000000000000000 ffff88011953fac8 Call Trace: [<ffffffff81a203b6>] dump_stack+0x4c/0x65 [<ffffffff8109daf0>] ___might_sleep+0x1c3/0x1cb [<ffffffff8109db70>] __might_sleep+0x78/0x80 [<ffffffff8117a60e>] slab_pre_alloc_hook+0x31/0x8f [<ffffffff8117d4f6>] __kmalloc+0x69/0x14e [<ffffffff818ed0e1>] ? kzalloc.constprop.20+0xe/0x10 [<ffffffff818ed0e1>] kzalloc.constprop.20+0xe/0x10 [<ffffffff818ef622>] fib_trie_table+0x27/0x8b [<ffffffff818ef6bd>] fib_trie_unmerge+0x37/0x2a6 [<ffffffff810b06e1>] ? arch_local_irq_save+0x9/0xc [<ffffffff818e9793>] fib_unmerge+0x2d/0xb3 [<ffffffff818f5f56>] fib4_rule_delete+0x1f/0x52 [<ffffffff817f1c3f>] ? fib_rules_unregister+0x30/0xb2 [<ffffffff817f1c8b>] fib_rules_unregister+0x7c/0xb2 [<ffffffff818f64a1>] fib4_rules_exit+0x15/0x18 [<ffffffff818e8c0a>] ip_fib_net_exit+0x23/0xf2 [<ffffffff818e91f8>] fib_net_exit+0x32/0x36 [<ffffffff817c8352>] ops_exit_list+0x45/0x57 [<ffffffff817c8d3d>] cleanup_net+0x13c/0x1cd [<ffffffff8108b05d>] process_one_work+0x255/0x4ad [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f [<ffffffff81090686>] kthread+0xd4/0xdc [<ffffffff8109ec8f>] ? local_clock+0x19/0x22 [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83 [<ffffffff81a2c0c8>] ret_from_fork+0x58/0x90 [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83 The issue was that as a part of exiting the default rules were being deleted which resulted in the local trie being unmerged. By moving the freeing of the FIB tables up we can avoid the unmerge since there is no local table left when we call the fib4_rules_exit function. Fixes: `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-31 13:18:56 -04:00
David S. Miller	4ef295e047	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, nf_tables updates to add the set extension infrastructure and finish the transaction for sets from Patrick McHardy. More specifically, they are: 1) Move netns to basechain and use recently added possible_net_t, from Patrick McHardy. 2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches. 3) Restore nf_log_trace that was accidentally removed during conflict resolution. 4) nft_queue does not depend on NETFILTER_XTABLES, starting from here all patches from Patrick McHardy. 5) Use raw_smp_processor_id() in nft_meta. Then, several patches to prepare ground for the new set extension infrastructure: 6) Pass object length to the hash callback in rhashtable as needed by the new set extension infrastructure. 7) Cleanup patch to restore struct nft_hash as wrapper for struct rhashtable 8) Another small source code readability cleanup for nft_hash. 9) Convert nft_hash to rhashtable callbacks. And finally... 10) Add the new set extension infrastructure. 11) Convert the nft_hash and nft_rbtree sets to use it. 12) Batch set element release to avoid several RCU grace period in a row and add new function nft_set_elem_destroy() to consolidate set element release. 13) Return the set extension data area from nft_lookup. 14) Refactor existing transaction code to add some helper functions and document it. 15) Complete the set transaction support, using similar approach to what we already use, to activate/deactivate elements in an atomic fashion. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-29 12:43:43 -07:00
Eric Dumazet	41d25fe092	tcp: tcp_syn_flood_action() can be static After commit `1fb6f159fd` ("tcp: add tcp_conn_request"), tcp_syn_flood_action() is no longer used from IPv6. We can make it static, by moving it above tcp_conn_request() Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-29 12:17:18 -07:00
WANG Cong	f243e5a785	ipmr,ip6mr: call ip6mr_free_table() on failure path Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-29 12:13:54 -07:00
Christoph Hellwig	e2e40f2c1e	fs: move struct kiocb to fs.h struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-03-25 20:28:11 -04:00
Hannes Frederic Sowa	b6a7719aed	ipv4: hash net ptr into fragmentation bucket selection As namespaces are sometimes used with overlapping ip address ranges, we should also use the namespace as input to the hash to select the ip fragmentation counter bucket. Cc: Eric Dumazet <edumazet@google.com> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-25 14:07:04 -04:00
Joe Perches	a81b2ce850	netfilter: Use LOGLEVEL_<FOO> defines Use the #defines where appropriate. Miscellanea: Add explicit #include <linux/kernel.h> where it was not previously used so that these #defines are a bit more explicitly defined instead of indirectly included via: module.h->moduleparam.h->kernel.h Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-25 12:09:39 +01:00
Eric Dumazet	0144a81ccc	tcp: fix ipv4 mapped request socks ss should display ipv4 mapped request sockets like this : tcp SYN-RECV 0 0 ::ffff:192.168.0.1:8080 ::ffff:192.0.2.1:35261 and not like this : tcp SYN-RECV 0 0 192.168.0.1:8080 192.0.2.1:35261 We should init ireq->ireq_family based on listener sk_family, not the actual protocol carried by SYN packet. This means we can set ireq_family in inet_reqsk_alloc() Fixes: `3f66b083a5` ("inet: introduce ireq_family") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-25 00:57:48 -04:00
Eric Dumazet	fd3a154a00	tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup() With request socks convergence, we no longer need different lookup methods. A request socket can use generic lookup function. Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-24 21:16:30 -04:00
Eric Dumazet	39f8e58e53	tcp: md5: remove request sock argument of calc_md5_hash() Since request and established sockets now have same base, there is no need to pass two pointers to tcp_v4_md5_hash_skb() or tcp_v6_md5_hash_skb() Also add a const qualifier to their struct tcp_md5sig_key argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-24 21:16:30 -04:00
Eric Dumazet	ff74e23f7e	tcp: md5: input path is run under rcu protected sections It is guaranteed that both tcp_v4_rcv() and tcp_v6_rcv() run from rcu read locked sections : ip_local_deliver_finish() and ip6_input_finish() both use rcu_read_lock() Also align tcp_v6_inbound_md5_hash() on tcp_v4_inbound_md5_hash() by returning a boolean. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-24 21:16:29 -04:00
Eric Dumazet	0980c1e308	tcp: use C99 initializers in new_state[] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-24 21:16:29 -04:00
Eric Dumazet	80f03e27a3	tcp: md5: fix rcu lockdep splat While timer handler effectively runs a rcu read locked section, there is no explicit rcu_read_lock()/rcu_read_unlock() annotations and lockdep can be confused here : net/ipv4/tcp_ipv4.c-906- /* caller either holds rcu_read_lock() or socket lock */ net/ipv4/tcp_ipv4.c:907: md5sig = rcu_dereference_check(tp->md5sig_info, net/ipv4/tcp_ipv4.c-908- sock_owned_by_user(sk) || net/ipv4/tcp_ipv4.c-909- lockdep_is_held(&sk->sk_lock.slock)); Let's explicitly acquire rcu_read_lock() in tcp_make_synack() Before commit `fa76ce7328` ("inet: get rid of central tcp/dccp listener timer"), we were holding listener lock so lockdep was happy. Fixes: `fa76ce7328` ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-24 21:16:29 -04:00
Michal Kubeček	d0c294c53a	tcp: prevent fetching dst twice in early demux code On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux() struct dst_entry *dst = sk->sk_rx_dst; if (dst) dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie); to code reading sk->sk_rx_dst twice, once for the test and once for the argument of ip6_dst_check() (dst_check() is inline). This allows ip6_dst_check() to be called with null first argument, causing a crash. Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6 TCP early demux code. Fixes: `41063e9dd1` ("ipv4: Early TCP socket demux.") Fixes: `c7109986db` ("ipv6: Early TCP socket demux") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 22:38:24 -04:00
David S. Miller	d5c1d8c567	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/netfilter/nf_tables_core.c The nf_tables_core.c conflict was resolved using a conflict resolution from Stephen Rothwell as a guide. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 22:22:43 -04:00
David S. Miller	40451fd013	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Basically, more incremental updates for br_netfilter from Florian Westphal, small nf_tables updates (including one fix for rb-tree locking) and small two-liner to add extra validation for the REJECT6 target. More specifically, they are: 1) Use the conntrack status flags from br_netfilter to know that DNAT is happening. Patch for Florian Westphal. 2) nf_bridge->physoutdev == NULL already indicates that the traffic is bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian. 3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc, from Joe Perches. 4) Consolidation of nf_tables_newtable() error path. 5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(), from Florian Westphal. 6) Access rb-tree root node inside the lock and remove unnecessary locking from the get path (we already hold nfnl_lock there), from Patrick McHardy. 7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't support interval, also from Patrick. 8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is actually restricting matches to TCP. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 22:02:46 -04:00
Fan Du	c69736696c	inet: fix double request socket freeing Erik Hugne reported the following error : I'm hitting this warning on latest net-next when i try to SSH into a machine with eth0 added to a bridge (but i think the problem is older than that) Steps to reproduce: node2 ~ # brctl addif br0 eth0 [ 223.758785] device eth0 entered promiscuous mode node2 ~ # ip link set br0 up [ 244.503614] br0: port 1(eth0) entered forwarding state [ 244.505108] br0: port 1(eth0) entered forwarding state node2 ~ # [ 251.160159] ------------[ cut here ]------------ [ 251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720() [ 251.162077] Modules linked in: [ 251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18 [ 251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 251.164078] ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898 [ 251.165084] 0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0 [ 251.166195] ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0 [ 251.167203] Call Trace: [ 251.167533] [<ffffffff8162ace4>] dump_stack+0x45/0x57 [ 251.168206] [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0 [ 251.169239] [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20 [ 251.170271] [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720 [ 251.171408] [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10 [ 251.172589] [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40 [ 251.173366] [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0 [ 251.174134] [<ffffffff815693a2>] icmp_unreach+0xc2/0x280 [ 251.174820] [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0 [ 251.175473] [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0 [ 251.176282] [<ffffffff815354d8>] ip_local_deliver+0x88/0x90 [ 251.177004] [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310 [ 251.177693] [<ffffffff815357bc>] ip_rcv+0x2dc/0x390 [ 251.178336] [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20 [ 251.179170] [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80 [ 251.179922] [<ffffffff814f97d4>] process_backlog+0x94/0x120 [ 251.180639] [<ffffffff814f9612>] net_rx_action+0x1e2/0x310 [ 251.181356] [<ffffffff81051267>] __do_softirq+0xa7/0x290 [ 251.182046] [<ffffffff81051469>] run_ksoftirqd+0x19/0x30 [ 251.182726] [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0 [ 251.183485] [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130 [ 251.184228] [<ffffffff8106935e>] kthread+0xee/0x110 [ 251.184871] [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0 [ 251.185690] [<ffffffff81631108>] ret_from_fork+0x58/0x90 [ 251.186385] [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0 [ 251.187216] ---[ end trace c947fc7b24e42ea1 ]--- [ 259.542268] br0: port 1(eth0) entered forwarding state Remove the double calls to reqsk_put() [edumazet] : I got confused because reqsk_timer_handler() _has_ to call reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as the timer handler holds a reference on req. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Erik Hugne <erik.hugne@ericsson.com> Fixes: `fa76ce7328` ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 21:40:48 -04:00
Alexander Duyck	b6f15f828d	fib_trie: Fix regression in handling of inflate/halve failure When I updated the code to address a possible null pointer dereference in resize I ended up reverting an exception handling fix for the suffix length in the event that inflate or halve failed. This change is meant to correct that by reverting the earlier fix and instead simply getting the parent again after inflate has been completed to avoid the possible null pointer issue. Fixes: `ddb4b9a13` ("fib_trie: Address possible NULL pointer dereference in resize") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:58:32 -04:00
Eric Dumazet	26e3736090	ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets tcp_v4_err() can restrict lookups to ehash table, and not to listeners. Note this patch creates the infrastructure, but this means that ICMP messages for request sockets are ignored until complete conversion. New tcp_req_err() helper is exported so that we can use it in IPv6 in following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:52:26 -04:00
Eric Dumazet	b282705336	net: convert syn_wait_lock to a spinlock This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually. We hold syn_wait_lock for such small sections, that it makes no sense to use a read/write lock. A spin lock is simply faster. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:52:26 -04:00
Eric Dumazet	8b929ab12f	inet: remove some sk_listener dependencies listener can be source of false sharing. request sock has some useful information like : ireq->ir_iif, ireq->ir_num, ireq->ireq_net This patch does not solve the major problem of having to read sk->sk_protocol which is sharing a cache line with sk->sk_wmem_alloc. (This same field is read later in ip_build_and_send_pkt()) One idea would be to move sk_protocol close to sk_family (using 8 bits instead of 16 for sk_family seems enough) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:52:26 -04:00
Eric Dumazet	42cb80a235	inet: remove sk_listener parameter from syn_ack_timeout() It is not needed, and req->sk_listener points to the listener anyway. request_sock argument can be const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:52:25 -04:00
Eric Dumazet	2b41fab70f	inet: cache listen_sock_qlen() and read rskq_defer_accept once Cache listen_sock_qlen() to limit false sharing, and read rskq_defer_accept once as it might change under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:52:25 -04:00
David S. Miller	c0e41fa76c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix missing initialization of tuple structure in nfnetlink_cthelper to avoid mismatches when looking up to attach userspace helpers to flows, from Ian Wilson. 2) Fix potential crash in nft_hash when we hit -EAGAIN in nft_hash_walk(), from Herbert Xu. 3) We don't need to indicate the hook information to update the basechain default policy in nf_tables. 4) Restore tracing over nfnetlink_log due to recent rework to accommodate logging infrastructure into nf_tables. 5) Fix wrong IP6T_INV_PROTO check in xt_TPROXY. 6) Set IP6T_F_PROTO flag in nft_compat so we can use SYNPROXY6 and REJECT6 from xt over nftables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-22 16:57:07 -04:00
Florian Westphal	8d0451638a	netfilter: bridge: kill nf_bridge_pad The br_netfilter frag output function calls skb_cow_head() so in case it needs a larger headroom to e.g. re-add a previously stripped PPPOE or VLAN header things will still work (at cost of reallocation). We can then move nf_bridge_encap_header_len to br_netfilter. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-22 19:45:55 +01:00
Eric Dumazet	d3593b5cef	Revert "selinux: add a skb_owned_by() hook" This reverts commit `ca10b9e9a8`. No longer needed after commit `eb8895debe` ("tcp: tcp_make_synack() should use sock_wmalloc") When under SYNFLOOD, we build lot of SYNACK and hit false sharing because of multiple modifications done on sk_listener->sk_wmem_alloc Since tcp_make_synack() uses sock_wmalloc(), there is no need to call skb_set_owner_w() again, as this adds two atomic operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 21:36:53 -04:00
David S. Miller	0fa74a4be4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 18:51:09 -04:00
Josh Hunt	d22e153718	tcp: fix tcp fin memory accounting tcp_send_fin() does not account for the memory it allocates properly, so sk_forward_alloc can be negative in cases where we've sent a FIN: ss example output (ss -amn | grep -B1 f4294): tcp FIN-WAIT-1 0 1 192.168.0.1:45520 192.0.2.1:8080 skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0) Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 13:18:52 -04:00
Eric Dumazet	fa76ce7328	inet: get rid of central tcp/dccp listener timer One of the major issue for TCP is the SYNACK rtx handling, done by inet_csk_reqsk_queue_prune(), fired by the keepalive timer of a TCP_LISTEN socket. This function runs for awful long times, with socket lock held, meaning that other cpus needing this lock have to spin for hundred of ms. SYNACK are sent in huge bursts, likely to cause severe drops anyway. This model was OK 15 years ago when memory was very tight. We now can afford to have a timer per request sock. Timer invocations no longer need to lock the listener, and can be run from all cpus in parallel. With following patch increasing somaxconn width to 32 bits, I tested a listener with more than 4 million active request sockets, and a steady SYNFLOOD of ~200,000 SYN per second. Host was sending ~830,000 SYNACK per second. This is ~100 times more what we could achieve before this patch. Later, we will get rid of the listener hash and use ehash instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 12:40:25 -04:00
Eric Dumazet	52452c5425	inet: drop prev pointer handling in request sock When request sock are put in ehash table, the whole notion of having a previous request to update dl_next is pointless. Also, following patch will get rid of big purge timer, so we want to delete a request sock without holding listener lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 12:40:25 -04:00
Pablo Neira Ayuso	4017a7ee69	netfilter: restore rule tracing via nfnetlink_log Since `fab4085` ("netfilter: log: nf_log_packet() as real unified interface"), the loginfo structure that is passed to nf_log_packet() is used to explicitly indicate the logger type you want to use. This is a problem for people tracing rules through nfnetlink_log since packets are always routed to the NF_LOG_TYPE logger after the aforementioned patch. We can fix this by removing the trace loginfo structures, but that still changes the log level from 4 to 5 for tracing messages and there may be someone relying on this out there. So let's just introduce a new nf_log_trace() function that restores the former behaviour. Reported-by: Markus Kötter <koetter@rrzn.uni-hannover.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-19 11:14:48 +01:00
Eric Dumazet	738e6d30d3	inet: add a schedule point in inet_twsk_purge() On a large hash table, we can easily spend seconds to walk over all entries. Add a cond_resched() to yield cpu if necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:38:13 -04:00
Marcelo Ricardo Leitner	54ff9ef36b	ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop} in favor of their inner __ ones, which doesn't grab rtnl. As these functions need to operate on a locked socket, we can't be grabbing rtnl by then. It's too late and doing so causes reversed locking. So this patch: - move rtnl handling to callers instead while already fixing some reversed locking situations, like on vxlan and ipvs code. - renames __ ones to not have the __ mark: __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop} Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:05:09 -04:00
Marcelo Ricardo Leitner	baf606d9c9	ipv4,ipv6: grab rtnl before locking the socket There are some setsockopt operations in ipv4 and ipv6 that are grabbing rtnl after having grabbed the socket lock. Yet this makes it impossible to do operations that have to lock the socket when already within a rtnl protected scope, like ndo dev_open and dev_stop. We normally take coarse grained locks first but setsockopt inverted that. So this patch inverts the lock logic for these operations and makes setsockopt grab rtnl if it will be needed prior to grabbing socket lock. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:05:09 -04:00
Eric Dumazet	08d2cc3b26	inet: request sock should init IPv6/IPv4 addresses In order to be able to use sk_ehashfn() for request socks, we need to initialize their IPv6/IPv4 addresses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:35 -04:00
Eric Dumazet	b4d6444ea3	inet: get rid of last __inet_hash_connect() argument We now always call __inet_hash_nolisten(), no need to pass it as an argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:35 -04:00
Eric Dumazet	77a6a471bc	ipv6: get rid of __inet6_hash() We can now use inet_hash() and __inet_hash() instead of private functions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:35 -04:00
Eric Dumazet	d1e559d0b1	inet: add IPv6 support to sk_ehashfn() Intent is to converge IPv4 & IPv6 inet_hash functions to factorize code. IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set() helpers. __inet6_hash can now use sk_ehashfn() instead of a private inet6_sk_ehashfn() and will simply use __inet_hash() in a following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:34 -04:00
Eric Dumazet	5b441f76f1	net: introduce sk_ehashfn() helper Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers for all kind of sockets (full sockets, timewait and request sockets) inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:34 -04:00
Eric Dumazet	6eada0110c	netns: constify net_hash_mix() and various callers const qualifiers ease code review by making clear which objects are not written in a function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-18 22:00:34 -04:00
Joe Perches	1ca9e41770	netfilter: Remove uses of seq_<foo> return values The seq_printf/seq_puts/seq_putc return values, because they are frequently misused, will eventually be converted to void. See: commit `1f33c41c03` ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Miscellanea: o realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-18 10:51:35 +01:00
Eric Dumazet	0470c8ca1d	inet: fix request sock refcounting While testing last patch series, I found req sock refcounting was wrong. We must set skc_refcnt to 1 for all request socks added in hashes, but also on request sockets created by FastOpen or syncookies. It is tricky because we need to defer this initialization so that future RCU lookups do not try to take a refcount on a not yet fully initialized request socket. Also get rid of ireq_refcnt alias. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `13854e5a60` ("inet: add proper refcounting to request sock") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:02:29 -04:00
Eric Dumazet	e3d95ad7da	inet: avoid fastopen lock for regular accept() It is not because a TCP listener is FastOpen ready that all incoming sockets actually used FastOpen. Avoid taking queue->fastopenq->lock if not needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:01:56 -04:00
Eric Dumazet	9439ce00f2	tcp: rename struct tcp_request_sock listener The listener field in struct tcp_request_sock is a pointer back to the listener. We now have req->rsk_listener, so TCP only needs one boolean and not a full pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:01:56 -04:00
Eric Dumazet	4e9a578e5b	inet: add rsk_listener field to struct request_sock Once we'll be able to lookup request sockets in ehash table, we'll need to get access to listener which created this request. This avoid doing a lookup to find the listener, which benefits for a more solid SO_REUSEPORT, and is needed once we no longer queue request sock into a listener private queue. Note that 'struct tcp_request_sock'->listener could be reduced to a single bit, as TFO listener should match req->rsk_listener. TFO will no longer need to hold a reference on the listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:01:56 -04:00
Eric Dumazet	e49bb337d7	inet: uninline inet_reqsk_alloc() inet_reqsk_alloc() is becoming fat and should not be inlined. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:01:56 -04:00
Eric Dumazet	407640de21	inet: add sk_listener argument to inet_reqsk_alloc() listener socket can be used to set net pointer, and will be later used to hold a reference on listener. Add a const qualifier to first argument (struct request_sock_ops *), and factorize all write_pnet(&ireq->ireq_net, sock_net(sk)); Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 22:01:55 -04:00
Eric Dumazet	7970ddc8f9	tcp: uninline tcp_oow_rate_limited() tcp_oow_rate_limited() is hardly used in fast path, there is no point inlining it. Signed-of-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 15:18:00 -04:00
Eric Dumazet	1bfc4438a7	tcp: move tcp_openreq_init() to tcp_input.c This big helper is called once from tcp_conn_request(), there is no point having it in an include. Compiler will inline it anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 15:18:00 -04:00
Eric Dumazet	cb7cf8a33f	inet: Clean up inet_csk_wait_for_connect() vs. might_sleep() I got the following trace with current net-next kernel : [14723.885290] WARNING: CPU: 26 PID: 22658 at kernel/sched/core.c:7285 __might_sleep+0x89/0xa0() [14723.885325] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810e8734>] prepare_to_wait_exclusive+0x34/0xa0 [14723.885355] CPU: 26 PID: 22658 Comm: netserver Not tainted 4.0.0-dbg-DEV #1379 [14723.885359] ffffffff81a223a8 ffff881fae9e7ca8 ffffffff81650b5d 0000000000000001 [14723.885364] ffff881fae9e7cf8 ffff881fae9e7ce8 ffffffff810a72e7 0000000000000000 [14723.885367] ffffffff81a57620 000000000000093a 0000000000000000 ffff881fae9e7e64 [14723.885371] Call Trace: [14723.885377] [<ffffffff81650b5d>] dump_stack+0x4c/0x65 [14723.885382] [<ffffffff810a72e7>] warn_slowpath_common+0x97/0xe0 [14723.885386] [<ffffffff810a73e6>] warn_slowpath_fmt+0x46/0x50 [14723.885390] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [14723.885393] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0 [14723.885396] [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0 [14723.885399] [<ffffffff810ccdc9>] __might_sleep+0x89/0xa0 [14723.885403] [<ffffffff81581846>] lock_sock_nested+0x36/0xb0 [14723.885406] [<ffffffff815829a3>] ? release_sock+0x173/0x1c0 [14723.885411] [<ffffffff815ea1f7>] inet_csk_accept+0x157/0x2a0 [14723.885415] [<ffffffff810e8900>] ? abort_exclusive_wait+0xc0/0xc0 [14723.885419] [<ffffffff8161b96d>] inet_accept+0x2d/0x150 [14723.885424] [<ffffffff8157db6f>] SYSC_accept4+0xff/0x210 [14723.885428] [<ffffffff8165a451>] ? retint_swapgs+0xe/0x44 [14723.885431] [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [14723.885437] [<ffffffff81369c0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [14723.885441] [<ffffffff8157ef40>] SyS_accept+0x10/0x20 [14723.885444] [<ffffffff81659872>] system_call_fastpath+0x12/0x17 [14723.885447] ---[ end trace ff74cd83355b1873 ]--- In commit `26cabd3125` Peter added a sched_annotate_sleep() in sk_wait_event() Is the following patch needed as well ? Alternative would be to use sk_wait_event() from inet_csk_wait_for_connect() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-17 15:03:54 -04:00
Eric Dumazet	9f1ab18672	tcp_metrics: fix wrong lockdep annotations Changes in tcp_metric hash table are protected by tcp_metrics_lock only, not by genl_mutex While we are at it use deref_locked() instead of rcu_dereference() in tcp_new() to avoid unnecessary barrier, as we hold tcp_metrics_lock as well. Reported-by: Andrew Vagin <avagin@parallels.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `098a697b49` ("tcp_metrics: Use a single hash table for all network namespaces.") Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 16:32:23 -04:00
David S. Miller	ca00942a81	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2015-03-16 1) Fix the network header offset in _decode_session6 when multiple IPv6 extension headers are present. From Hajime Tazaki. 2) Fix an interfamily tunnel crash. We set outer mode protocol too early and may dispatch to the wrong address family. Move the setting of the outer mode protocol behind the last accessing of the inner mode to fix the crash. 3) Most callers of xfrm_lookup() expect that dst_orig is released on error. But xfrm_lookup_route() may need dst_orig to handle certain error cases. So introduce a flag that tells what should be done in case of error. From Huaibin Wang. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 16:16:49 -04:00
Eric Dumazet	13854e5a60	inet: add proper refcounting to request sock reqsk_put() is the generic function that should be used to release a refcount (and automatically call reqsk_free()) reqsk_free() might be called if refcount is known to be 0 or undefined. refcnt is set to one in inet_csk_reqsk_queue_add() As request socks are not yet in global ehash table, I added temporary debugging checks in reqsk_put() and reqsk_free() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 15:55:29 -04:00
Eric Dumazet	2c13270b44	inet: factorize sock_edemux()/sock_gen_put() code sock_edemux() is not used in fast path, and should really call sock_gen_put() to save some code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 15:55:29 -04:00
Eric Dumazet	a58917f584	inet_diag: allow sk_diag_fill() to handle request socks inet_diag_fill_req() is renamed to inet_req_diag_fill() and moved up, so that it can be called fom sk_diag_fill() inet_diag_bc_sk() is ready to handle request socks. inet_twsk_diag_dump() is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 15:55:29 -04:00
Eric Dumazet	f7e4eb03f9	inet: ip early demux should avoid request sockets When a request socket is created, we do not cache ip route dst entry, like for timewait sockets. Let's use sk_fullsock() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-16 15:55:29 -04:00
Eric Dumazet	a4458343ac	inet_diag: factorize code in new inet_diag_msg_common_fill() helper Now the three type of sockets share a common base, we can factorize code in inet_diag_msg_common_fill(). inet_diag_entry no longer requires saddr_storage & daddr_storage and the extra copies. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-14 15:05:10 -04:00
Eric Dumazet	a07c92078d	inet_diag: adjust inet_sk_diag_fill() bug condition inet_sk_diag_fill() only copes with non timewait and non request socks Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-14 15:05:10 -04:00
Eric Dumazet	16f86165bd	inet: fill request sock ir_iif for IPv4 Once request socks will be in ehash table, they will need to have a valid ir_iff field. This is currently true only for IPv6. This patch extends support for IPv4 as well. This means inet_diag_fill_req() can now properly use ir_iif, which is better for IPv6 link locals anyway, as request sockets and established sockets will propagate consistent netlink idiag_if. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-14 15:05:10 -04:00
Eric Dumazet	c8e2c80d7e	inet_diag: fix possible overflow in inet_diag_dump_one_icsk() inet_diag_dump_one_icsk() allocates too small skb. Add inet_sk_attr_size() helper right before inet_sk_diag_fill() so that it can be updated if/when new attributes are added. iproute2/ss currently does not use this dump_one() interface, this might explain nobody noticed this problem yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 15:54:27 -04:00
Eric W. Biederman	098a697b49	tcp_metrics: Use a single hash table for all network namespaces. Now that all of the operations are safe on a single hash table accross network namespaces, allocate a single global hash table and update the code to use it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric W. Biederman	04f721c671	tcp_metrics: Rewrite tcp_metrics_flush_all Rewrite tcp_metrics_flush_all so that it can cope with entries from different network namespaces on it's hash chain. This is based on the logic in tcp_metrics_nl_cmd_del for deleting a selection of entries from a tcp metrics hash chain. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric W. Biederman	8a4bff714f	tcp_metrics: Remove the unused return code from tcp_metrics_flush_all tcp_metrics_flush_all always returns 0. Remove the unnecessary return code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric W. Biederman	849e8a0ca8	tcp_metrics: Add a field tcpm_net and verify it matches on lookup In preparation for using one tcp metrics hash table for all network namespaces add a field tcpm_net to struct tcp_metrics_block, and verify that field on all hash table lookups. Make the field tcpm_net of type possible_net_t so it takes no space when network namespaces are disabled. Further add a function tm_net to read that field so we can be efficient when network namespaces are disabled and concise the rest of the time. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric W. Biederman	3e5da62d0b	tcp_metrics: Mix the network namespace into the hash function. In preparation for using one hash table for all network namespaces mix the network namespace into the hash value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric W. Biederman	6493517eae	tcp_metrics: panic when tcp_metrics_init fails. There is not a practical way to cleanup during boot so just panic if there is a problem initializing tcp_metrics. That will at least give us a clear place to start debugging if something does go wrong. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-13 01:57:07 -04:00
Eric Dumazet	3f66b083a5	inet: introduce ireq_family Before inserting request socks into general hash table, fill their socket family. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 22:58:13 -04:00
Eric Dumazet	d4f06873b6	inet: get_openreq4() & get_openreq6() do not need listener ireq->ir_num contains local port, use it. Also, get_openreq4() dumping listen_sk->refcnt makes litle sense. inet_diag_fill_req() can also use ireq->ir_num Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 22:58:13 -04:00
Eric Dumazet	41b822c59e	inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state sock_edemux() & sock_gen_put() should be ready to cope with request socks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 22:58:13 -04:00
Eric Dumazet	bd337c581b	ipv6: add missing ireq_net & ir_cookie initializations I forgot to update dccp_v6_conn_request() & cookie_v6_check(). They both need to set ireq->ireq_net and ireq->ir_cookie Lets clear ireq->ir_cookie in inet_reqsk_alloc() Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `33cf7c90fe` ("net: add real socket cookies") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 22:58:12 -04:00
Alexander Duyck	0b65bd97ba	fib_trie: Provide a deterministic order for fib_alias w/ tables merged This change makes it so that we should always have a deterministic ordering for the main and local aliases within the merged table when two leaves overlap. So for example if we have a leaf with a key of 192.168.254.0. If we previously added two aliases with a prefix length of 24 from both local and main the first entry would be first and the second would be second. When I was coding this I had added a WARN_ON should such a situation occur as I wasn't sure how likely it would be. However this WARN_ON has been triggered so this is something that should be addressed. With this patch the ordering of the aliases is as follows. First they are sorted on prefix length, then on their table ID, then tos, and finally priority. This way what we end up doing is essentially interleaving the two tables on what used to be leaf_info structure boundaries. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 18:26:51 -04:00
Alexander Duyck	3c9e9f7320	fib_trie: Avoid NULL pointer if local table is not allocated The function fib_unmerge assumed the local table had already been allocated. If that is not the case however when custom rules are applied then this can result in a NULL pointer dereference. In order to prevent this we must check the value of the local table pointer and if it is NULL simply return 0 as there is no local table to separate from the main. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Reported-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 18:26:51 -04:00
Eric W. Biederman	0c5c9fb551	net: Introduce possible_net_t Having to say > #ifdef CONFIG_NET_NS > struct net *net; > #endif in structures is a little bit wordy and a little bit error prone. Instead it is possible to say: > typedef struct { > #ifdef CONFIG_NET_NS > struct net *net; > #endif > } possible_net_t; And then in a header say: > possible_net_t net; Which is cleaner and easier to use and easier to test, as the possible_net_t is always there no matter what the compile options. Further this allows read_pnet and write_pnet to be functions in all cases which is better at catching typos. This change adds possible_net_t, updates the definitions of read_pnet and write_pnet, updates optional struct net * variables that write_pnet uses on to have the type possible_net_t, and finally fixes up the b0rked users of read_pnet and write_pnet. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 14:39:40 -04:00
Eric W. Biederman	efd7ef1c19	net: Kill hold_net release_net hold_net and release_net were an idea that turned out to be useless. The code has been disabled since 2008. Kill the code it is long past due. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 14:39:40 -04:00
Eric Dumazet	c29390c6df	xps: must clear sender_cpu before forwarding John reported that my previous commit added a regression on his router. This is because sender_cpu & napi_id share a common location, so get_xps_queue() can see garbage and perform an out of bound access. We need to make sure sender_cpu is cleared before doing the transmit, otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger this bug. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: John <jw@nuclearfallout.net> Bisected-by: John <jw@nuclearfallout.net> Fixes: `2bd82484bb` ("xps: fix xps for stacked devices") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 23:51:18 -04:00
Eric Dumazet	d77c555d32	net: fix CONFIG_NET_NS=n compilation I forgot to use write_pnet() in three locations. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `33cf7c90fe` ("net: add real socket cookies") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 23:28:49 -04:00
Eric Dumazet	33cf7c90fe	net: add real socket cookies A long standing problem in netlink socket dumps is the use of kernel socket addresses as cookies. 1) It is a security concern. 2) Sockets can be reused quite quickly, so there is no guarantee a cookie is used once and identify a flow. 3) request sock, establish sock, and timewait socks for a given flow have different cookies. Part of our effort to bring better TCP statistics requires to switch to a different allocator. In this patch, I chose to use a per network namespace 64bit generator, and to use it only in the case a socket needs to be dumped to netlink. (This might be refined later if needed) Note that I tried to carry cookies from request sock, to establish sock, then timewait sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eric Salo <salo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 21:55:28 -04:00
Alexander Duyck	654eff4516	fib_trie: Only display main table in /proc/net/route When we merged the tries for local and main I had overlooked the iterator for /proc/net/route. As a result it was outputting both local and main when the two tries were merged. This patch resolves that by only providing output for aliases that are actually in the main trie. As a result we should go back to the original behavior which I assume will be necessary to maintain legacy support. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 21:24:32 -04:00
Alexander Duyck	61f0d861fc	fib_trie: Fix uninitialized variable warning The 0-day kernel test infrastructure reported a use of uninitialized variable warning for local_table due to the fact that the local and main allocations had been swapped from the original setup. This change corrects that by making it so that we free the main table if the local table allocation fails. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 17:33:44 -04:00
Neal Cardwell	d578e18ce9	tcp: restore 1.5x per RTT limit to CUBIC cwnd growth in congestion avoidance Commit `814d488c61` ("tcp: fix the timid additive increase on stretch ACKs") fixed a bug where tcp_cong_avoid_ai() would either credit a connection with an increase of snd_cwnd_cnt, or increase snd_cwnd, but not both, resulting in cwnd increasing by 1 packet on at most every alternate invocation of tcp_cong_avoid_ai(). Although the commit correctly implemented the CUBIC algorithm, which can increase cwnd by as much as 1 packet per 1 packet ACKed (2x per RTT), in practice that could be too aggressive: in tests on network paths with small buffers, YouTube server retransmission rates nearly doubled. This commit restores CUBIC to a maximum cwnd growth rate of 1 packet per 2 packets ACKed (1.5x per RTT). In YouTube tests this restored retransmit rates to low levels. Testing: This patch has been tested in datacenter netperf transfers and live youtube.com and google.com servers. Fixes: `9cd981dcf1` ("tcp: fix stretch ACK bugs in CUBIC") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 16:51:51 -04:00
Neal Cardwell	9949afa42b	tcp: fix tcp_cong_avoid_ai() credit accumulation bug with decreases in w The recent change to tcp_cong_avoid_ai() to handle stretch ACKs introduced a bug where snd_cwnd_cnt could accumulate a very large value while w was large, and then if w was reduced snd_cwnd could be incremented by a large delta, leading to a large burst and high packet loss. This was tickled when CUBIC's bictcp_update() sets "ca->cnt = 100 * cwnd". This bug crept in while preparing the upstream version of `814d488c61`. Testing: This patch has been tested in datacenter netperf transfers and live youtube.com and google.com servers. Fixes: `814d488c61` ("tcp: fix the timid additive increase on stretch ACKs") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 16:51:51 -04:00
Sabrina Dubroca	6dede75b7e	fib_trie: call fib_table_flush_external under RTNL Move rtnl_lock() before the call to fib4_rules_exit so that fib_table_flush_external is called under RTNL. Fixes: `104616e74e` ("switchdev: don't support custom ip rules, for now") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 16:46:26 -04:00
Alexander Duyck	0ddcf43d5d	ipv4: FIB Local/MAIN table collapse This patch is meant to collapse local and main into one by converting tb_data from an array to a pointer. Doing this allows us to point the local table into the main while maintaining the same variables in the table. As such the tb_data was converted from an array to a pointer, and a new array called data is added in order to still provide an object for tb_data to point to. In order to track the origin of the fib aliases a tb_id value was added in a hole that existed on 64b systems. Using this we can also reverse the merge in the event that custom FIB rules are enabled. With this patch I am seeing an improvement of 20ns to 30ns for routing lookups as long as custom rules are not enabled, with custom rules enabled we fall back to split tables and the original behavior. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 16:22:14 -04:00
Alexander Duyck	ddb4b9a132	fib_trie: Address possible NULL pointer dereference in resize If the inflate call failed it would return NULL. As a result tp would be set to NULL and cause use to trigger a NULL pointer dereference in should_halve if the inflate failed on the first attempt. In order to prevent this we should decrement max_work before we actually attempt to inflate as this will force us to exit before attempting to halve a node we should have inflated. In order to keep things symmetric between inflate and halve I went ahead and also moved the decrement of max_work for the halve case as well so we take care of that before we actually attempt to halve the tnode. Fixes: `88bae714` ("fib_trie: Add key vector to root, return parent key_vector in resize") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 18:36:56 -04:00
Alexander Duyck	3ec320dd5c	fib_trie: Correctly handle case of key == 0 in leaf_walk_rcu In the case of a trie that had no tnodes with a key of 0 the initial look-up would fail resulting in an out-of-bounds cindex on the first tnode. This resulted in an entire trie being skipped. In order resolve this I have updated the cindex logic in the initial look-up so that if the key is zero we will always traverse the child zero path. Fixes: `8be33e95` ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 16:13:55 -04:00
Eric Dumazet	34160ea3f9	inet_diag: add const to inet_diag_req_v2 diag dumpers should not modify the request. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 13:45:28 -04:00
Eric Dumazet	e31c5e0e48	inet_diag: cleanups Remove all inline keywords, add some const, and cleanup style. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 13:45:28 -04:00
David S. Miller	515fb5c317	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter fixes for net-next The following batch contains a couple of fixes to address some fallout from the previous pull request, they are: 1) Address link problems in the bridge code after `e5de75b`. Fix it by using rcu hook to address to avoid ifdef pollution and hard dependency between bridge and br_netfilter. 2) Address sparse warnings in the netfilter reject code, patch from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 12:48:47 -04:00
Florian Westphal	a03a8dbe20	netfilter: fix sparse warnings in reject handling make C=1 CF=-D__CHECK_ENDIAN__ shows following: net/bridge/netfilter/nft_reject_bridge.c:65:50: warning: incorrect type in argument 3 (different base types) net/bridge/netfilter/nft_reject_bridge.c:65:50: expected restricted __be16 [usertype] protocol [..] net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: cast from restricted __be16 net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: incorrect type in argument 1 (different base types) [..] net/bridge/netfilter/nft_reject_bridge.c:121:50: warning: incorrect type in argument 3 (different base types) [..] net/bridge/netfilter/nft_reject_bridge.c:168:52: warning: incorrect type in argument 3 (different base types) [..] net/bridge/netfilter/nft_reject_bridge.c:233:52: warning: incorrect type in argument 3 (different base types) [..] Caused by two (harmless) errors: 1. htons() instead of ntohs() 2. __be16 for protocol in nf_reject_ipXhdr_put API, use u8 instead. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-10 15:01:32 +01:00
Scott Feldman	f8f2147150	switchdev: add netlink flags to IPv4 FIB add op Pass in the netlink flags (NLM_F_*) into switchdev driver for IPv4 FIB add op to allow driver to 1) optimize hardware updates, 2) handle ip route prepend and append commands correctly. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 23:56:52 -04:00
David S. Miller	3cef5c5b0b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/cadence/macb.c Overlapping changes in macb driver, mostly fixes and cleanups in 'net' overlapping with the integration of at91_ether into macb in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 23:38:02 -04:00
Eric W. Biederman	ddb3b6033c	net: Remove protocol from struct dst_ops After my change to neigh_hh_init to obtain the protocol from the neigh_table there are no more users of protocol in struct dst_ops. Remove the protocol field from dst_ops and all of it's initializers. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 16:06:10 -04:00
David S. Miller	5428aef811	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, improvements for the packet rejection infrastructure, deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for br_netfilter. More specifically they are: 1) Send packet to reset flow if checksum is valid, from Florian Westphal. 2) Fix nf_tables reject bridge from the input chain, also from Florian. 3) Deprecate the CLUSTERIP target, the cluster match supersedes it in functionality and it's known to have problems. 4) A couple of cleanups for nf_tables rule tracing infrastructure, from Patrick McHardy. 5) Another cleanup to place transaction declarations at the bottom of nf_tables.h, also from Patrick. 6) Consolidate Kconfig dependencies wrt. NF_TABLES. 7) Limit table names to 32 bytes in nf_tables. 8) mac header copying in bridge netfilter is already required when calling ip_fragment(), from Florian Westphal. 9) move nf_bridge_update_protocol() to br_netfilter.c, also from Florian. 10) Small refactor in br_netfilter in the transmission path, again from Florian. 11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 15:58:21 -04:00
Willem de Bruijn	c247f0534c	ip: fix error queue empty skb handling When reading from the error queue, msg_name and msg_control are only populated for some errors. A new exception for empty timestamp skbs added a false positive on icmp errors without payload. `traceroute -M udpconn` only displayed gateways that return payload with the icmp error: the embedded network headers are pulled before sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise. Fix this regression by refining when msg_name and msg_control branches are taken. The solutions for the two fields are independent. msg_name only makes sense for errors that configure serr->port and serr->addr_offset. Test the first instead of skb->len. This also fixes another issue. saddr could hold the wrong data, as serr->addr_offset is not initialized in some code paths, pointing to the start of the network header. It is only valid when serr->port is set (non-zero). msg_control support differs between IPv4 and IPv6. IPv4 only honors requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The skb->len test can simply be removed, because skb->dev is also tested and never true for empty skbs. IPv6 honors requests for all errors aside from local errors and timestamps on empty skbs. In both cases, make the policy more explicit by moving this logic to a new function that decides whether to process msg_control and that optionally prepares the necessary fields in skb->cb[]. After this change, the IPv4 and IPv6 paths are more similar. The last case is rxrpc. Here, simply refine to only match timestamps. Fixes: `49ca0d8bfa` ("net-timestamp: no-payload option") Reported-by: Jan Niehusmann <jan@gondor.com> Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes v1->v2 - fix local origin test inversion in ip6_datagram_support_cmsg - make v4 and v6 code paths more similar by introducing analogous ipv4_datagram_support_cmsg - fix compile bug in rxrpc Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-08 23:01:54 -04:00
Alexander Duyck	88bae7149a	fib_trie: Add key vector to root, return parent key_vector in resize This change makes it so that the root of the trie contains a key_vector, by doing this we make room to essentially collapse the entire trie by at least one cache line as we can store the information about the tnode or leaf that is pointed to in the root. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	f23e59fbd7	fib_trie: Move parent from key_vector to tnode This change pulls the parent pointer from the key_vector and places it in the tnode structure. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00

6648 commits