1
0
Fork 0
Commit Graph

40697 Commits (1b655b19e28b70a45db2e30b5a864af503afff89)

Author SHA1 Message Date
Eric Dumazet 3e4006f0b8 ipv6: tcp: add rcu locking in tcp_v6_send_synack()
When first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock()

Fixes: 45f6fad84c ("ipv6: add complete rcu protection around np->opt")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:58:03 -05:00
Eric Dumazet a78cb84c62 net: add scheduling point in recvmmsg/sendmmsg
Applications often have to reduce number of datagrams
they receive or send per system call to avoid starvation problems.

Really the kernel should take care of this by using cond_resched(),
so that applications can experiment bigger batch sizes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:56:29 -05:00
Lubomir Rintel 3d171f3907 ipv6: always add flag an address that failed DAD with DADFAILED
The userspace needs to know why is the address being removed so that it can
perhaps obtain a new address.

Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:54:27 -05:00
Daniel Borkmann 1f211a1b92 net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress  bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:13:15 -05:00
Sasha Levin 320f1a4a17 net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory
proc_dostring() needs an initialized destination string, while the one
provided in proc_sctp_do_hmac_alg() contains stack garbage.

Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
accessing invalid memory.

Fixes: 3c68198e7 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 18:01:01 -05:00
Daniel Borkmann f8ffad69c9 bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions
Add a small helper skb_postpush_rcsum() and fix up redirect locations
that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
a proper csum that covers also Ethernet header, f.e. since 2c26d34bbc
("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
also do skb_postpull_rcsum() after pulling Ethernet header off via
eth_type_trans().

When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
I can trigger the following csum issue with an IPv6 setup:

  [  505.144065] dummy1: hw csum failure
  [...]
  [  505.144108] Call Trace:
  [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
  [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
  [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
  [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
  [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
  [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
  [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
  [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
  [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
  [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
  [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
  [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
  [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
  [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
  [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
  [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
  [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
  [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
  [...]

What happens is that on ingress, we push Ethernet header back in, either
from cls_bpf or right before skb_do_redirect(), but without updating csum.
The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
helper for the dev_forward_skb() case to correct the csum diff again.

Thanks to Hannes Frederic Sowa for the csum_partial() idea!

Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:54:28 -05:00
Daniel Borkmann fdc5432a7b net, sched: add skb_at_tc_ingress helper
Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
can hide the ugly ifdef.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:54:28 -05:00
Nikolay Borisov b840d15d39 ipv4: Namespecify the tcp_keepalive_intvl sysctl knob
This is the final part required to namespaceify the tcp
keep alive mechanism.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov 9bd6861bd4 ipv4: Namespecify tcp_keepalive_probes sysctl knob
This is required to have full tcp keepalive mechanism namespace
support.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov 13b287e8d1 ipv4: Namespaceify tcp_keepalive_time sysctl knob
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Hannes Frederic Sowa 787d7ac308 udp: restrict offloads to one namespace
udp tunnel offloads tend to aggregate datagrams based on inner
headers. gro engine gets notified by tunnel implementations about
possible offloads. The match is solely based on the port number.

Imagine a tunnel bound to port 53, the offloading will look into all
DNS packets and tries to aggregate them based on the inner data found
within. This could lead to data corruption and malformed DNS packets.

While this patch minimizes the problem and helps an administrator to find
the issue by querying ip tunnel/fou, a better way would be to match on
the specific destination ip address so if a user space socket is bound
to the same address it will conflict.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:28:24 -05:00
Elad Raz f1fecb1d10 bridge: Reflect MDB entries to hardware
Offload MDB changes per port to hardware

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 16:50:21 -05:00
Elad Raz 4d41e12593 switchdev: Adding MDB entry offload
Define HW multicast entry: MAC and VID.
Using a MAC address simplifies support for both IPV4 and IPv6.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 16:50:20 -05:00
Sven Eckelmann ed21d170e8 batman-adv: Add kerneldoc for batadv_neigh_node::refcount
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 8a3719a184 batman-adv: Remove kerneldoc for missing struct members
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 006a199d5d batman-adv: Fix kerneldoc member names in for main structs
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 426fc6c811 batman-adv: Fix kernel-doc parsing of main structs
kernel-doc is not able to skip an #ifdef between the kernel documentation
block and the start of the struct. Moving the #ifdef before the kernel doc
block avoids this problem

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring e087f34f28 batman-adv: Split a condition check
Let us split a check for a condition at the beginning of the
batadv_is_ap_isolated() function so that a direct return can be performed
in this function if the variable "vlan" contained a null pointer.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring f75a33aeed batman-adv: Delete an unnecessary check before the function call "batadv_softif_vlan_free_ref"
The batadv_softif_vlan_free_ref() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring 8bbb7cb232 batman-adv: Less checks in batadv_tvlv_unicast_send()
* Let us return directly if a call of the batadv_orig_hash_find() function
  returned a null pointer.

* Omit the initialisation for the variable "skb" at the beginning.

* Replace an assignment by a call of the kfree_skb() function
  and delete the affected variable "ret" then.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring c799443ee1 batman-adv: Delete unnecessary checks before the function call "kfree_skb"
The kfree_skb() function tests whether its argument is NULL and then
returns immediately. Thus the test around the calls is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann d737ccbed3 batman-adv: Add function to convert string to batadv throughput
The code to convert the throughput information from a string to the
batman-adv internal (100Kibit/s) representation is duplicated in
batadv_parse_gw_bandwidth. Move this functionality to its own function
batadv_parse_throughput to reduce the code complexity.

Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich 9e728e8438 batman-adv: only call post function if something changed
Currently, the post function is also called on errors or if there were
no changes, which is redundant for the functions currently using these
facilities.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich e1544f3c87 batman-adv: increase BLA wait periods to 6
If networks take a long time to come up, e.g. due to lossy links, then
the bridge loop avoidance wait time to suppress broadcasts may not wait
long enough and detect a backbone before the mesh is brought up.
Increasing the wait period further to 60 seconds makes this scenario
less likely.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich d68081a240 batman-adv: purge bridge loop avoidance when its disabled
When bridge loop avoidance is disabled through sysfs, the internal
datastructures are not disabled, but only BLA operations are disabled.
To be sure that they are removed, purge the data immediately. That is
especially useful if a firmwares network state is changed, and the BLA
wait periods should restart on the new network.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Marek Lindner 143d157c9e batman-adv: remove leftovers of unused BATADV_PRIMARIES_FIRST_HOP flag
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 008a374487 batman-adv: Fix lockdep annotation of batadv_tlv_container_remove
The function handles tlv containers and not tlv handlers. Thus the
lockdep_assert_held has to check for the container_list lock.

Fixes: 2c72d655b0 ("batman-adv: Annotate deleting functions with external lock via lockdep")
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich 4a4d045eb2 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Lance Richardson ad64b8be71 ipv4: eliminate lock count warnings in ping.c
Add lock release/acquire annotations to ping_seq_start() and
ping_seq_stop() to satisfy sparse.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 21:30:43 -05:00
Lance Richardson 30d3d83a7d ipv4: fix endianness warnings in ip_tunnel_core.c
Eliminate endianness mismatch warnings (reported by sparse) in this file by
using appropriate nla_put_*()/nla_get_*() calls.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 21:30:43 -05:00
Al Viro 6108209c4a Merge branch 'for-linus' into work.misc 2016-01-08 21:20:11 -05:00
David S. Miller 9b59377b75 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Release nf_tables objects on netns destructions via
   nft_release_afinfo().

2) Destroy basechain and rules on netdevice removal in the new netdev
   family.

3) Get rid of defensive check against removal of inactive objects in
   nf_tables.

4) Pass down netns pointer to our existing nfnetlink callbacks, as well
   as commit() and abort() nfnetlink callbacks.

5) Allow to invert limit expression in nf_tables, so we can throttle
   overlimit traffic.

6) Add packet duplication for the netdev family.

7) Add forward expression for the netdev family.

8) Define pr_fmt() in conntrack helpers.

9) Don't leave nfqueue configuration on inconsistent state in case of
   errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
   him.

10) Skip queue option handling after unbind.

11) Return error on unknown both in nfqueue and nflog command.

12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.

13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
    from Carlos Falgueras.

14) Add support for 64 bit byteordering changes nf_tables, from Florian
    Westphal.

15) Add conntrack byte/packet counter matching support to nf_tables,
    also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 20:53:16 -05:00
David S. Miller 250fbf129e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-01-08

Here's one more bluetooth-next pull request for the 4.5 kernel:

 - Support for CRC check and promiscuous mode for CC2520
 - Fixes to btmrvl driver
 - New ACPI IDs for hci_bcm driver
 - Limited Discovery support for the Bluetooth mgmt interface
 - Minor other cleanups here and there

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 13:17:31 -05:00
Florian Westphal 48f66c905a netfilter: nft_ct: add byte/packet counter support
If the accounting extension isn't present, we'll return a counter
value of 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 14:44:09 +01:00
Florian Westphal ce1e7989d9 netfilter: nft_byteorder: provide 64bit le/be conversion
Needed to convert the (64bit) conntrack counters to BE ordering.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:32:11 +01:00
Carlos Falgueras García e6d8ecac9e netfilter: nf_tables: Add new attributes into nft_set to store user data.
User data is stored at after 'nft_set_ops' private data into 'data[]'
flexible array. The field 'udata' points to user data and 'udlen' stores
its length.

Add new flag NFTA_SET_USERDATA.

Signed-off-by: Carlos Falgueras García <carlosfg@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:08 +01:00
Ken-ichirou MATSUZAWA eb075954e9 netfilter: nfnetlink_log: just returns error for unknown command
This patch stops processing options for unknown command.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:07 +01:00
Ken-ichirou MATSUZAWA 71b2e5f5ca netfilter: nfnetlink_queue: autoload nf_conntrack_netlink module NFQA_CFG_F_CONNTRACK config flag
This patch enables to load nf_conntrack_netlink module if
NFQA_CFG_F_CONNTRACK config flag is specified.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:06 +01:00
Ken-ichirou MATSUZAWA 21c3c971d1 netfilter: nfnetlink_queue: just returns error for unknown command
This patch stops processing options for unknown command.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:05 +01:00
Ken-ichirou MATSUZAWA 17bc6b4884 netfilter: nfnetlink_queue: don't handle options after unbind
This patch stops processing after destroying a queue instance.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:04 +01:00
Ken-ichirou MATSUZAWA 60d2c7f9ab netfilter: nfnetlink_queue: validate dependencies to avoid breaking atomicity
Check that dependencies are fulfilled before updating the queue
instance, otherwise we can leave things in intermediate state on errors
in nfqnl_recv_config().

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:03 +01:00
Trond Myklebust daaadd2283 Merge branch 'bugfixes'
* bugfixes:
  SUNRPC: Fixup socket wait for memory
  SUNRPC: Fix a missing break in rpc_anyaddr()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  NFS: Ensure we revalidate attributes before using execute_ok()
  NFS: Flush reclaim writes using FLUSH_COND_STABLE
  NFS: Background flush should not be low priority
  NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
  NFSv4: Don't perform cached access checks before we've OPENed the file
  NFS: Allow the combination pNFS and labeled NFS
  NFS42: handle layoutstats stateid error
  nfs: Fix race in __update_open_stateid()
  nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07 18:45:36 -05:00
Andrew Lunn 0071f56e46 dsa: Register netdev before phy
When the phy is connected, an info message is printed. If the netdev
it is attached to has not been registered yet, the name
'uninitialised' in the output. By registering the netdev first, then
connecting they phy, we can avoid this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn 7f854420fb phy: Add API for {un}registering an mdio device to a bus.
Rather than have drivers directly manipulate the mii_bus structure,
provide and API for registering and unregistering devices on an MDIO
bus, and performing lookups.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn e5a03bfd87 phy: Add an mdio_device structure
Not all devices attached to an MDIO bus are phys. So add an
mdio_device structure to represent the generic parts of an mdio
device, and place this structure into the phy_device.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn 2220943a21 phy: Centralise print about attached phy
Many Ethernet drivers contain the same netdev_info() print statement
about the attached phy. Move it into the phy device code. Additionally
add a varargs function which can be used to append additional
information.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:25 -05:00
J. Bruce Fields 3daa020f9b Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
This reverts commit 6f18dc8939.

Just as one example, it appears this code could do the wrong thing in
the case of a two-byte NFS READ that crosses a page boundary.

Chuck says: "In that case, nfsd would pass down an xdr_buf that has one
byte in a page, one byte in another page, and a two-byte XDR pad. The
logic introduced by this optimization would be fooled, and neither the
second byte nor the XDR pad would be written to the client."

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07 10:10:48 -05:00
Sven Eckelmann 13bbdd370f batman-adv: Fix invalid read while copying bat_iv.bcast_own
batadv_iv_ogm_orig_del_if removes a part of the bcast_own which previously
belonged to the now removed interface. This is done by copying all data
which comes before the removed interface and then appending all the data
which comes after the removed interface.

The address calculation for the position of the data which comes after the
removed interface assumed that the bat_iv.bcast_own is a pointer to a
single byte datatype. But it is a pointer to unsigned long and thus the
calculated position was wrong off factor sizeof(unsigned long).

Fixes: 83a8342678a0 ("more basic routing code added (forwarding packets /
bitarray added)")

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-07 14:24:05 +08:00
David S. Miller 9e0efaf6b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-01-06 22:54:18 -05:00
Yuchung Cheng 8b8a321ff7 tcp: fix zero cwnd in tcp_cwnd_reduction
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:

1) The proportional reduction mode
   inflight > ssthresh > 0

2) The reduction bound mode
  a) inflight == ssthresh > 0

  b) inflight < ssthresh
     sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.

In reality this bug is triggered only with a sequence of less common
events.  For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.

Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:39:56 -05:00
David S. Miller c7f5d10549 net: Add eth_platform_get_mac_address() helper.
A repeating pattern in drivers has become to use OF node information
and, if not found, platform specific host information to extract the
ethernet address for a given device.

Currently this is done with a call to of_get_mac_address() and then
some ifdef'd stuff for SPARC.

Consolidate this into a portable routine, and provide the
arch_get_platform_mac_address() weak function hook for all
architectures to implement if they want.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:31:56 -05:00
Francesco Ruggeri 07a5d38453 net: possible use after free in dst_release
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.

Fixes: d69bbf88c8 ("net: fix a race in dst_release()")
Fixes: 27b75c95f1 ("net: avoid RCU for NOCACHE dst")
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 15:00:27 -05:00
Elad Raz 404cdbf089 bridge: add vlan filtering change for new bridged device
Notifying hardware about newly bridged port vlan-aware changes.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:41 -05:00
Elad Raz 6b72a77020 bridge: add vlan filtering change notification
Notifying hardware about bridge vlan-aware changes.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:40 -05:00
Elad Raz 08474cc1e6 bridge: Propagate vlan add failure to user
Disallow adding interfaces to a bridge when vlan filtering operation
failed. Send the failure code to the user.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:40 -05:00
Trond Myklebust 13331a551a SUNRPC: Fixup socket wait for memory
We're seeing hangs in the NFS client code, with loops of the form:

 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)
 RPC:       xs_tcp_send_request(267368) = -11
 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)

Turns out commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
moved SOCKWQ_ASYNC_NOSPACE out of sock->flags and into sk->sk_wq->flags,
however it never tried to fix up the code in net/sunrpc.

The new idiom is to use the flags in the RCU protected struct socket_wq.
While we're at it, clear out the now redundant places where we set/clear
SOCKWQ_ASYNC_NOSPACE and SOCK_NOSPACE. In principle, sk_stream_wait_memory()
is supposed to set these for us, so we only need to clear them in the
particular case of our ->write_space() callback.

Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Cc: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-06 12:22:25 -05:00
Johannes Berg 787b306cf3 Bluetooth: avoid rebuilding hci_sock all the time
Instead, allow using string formatting with send_monitor_note()
and access init_utsname().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-06 16:36:44 +01:00
John Fastabend 73c20a8b72 net: sched: fix missing free per cpu on qstats
When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.

Fixes: b0ab6f9275 ("net: sched: enable per cpu qstats")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:40:21 -05:00
Craig Gallek 00ce3a15d8 soreuseport: change consume_skb to kfree_skb in error case
Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:30:27 -05:00
Craig Gallek 1134158ba3 soreuseport: pass skb to secondary UDP socket lookup
This socket-lookup path did not pass along the skb in question
in my original BPF-based socket selection patch.  The skb in the
udpN_lib_lookup2 path can be used for BPF-based socket selection just
like it is in the 'traditional' udpN_lib_lookup path.

udpN_lib_lookup2 kicks in when there are greater than 10 sockets in
the same hlist slot.  Coincidentally, I chose 10 sockets per
reuseport group in my functional test, so the lookup2 path was not
excersised. This adds an additional set of tests with 20 sockets.

Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Fixes: 3ca8e40299 ("soreuseport: BPF selection functional test")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:28:04 -05:00
Florian Westphal a72a5e2d34 inet: kill unused skb_free op
The only user was removed in commit
029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations").

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 22:25:57 -05:00
Hannes Frederic Sowa ff62198553 bridge: Only call /sbin/bridge-stp for the initial network namespace
[I stole this patch from Eric Biederman. He wrote:]

> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name which if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device could cause problems.

[Hannes: changed patch using netns_eq]

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 16:46:17 -05:00
Xin Long c79c066691 sctp: remove the local_bh_disable/enable in sctp_endpoint_lookup_assoc
sctp_endpoint_lookup_assoc is called in the protection of sock lock
there is no need to call local_bh_disable in this function. so remove
them.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:02 -05:00
Xin Long b5eff71283 sctp: drop the old assoc hashtable of sctp
transport hashtable will replace the association hashtable,
so association hashtable is not used in sctp any more, so
drop the codes about that.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long 39f66a7dce sctp: apply rhashtable api to sctp procfs
Traversal the transport rhashtable, get the association only once through
the condition assoc->peer.primary_path != transport.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long 4f00878126 sctp: apply rhashtable api to send/recv path
apply lookup apis to two functions, for __sctp_endpoint_lookup_assoc
and __sctp_lookup_association, it's invoked in the protection of sock
lock, it will be safe, but sctp_lookup_association need to call
rcu_read_lock() and to detect the t->dead to protect it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long d6c0256a60 sctp: add the rhashtable apis for sctp global transport hashtable
tranport hashtbale will replace the association hashtable to do the
lookup for transport, and then get association by t->assoc, rhashtable
apis will be used because of it's resizable, scalable and using rcu.

lport + rport + paddr will be the base hashkey to locate the chain,
with net to protect one netns from another, then plus the laddr to
compare to get the target.

this patch will provider the lookup functions:
- sctp_epaddr_lookup_transport
- sctp_addrs_lookup_transport

hash/unhash functions:
- sctp_hash_transport
- sctp_unhash_transport

init/destroy functions:
- sctp_transport_hashtable_init
- sctp_transport_hashtable_destroy

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:00 -05:00
Johan Hedberg 78b781ca0d Bluetooth: Add support for Start Limited Discovery command
This patch implements the mgmt Start Limited Discovery command. Most
of existing Start Discovery code is reused since the only difference
is the presence of a 'limited' flag as part of the discovery state.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-05 17:02:50 +01:00
Johan Hedberg 0d3b7f64c8 Bluetooth: Change eir_has_data_type() to more generic eir_get_data()
To make the EIR parsing helper more general purpose, make it return
the found data and its length rather than just saying whether the data
was present or not.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-05 17:02:49 +01:00
Rainer Weikusat c845acb324 af_unix: Fix splice-bind deadlock
On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice
system call and AF_UNIX sockets,

http://lists.openwall.net/netdev/2015/11/06/24

The situation was analyzed as

(a while ago) A: socketpair()
B: splice() from a pipe to /mnt/regular_file
	does sb_start_write() on /mnt
C: try to freeze /mnt
	wait for B to finish with /mnt
A: bind() try to bind our socket to /mnt/new_socket_name
	lock our socket, see it not bound yet
	decide that it needs to create something in /mnt
	try to do sb_start_write() on /mnt, block (it's
	waiting for C).
D: splice() from the same pipe to our socket
	lock the pipe, see that socket is connected
	try to lock the socket, block waiting for A
B:	get around to actually feeding a chunk from
	pipe to file, try to lock the pipe.  Deadlock.

on 2015/11/10 by Al Viro,

http://lists.openwall.net/netdev/2015/11/10/4

The patch fixes this by removing the kern_path_create related code from
unix_mknod and executing it as part of unix_bind prior acquiring the
readlock of the socket in question. This means that A (as used above)
will sb_start_write on /mnt before it acquires the readlock, hence, it
won't indirectly block B which first did a sb_start_write and then
waited for a thread trying to acquire the readlock. Consequently, A
being blocked by C waiting for B won't cause a deadlock anymore
(effectively, both A and B acquire two locks in opposite order in the
situation described above).

Dmitry Vyukov(<dvyukov@google.com>) tested the original patch.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 23:22:49 -05:00
David Ahern b5bdacf3bb net: Propagate lookup failure in l3mdev_get_saddr to caller
Commands run in a vrf context are not failing as expected on a route lookup:
    root@kenny:~# ip ro ls table vrf-red
    unreachable default

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    ping: Warning: source address might be selected on device other than vrf-red.
    PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

    --- 10.100.1.254 ping statistics ---
    2 packets transmitted, 0 received, 100% packet loss, time 999ms

Since the vrf table does not have a route for 10.100.1.254 the ping
should have failed. The saddr lookup causes a full VRF table lookup.
Propogating a lookup failure to the user allows the command to fail as
expected:

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    connect: No route to host

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:58:30 -05:00
Craig Gallek 538950a1b7 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group.  These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.

This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:49:59 -05:00
Craig Gallek e32ea7e747 soreuseport: fast reuseport UDP socket selection
Include a struct sock_reuseport instance when a UDP socket binds to
a specific address for the first time with the reuseport flag set.
When selecting a socket for an incoming UDP packet, use the information
available in sock_reuseport if present.

This required adding an additional field to the UDP source address
equality function to differentiate between exact and wildcard matches.
The original use case allowed wildcard matches when checking for
existing port uses during bind.  The new use case of adding a socket
to a reuseport group requires exact address matching.

Performance test (using a machine with 2 CPU sockets and a total of
48 cores):  Create reuseport groups of varying size.  Use one socket
from this group per user thread (pinning each thread to a different
core) calling recvmmsg in a tight loop.  Record number of messages
received per second while saturating a 10G link.
  10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
  20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
  40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)

This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.

Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:49:58 -05:00
Craig Gallek ef456144da soreuseport: define reuseport groups
struct sock_reuseport is an optional shared structure referenced by each
socket belonging to a reuseport group.  When a socket is bound to an
address/port not yet in use and the reuseport flag has been set, the
structure will be allocated and attached to the newly bound socket.
When subsequent calls to bind are made for the same address/port, the
shared structure will be updated to include the new socket and the
newly bound socket will reference the group structure.

Usually, when an incoming packet was destined for a reuseport group,
all sockets in the same group needed to be considered before a
dispatching decision was made.  With this structure, an appropriate
socket can be found after looking up just one socket in the group.

This shared structure will also allow for more complicated decisions to
be made when selecting a socket (eg a BPF filter).

This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 22:49:58 -05:00
David S. Miller 15ab90f400 NFC 4.5 pull request
This is the first NFC pull request for 4.5 and it brings:
 
 - A new driver for the STMicroelectronics ST95HF NFC chipset.
   The ST95HF is an NFC digital transceiver with an embedded analog
   front-end and as such relies on the Linux NFC digital
   implementation. This is the 3rd user of the NFC digital stack.
 
 - ACPI support for the ST st-nci and st21nfca drivers.
 
 - A small improvement for the nfcsim driver, as we can now tune
   the Rx delay through sysfs.
 
 - A bunch of minor cleanups and small fixes from Christophe Ricard,
   for a few drivers and the NFC core code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWgsv0AAoJEIqAPN1PVmxKvukP/3eJwA+chAUF89/fqwqFTaJN
 fffLoOxx2OBIbTXD2VV36yw4bAo9tbKDAYBZiot3Ig7Kg0SeCJ5oPA9xCVnWPrEY
 hxFAldvl+lWQs8nrUOgUZItiFBeUPdfW9YX/yKhUZVc+602nUG/e/+6x8B5MhIce
 SAfgCyd0c16DApltP7sw1muyZMvsO6Ow6dyNzDUVYZuabvEhe3SLSj9KFJi7Thsp
 h41Iv+bwPLhwF4RXGA6rei/gdEDSMRohprdj3uTDiTarGW+OpcAO0zWACS5m4eR0
 zF19+HjGPUk/LpFRaU31xX0ZQQjOTmmfsOt4FBb3P7oJx47egycsadLYPexeG7nj
 ruyS6ezlRX1I/tZsnyLNJK92mK5TXYLz2uJ8r2ii/BgPNE+AErB3zKCC+EjXzWhh
 AvClGu5b88WJLxoq3I3l5evPwGhebGZ8N/1uiFsHOxvzKVLgxwOmNLRGN4XXxB2i
 UbIHgBb6smsu/l+3q9R83kfoMaoWnr+OUIi2QPQVDt/K7t1LfsCuIhzcGSgo1VuW
 fGlA1iu+CNDknofeCl4JDo2UXAETO4gdKWw87GXeUcbbraLUczZeO7FFLZqxbMYc
 OCaPYshmVFeZRypYdRWDHw67ivj0/h+9iq4PP1XOROkRFH746dD/p4yamJwVi20B
 samZ8VPwzgH3/ohQJyX3
 =VFUH
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.5 pull request

This is the first NFC pull request for 4.5 and it brings:

- A new driver for the STMicroelectronics ST95HF NFC chipset.
  The ST95HF is an NFC digital transceiver with an embedded analog
  front-end and as such relies on the Linux NFC digital
  implementation. This is the 3rd user of the NFC digital stack.

- ACPI support for the ST st-nci and st21nfca drivers.

- A small improvement for the nfcsim driver, as we can now tune
  the Rx delay through sysfs.

- A bunch of minor cleanups and small fixes from Christophe Ricard,
  for a few drivers and the NFC core code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 21:48:15 -05:00
Eric Dumazet 197c949e77 udp: properly support MSG_PEEK with truncated buffers
Backport of this upstream commit into stable kernels :
89c22d8c3b ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.

In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
                                 msg->msg_iov);
returns -EFAULT.

This bug does not happen in upstream kernels since Al Viro did a great
job to replace this into :
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.

For the time being, instead reverting Herbert Xu patch and add back
skb->ip_summed invalid changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.

This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 17:23:36 -05:00
Guillaume Nault 98f40b3e22 l2tp: rely on ppp layer for skb scrubbing
Since 79c441ae50 ("ppp: implement x-netns support"), the PPP layer
calls skb_scrub_packet() whenever the skb is received on the PPP
device. Manually resetting packet meta-data in the L2TP layer is thus
redundant.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04 16:45:24 -05:00
Julia Lawall 29663b0cc1 mac802154: constify ieee802154_llsec_ops structure
The ieee802154_llsec_ops structure is never modified, so declare it as
const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-04 20:40:41 +01:00
Heikki Krogerus d3d2072540 Bluetooth: hci_bcm: move all Broadcom ACPI IDs to BCM HCI driver
The IDs should all be for Broadcom BCM43241 module, and
hci_bcm is now the proper driver for them. This removes one
of two different ways of handling PM with the module.

Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-04 19:22:05 +01:00
Pablo Neira Ayuso ad6d950393 netfilter: nf_ct_helper: define pr_fmt()
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-04 17:48:51 +01:00
Pablo Neira Ayuso 39e6dea28a netfilter: nf_tables: add forward expression to the netdev family
You can use this to forward packets from ingress to the egress path of
the specified interface. This provides a fast path to bounce packets
from one interface to another specific destination interface.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-04 17:48:38 +01:00
Al Viro 222e4adec5 ... and a couple in net/9p
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:29:17 -05:00
Al Viro 16e5c1fc36 convert a bunch of open-coded instances of memdup_user_nul()
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:26:58 -05:00
Al Viro 7e935c7ca1 Merge branch 'memdup_user_nul' into work.misc 2016-01-04 10:25:34 -05:00
Pablo Neira Ayuso 502061f81d netfilter: nf_tables: add packet duplication to the netdev family
You can use this to duplicate packets and inject them at the egress path
of the specified interface. This duplication allows you to inspect
traffic from the dummy or any other interface dedicated to this purpose.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-03 21:04:23 +01:00
Pablo Neira Ayuso c7862a5f0d netfilter: nft_limit: allow to invert matching criteria
This patch allows you to invert the ratelimit matching criteria, so you
can match packets over the ratelimit. This is required to support what
hashlimit does.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-03 20:58:52 +01:00
David S. Miller c07f30ad68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-31 18:20:10 -05:00
David S. Miller 7b22a7788b Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-12-31

Here's (probably) the last bluetooth-next pull request for the 4.5
kernel:

 - Add support for BCM2E65 ACPI ID
 - Minor fixes/cleanups in the bcm203x & bfusb drivers
 - Minor debugfs related fix in 6lowpan code

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-31 12:22:45 -05:00
Andrew Lunn f3a4094558 ethtool: Add phy statistics
Ethernet PHYs can maintain statistics, for example errors while idle
and receive errors. Add an ethtool mechanism to retrieve these
statistics, using the same model as MAC statistics.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-31 00:53:10 -05:00
Trond Myklebust 0b161e6330 SUNRPC: Fix a missing break in rpc_anyaddr()
The missing break means that we always return EAFNOSUPPORT when
faced with a request for an IPv6 loopback address.

Reported-by: coverity (CID 401987)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-30 18:14:06 -05:00
Xin Long 068d8bd338 sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close
In sctp_close, sctp_make_abort_user may return NULL because of memory
allocation failure. If this happens, it will bypass any state change
and never free the assoc. The assoc has no chance to be freed and it
will be kept in memory with the state it had even after the socket is
closed by sctp_close().

So if sctp_make_abort_user fails to allocate memory, we should abort
the asoc via sctp_primitive_ABORT as well. Just like the annotation in
sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
"Even if we can't send the ABORT due to low memory delete the TCB.
This is a departure from our typical NOMEM handling".

But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
dereference the chunk pointer, and system crash. So we should add
SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like other
places where it adds SCTP_CMD_REPLY cmd.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-30 16:57:16 -05:00
Nicolai Stange 574aab1e02 net, socket, socket_wq: fix missing initialization of flags
Commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection") from
the current 4.4 release cycle introduced a new flags member in
struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
from struct socket's flags member into that new place.

Unfortunately, the new flags field is never initialized properly, at least
not for the struct socket_wq instance created in sock_alloc_inode().

One particular issue I encountered because of this is that my GNU Emacs
failed to draw anything on my desktop -- i.e. what I got is a transparent
window, including the title bar. Bisection lead to the commit mentioned
above and further investigation by means of strace told me that Emacs
is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
reproducible 100% of times and the fact that properly initializing the
struct socket_wq ->flags fixes the issue leads me to the conclusion that
somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
preventing my Emacs from receiving any SIGIO's due to data becoming
available and it got stuck.

Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
member to zero.

Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-30 16:38:01 -05:00
Joe Stringer 90c7afc96c openvswitch: Fix template leak in error cases.
Commit 5b48bb8506c5 ("openvswitch: Fix helper reference leak") fixed a
reference leak on helper objects, but inadvertently introduced a leak on
the ct template.

Previously, ct_info.ct->general.use was initialized to 0 by
nf_ct_tmpl_alloc() and only incremented when ovs_ct_copy_action()
returned successful. If an error occurred while adding the helper or
adding the action to the actions buffer, the __ovs_ct_free_action()
cleanup would use nf_ct_put() to free the entry; However, this relies on
atomic_dec_and_test(ct_info.ct->general.use). This reference must be
incremented first, or nf_ct_put() will never free it.

Fix the issue by acquiring a reference to the template immediately after
allocation.

Fixes: cae3a26275 ("openvswitch: Allow attaching helpers to ct action")
Fixes: 5b48bb8506c5 ("openvswitch: Fix helper reference leak")
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-29 15:27:52 -05:00
Dan Carpenter c6dc65d885 NFC: nci: memory leak in nci_core_conn_create()
I've moved the check for "number_destination_params" forward
a few lines to avoid leaking "cmd".

Fixes: caa575a86e ('NFC: nci: fix possible crash in nci_core_conn_create')

Acked-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-12-29 19:06:23 +01:00
Christophe Ricard 9afec6d386 nfc: netlink: HCI event connectivity implementation
Add support for missing HCI event EVT_CONNECTIVITY and forward
it to userspace.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-12-29 19:06:20 +01:00
Christophe Ricard 2a84193f14 NFC: nci: Fix error check of nci_hci_create_pipe() result
net/nfc/nci/hci.c: In function nci_hci_connect_gate :
net/nfc/nci/hci.c:679: warning: comparison is always false due to limited range of data type

In case of error, nci_hci_create_pipe() returns NCI_HCI_INVALID_PIPE,
and not a negative error code.

Correct the check to fix this.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-12-29 19:06:18 +01:00
Shikha Singh ce2e56cdfb NFC: digital: Add Type4A tags support
The definition of DIGITAL_PROTO_NFCA_RF_TECH is modified to support
ISO14443 Type4A tags. Without this change it is not possible to start
polling for ISO14443 Type4A tags from the initiator side.

Signed-off-by: Shikha Singh <shikha.singh@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2015-12-29 19:06:05 +01:00
Trond Myklebust 8d0ed0ca63 NFS: NFSoRDMA Client Side Changes
These patches mostly fix send queue ordering issues inside the NFSoRDMA
 client, but there are also two patches from Dan Carpenter fixing up smatch
 warnings.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWdG7tAAoJENfLVL+wpUDrqWEP/j+K2BVkanx0/Yi1WjFHPuTW
 sKc02qeuhajVcS4QdLygD+4fcnVHsqfmStOg1dHLkQpG62mot8R59aLctISjndfo
 Q8O0JlUPYJSUZ1A5e++8BqHi6IiN+hBw+4rxuyqseEvsgcWI1bzFl/Ajm823fjMB
 HjcpLPokKmZvgIRpasBtA4O6Lg01rDb80v9a/DFPOa8cwSVmervgHa1ZEGU/KyZH
 5taRYUiIIkankEJE4VqQQ86bWIGKbuGlsHH+7BcDxtqNISQhq8uZsC2EMOxtdjnf
 IKHt6B6IVZoMfERSaRHOCkK30sEDUi0u3zYkgodFq7xNTW7MBwmCs/xEUqhLtBoe
 KV2imTNOuwzl6v3Vnjm9yKa+0PF8ejg3VYNZOrZWSeBaGPSmyiDlQ05zdcH+F5i7
 OQeo1ebIb2Wu6PJgZwA3QbNjWkddlwF7WrA58PbHgxEyYJeMHJ697g2cQiUdCBxl
 mbnkHyZqcG4ko2hbt5kNM+1TWQoY+HdGs8BJAV6W+UBO6ZYNL+ciM23uFSb5XiJC
 bWUPgUEu6vepW0JaCMt0clpFlWtLHFzclwkgOiWfiNKOtmHeENzE0L4JmQX6QNa6
 xW/Vn6ctxVx6AzTNdKZO8VK+XUgkH/7Y/x3H/c+zAeq25M2CJs36zdgkDGc9v4h9
 eVEBs2tHgI23aAk8brgY
 =kwhB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

These patches mostly fix send queue ordering issues inside the NFSoRDMA
client, but there are also two patches from Dan Carpenter fixing up smatch
warnings.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-4.5' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Revert commit e7104a2a96 ('xprtrdma: Cap req_cqinit').
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: checking for NULL instead of IS_ERR()
  xprtrdma: clean up some curly braces
2015-12-28 14:49:14 -05:00
Pablo Neira Ayuso 5913beaf0d netfilter: nfnetlink: pass down netns pointer to commit() and abort() callbacks
Adapt callsites to avoid recurrent lookup of the netns pointer.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:43:15 +01:00
Pablo Neira Ayuso 7b8002a151 netfilter: nfnetlink: pass down netns pointer to call() and call_rcu()
Adapt callsites to avoid recurrent lookup of the netns pointer.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:41:41 +01:00
Pablo Neira Ayuso f4c756b4ea netfilter: nf_tables: remove check against removal of inactive objects
The following sequence inside a batch, although not very useful, is
valid:

 add table foo
 ...
 delete table foo

This may be generated by some robot while applying some incremental
upgrade, so remove the defensive checks against this.

This patch keeps the check on the get/dump path by now, we have to
replace the inactive flag by introducing object generations.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:37:20 +01:00
Pablo Neira Ayuso 5ebe0b0eec netfilter: nf_tables: destroy basechain and rules on netdevice removal
If the netdevice is destroyed, the resources that are attached should
be released too as they belong to the device that is now gone.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:34:35 +01:00
Pablo Neira Ayuso df05ef874b netfilter: nf_tables: release objects on netns destruction
We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.

The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.

Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-28 18:34:35 +01:00
Stefan Hajnoczi d1358917f2 SUNRPC: drop unused xs_reclassify_socketX() helpers
xs_reclassify_socket4() and friends used to be called directly.
xs_reclassify_socket() is called instead nowadays.

The xs_reclassify_socketX() helper functions are empty when
CONFIG_DEBUG_LOCK_ALLOC is not defined.  Drop them since they have no
callers.

Note that AF_LOCAL still calls xs_reclassify_socketu() directly but is
easily converted to generic xs_reclassify_socket().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-12-28 09:57:15 -05:00
Marcelo Ricardo Leitner 3538a5c8ff sctp: label accepted/peeled off sockets
Accepted or peeled off sockets were missing a security label (e.g.
SELinux) which means that socket was in "unlabeled" state.

This patch clones the sock's label from the parent sock and resolves the
issue (similar to AF_BLUETOOTH protocol family).

Cc: Paul Moore <pmoore@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-28 00:18:47 -05:00
Marcelo Ricardo Leitner 9ba0b9636d sctp: use GFP_USER for user-controlled kmalloc
Commit cacc062152 ("sctp: use GFP_USER for user-controlled kmalloc")
missed two other spots.

For connectx, as it's more likely to be used by kernel users of the API,
it detects if GFP_USER should be used or not.

Fixes: cacc062152 ("sctp: use GFP_USER for user-controlled kmalloc")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-28 00:18:47 -05:00
Pravin B Shelar 039f50629b ip_tunnel: Move stats update to iptunnel_xmit()
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-25 23:32:23 -05:00
Geliang Tang aeb7ed14fe bridge: use kobj_to_dev instead of to_dev
kobj_to_dev has been defined in linux/device.h, so I replace to_dev
with it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-23 22:26:48 -05:00
Hannes Frederic Sowa c1a9a291ce ipv6: honor ifindex in case we receive ll addresses in router advertisements
Marc Haber reported we don't honor interface indexes when we receive link
local router addresses in router advertisements. Luckily the non-strict
version of ipv6_chk_addr already does the correct job here, so we can
simply use it to lighten the checks and use those addresses by default
without any configuration change.

Link: <http://permalink.gmane.org/gmane.linux.network/391348>
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Cc: Marc Haber <mh+netdev@zugschlus.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-23 22:03:54 -05:00
Scott Mayhew c3d4879e01 sunrpc: Add a function to close temporary transports immediately
Add a function svc_age_temp_xprts_now() to close temporary transports
whose xpt_local matches the address passed in server_addr immediately
instead of waiting for them to be closed by the timer function.

The function is intended to be used by notifier_blocks that will be
added to nfsd and lockd that will run when an ip address is deleted.

This will eliminate the ACK storms and client hangs that occur in
HA-NFS configurations where nfsd & lockd is left running on the cluster
nodes all the time and the NFS 'service' is migrated back and forth
within a short timeframe.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-12-23 10:08:15 -05:00
Florian Westphal 271c3b9b7b tcp: honour SO_BINDTODEVICE for TW_RST case too
Hannes points out that when we generate tcp reset for timewait sockets we
pretend we found no socket and pass NULL sk to tcp_vX_send_reset().

Make it cope with inet tw sockets and then provide tw sk.

This makes RSTs appear on correct interface when SO_BINDTODEVICE is used.

Packetdrill test case:
// want default route to be used, we rely on BINDTODEVICE
`ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
// test case still works due to BINDTODEVICE
0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0
0.100...0.200 connect(3, ..., ...) = 0

0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop>
0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop>
0.200 > . 1:1(0) ack 1

0.210 close(3) = 0

0.210 > F. 1:1(0) ack 1 win 29200
0.300 < . 1:1(0) ack 2 win 46

// more data while in FIN_WAIT2, expect RST
1.300 < P. 1:1001(1000) ack 1 win 46

// fails without this change -- default route is used
1.301 > R 1:1(0) win 0

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:03:05 -05:00
Florian Westphal e46787f0dd tcp: send_reset: test for non-NULL sk first
tcp_md5_do_lookup requires a full socket, so once we extend
_send_reset() to also accept timewait socket we would have to change

if (!sk && hash_location)

to something like

if ((!sk || !sk_fullsock(sk)) && hash_location) {
  ...
} else {
  (sk && sk_fullsock(sk)) tcp_md5_do_lookup()
}

Switch the two branches: check if we have a socket first, then
fall back to a listener lookup if we saw a md5 option (hash_location).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:03:05 -05:00
WANG Cong 5449a5ca9b addrconf: always initialize sysctl table data
When sysctl performs restrict writes, it allows to write from
a middle position of a sysctl file, which requires us to initialize
the table data before calling proc_dostring() for the write case.

Fixes: 3d1bec9932 ("ipv6: introduce secret_stable to ipv6_devconf")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 17:00:58 -05:00
David S. Miller 024f35c552 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-12-22

Just one patch to fix dst_entries_init with multiple namespaces.
From Dan Streetman.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 16:26:31 -05:00
Lorenzo Colitti 2010b93e93 net: tcp: deal with listen sockets properly in tcp_abort.
When closing a listen socket, tcp_abort currently calls
tcp_done without clearing the request queue. If the socket has a
child socket that is established but not yet accepted, the child
socket is then left without a parent, causing a leak.

Fix this by setting the socket state to TCP_CLOSE and calling
inet_csk_listen_stop with the socket lock held, like tcp_close
does.

Tested using net_test. With this patch, calling SOCK_DESTROY on a
listen socket that has an established but not yet accepted child
socket results in the parent and the child being closed, such
that they no longer appear in sock_diag dumps.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 16:01:47 -05:00
Andrey Ryabinin e459dfeeb6 ipv6/addrlabel: fix ip6addrlbl_get()
ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded,
ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed,
ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer.

Fix this by inverting ip6addrlbl_hold() check.

Fixes: 2a8cc6c890 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:57:54 -05:00
Ido Schimmel ef9cdd0fed switchdev: bridge: Pass ageing time as clock_t instead of jiffies
The bridge's ageing time is offloaded to hardware when:
	1) A port joins a bridge
	2) The ageing time of the bridge is changed

In the first case the ageing time is offloaded as jiffies, but in the
second case it's offloaded as clock_t, which is what existing switchdev
drivers expect to receive.

Fixes: 6ac311ae8b ("Adding switchdev ageing notification on port bridged")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:56:44 -05:00
Sebastian Andrzej Siewior f2830d0989 RDS: don't pretend to use cpu notifiers
It looks like an attempt to use CPU notifier here which was never
completed. Nobody tried to wire it up completely since 2k9. So I unwind
this code and get rid of everything not required. Oh look! 19 lines were
removed while code still does the same thing.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Tested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:23:05 -05:00
Geliang Tang 5c29482dd1 net-sysfs: use to_net_dev in net_namespace()
Use to_net_dev() instead of open-coding it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:04:09 -05:00
David S. Miller a7c09ae63d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains two netfilter fixes:

1) Oneliner from Florian to dump missing NFT_CT_L3PROTOCOL netlink
   attribute, from Florian Westphal.

2) Another oneliner for nf_tables to use skb->protocol from the new
   netdev family, we can't assume ethernet there.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 14:54:13 -05:00
Or Gerlitz e3e45b1b43 xprtrdma: Avoid calling ib_query_device
Instead, use the cached copy of the attributes present on the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22 14:39:00 -05:00
Or Gerlitz 0353261c12 net/rds: Avoid calling ib_query_device
Instead, use the cached copy of the attributes present on the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-12-22 14:39:00 -05:00
Alexander Aring 92e17ee72a 6lowpan: fix debugfs interface entry name
This patches moves the debugfs interface related register after
netdevice register. The function lowpan_dev_debugfs_init will use
"dev->name" which can be before register_netdevice a format string.
The function register_netdevice will evaluate the format string if
necessary and replace "dev->name" to the real interface name.

Reported-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-12-20 08:21:00 +01:00
Geliang Tang 7eb7404f7e Bluetooth: use list_for_each_entry*
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-12-20 08:11:10 +01:00
Simon Horman e905eabc90 openvswitch: correct encoding of set tunnel action attributes
In a set action tunnel attributes should be encoded in a
nested action.

I noticed this because ovs-dpctl was reporting an error
when dumping flows due to the incorrect encoding of tunnel attributes
in a set action.

Fixes: fc4099f172 ("openvswitch: Fix egress tunnel info.")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:35:35 -05:00
Pravin B Shelar 6d3c348a63 ipip: ioctl: Remove superfluous IP-TTL handling.
IP-TTL case is already handled in ip_tunnel_ioctl() API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:07:59 -05:00
Eric Dumazet 07f6f4a31e tcp: diag: add support for request sockets to tcp_abort()
Adding support for SYN_RECV request sockets to tcp_abort()
is quite easy after our tcp listener rewrite.

Note that we also need to better handle listeners, or we might
leak not yet accepted children, because of a missing
inet_csk_listen_stop() call.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Tested-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:06:39 -05:00
Daniel Borkmann 23bf88078a bpf: fix misleading comment in bpf_convert_filter
Comment says "User BPF's register A is mapped to our BPF register 6",
which is actually wrong as the mapping is on register 0. This can
already be inferred from the code itself. So just remove it before
someone makes assumptions based on that. Only code tells truth. ;)

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:04:51 -05:00
Daniel Borkmann 8b614aebec bpf: move clearing of A/X into classic to eBPF migration prologue
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via 83d5b7ef99
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.

This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3 ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e. cde66c2d88
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.

Lets move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Acked-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:04:51 -05:00
Daniel Borkmann 05c74e5e53 bpf: add bpf_skb_load_bytes helper
When hacking tc programs with eBPF, one of the issues that come up
from time to time is to load addresses from headers. In eBPF as in
classic BPF, we have BPF_LD | BPF_ABS | BPF_{B,H,W} instructions that
extract a byte, half-word or word out of the skb data though helpers
such as bpf_load_pointer() (interpreter case).

F.e. extracting a whole IPv6 address could possibly look like ...

  union v6addr {
    struct {
      __u32 p1;
      __u32 p2;
      __u32 p3;
      __u32 p4;
    };
    __u8 addr[16];
  };

  [...]

  a.p1 = htonl(load_word(skb, off));
  a.p2 = htonl(load_word(skb, off +  4));
  a.p3 = htonl(load_word(skb, off +  8));
  a.p4 = htonl(load_word(skb, off + 12));

  [...]

  /* access to a.addr[...] */

This work adds a complementary helper bpf_skb_load_bytes() (we also
have bpf_skb_store_bytes()) as an alternative where the same call
would look like from an eBPF program:

  ret = bpf_skb_load_bytes(skb, off, addr, sizeof(addr));

Same verifier restrictions apply as in ffeedafbf0 ("bpf: introduce
current->pid, tgid, uid, gid, comm accessors") case, where stack memory
access needs to be statically verified and thus guaranteed to be
initialized in first use (otherwise verifier cannot tell whether a
subsequent access to it is valid or not as it's runtime dependent).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 16:04:50 -05:00
David S. Miller 59ce9670ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:

1) Two patches to include dependencies in our netfilter userspace
   headers to resolve compilation problems, from Mikko Rapeli.

2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.

3) Remove duplicate include in the netfilter reject infrastructure,
   from Stephen Hemminger.

4) Two patches to simplify the netfilter defragmentation code for IPv6,
   patch from Florian Westphal.

5) Fix root ownership of /proc/net netfilter for unpriviledged net
   namespaces, from Philip Whineray.

6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.

7) Add mangling support to our nf_tables payload expression, from
   Patrick McHardy.

8) Introduce a new netlink-based tracing infrastructure for nf_tables,
   from Florian Westphal.

9) Change setter functions in nfnetlink_log to be void, from
    Rami Rosen.

10) Add netns support to the cttimeout infrastructure.

11) Add cgroup2 support to iptables, from Tejun Heo.

12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.

13) Add support for mangling pkttype in the nf_tables meta expression,
    also from Florian.

BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 15:37:42 -05:00
Chuck Lever 26ae9d1c5a xprtrdma: Revert commit e7104a2a96 ('xprtrdma: Cap req_cqinit').
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 68791649a7 xprtrdma: Invalidate in the RPC reply handler
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 73eee9b2de xprtrdma: Add ro_unmap_sync method for all-physical registration
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 7c7a5390dc xprtrdma: Add ro_unmap_sync method for FMR
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever c9918ff56d xprtrdma: Add ro_unmap_sync method for FRWR
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 32d0ceecdf xprtrdma: Introduce ro_unmap_sync method
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchonously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 3cf4e169be xprtrdma: Move struct ib_send_wr off the stack
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever c8bbe0c7fe xprtrdma: Disable RPC/RDMA backchannel debugging messages
Clean up.

Fixes: 63cae47005 ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever ffc4d9b159 xprtrdma: xprt_rdma_free() must not release backchannel reqs
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Chuck Lever 9b06688bc3 xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Dan Carpenter abfb689711 xprtrdma: checking for NULL instead of IS_ERR()
The rpcrdma_create_req() function returns error pointers or success.  It
never returns NULL.

Fixes: f531a5dbc4 ('xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
Dan Carpenter 38b95bcf12 xprtrdma: clean up some curly braces
It doesn't matter either way, but the curly braces were clearly intended
here.  It causes a Smatch warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-12-18 15:34:33 -05:00
David Ahern 6dd9a14e92 net: Allow accepted sockets to be bound to l3mdev domain
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.

This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:43:38 -05:00
Bjørn Mork cc9da6cc4f ipv6: addrconf: use stable address generator for ARPHRD_NONE
Add a new address generator mode, using the stable address generator
with an automatically generated secret. This is intended as a default
address generator mode for device types with no EUI64 implementation.
The new generator is used for ARPHRD_NONE interfaces initially, adding
default IPv6 autoconf support to e.g. tun interfaces.

If the addrgenmode is set to 'random', either by default or manually,
and no stable secret is available, then a random secret is used as
input for the stable-privacy address generator.  The secret can be
read and modified like manually configured secrets, using the proc
interface.  Modifying the secret will change the addrgen mode to
'stable-privacy' to indicate that it operates on a known secret.

Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
a known secret is available when the device is created, then the mode
will default to 'stable-privacy' as before.  The mode can be manually
set to 'random' but it will behave exactly like 'stable-privacy' in
this case. The secret will not change.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:41:07 -05:00
Arnd Bergmann 8cb964daeb ila: add NETFILTER dependency
The recently added generic ILA translation facility fails to
build when CONFIG_NETFILTER is disabled:

net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list
net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops'
 static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = {

This adds an explicit Kconfig dependency to avoid that case.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 7f00feaf10 ("ila: Add generic ILA translation facility")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18 14:19:28 -05:00
Florian Westphal d5f79b6e4d netfilter: nft_ct: include direction when dumping NFT_CT_L3PROTOCOL key
one nft userspace test case fails with

'ct l3proto original ipv4' mismatches 'ct l3proto ipv4'

... because NFTA_CT_DIRECTION attr is missing.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-18 14:45:45 +01:00
Pablo Neira Ayuso aa47e42c60 netfilter: nf_tables: use skb->protocol instead of assuming ethernet header
Otherwise we may end up with incorrect network and transport header for
other protocols.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-18 14:45:45 +01:00
Florian Westphal b4aae759c2 netfilter: meta: add support for setting skb->pkttype
This allows to redirect bridged packets to local machine:

ether type ip ether daddr set aa:53:08:12:34:56 meta pkttype set unicast
Without 'set unicast', ip stack discards PACKET_OTHERHOST skbs.

It is also useful to add support for a '-m cluster like' nft rule
(where switch floods packets to several nodes, and each cluster node
 node processes a subset of packets for load distribution).

Mangling is restricted to HOST/OTHER/BROAD/MULTICAST, i.e. you cannot set
skb->pkt_type to PACKET_KERNEL or change PACKET_LOOPBACK to PACKET_HOST.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-18 14:12:56 +01:00
David S. Miller b3e0d3d7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/geneve.c

Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 22:08:28 -05:00
Linus Torvalds 73796d8bf2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix uninitialized variable warnings in nfnetlink_queue, a lot of
    people reported this...  From Arnd Bergmann.

 2) Don't init mutex twice in i40e driver, from Jesse Brandeburg.

 3) Fix spurious EBUSY in rhashtable, from Herbert Xu.

 4) Missing DMA unmaps in mvpp2 driver, from Marcin Wojtas.

 5) Fix race with work structure access in pppoe driver causing
    corruptions, from Guillaume Nault.

 6) Fix OOPS due to sh_eth_rx() not checking whether netdev_alloc_skb()
    actually succeeded or not, from Sergei Shtylyov.

 7) Don't lose flags when settifn IFA_F_OPTIMISTIC in ipv6 code, from
    Bjørn Mork.

 8) VXLAN_HD_RCO defined incorrectly, fix from Jiri Benc.

 9) Fix clock source used for cookies in SCTP, from Marcelo Ricardo
    Leitner.

10) aurora driver needs HAS_DMA dependency, from Geert Uytterhoeven.

11) ndo_fill_metadata_dst op of vxlan has to handle ipv6 tunneling
    properly as well, from Jiri Benc.

12) Handle request sockets properly in xfrm layer, from Eric Dumazet.

13) Double stats update in ipv6 geneve transmit path, fix from Pravin B
    Shelar.

14) sk->sk_policy[] needs RCU protection, and as a result
    xfrm_policy_destroy() needs to free policies using an RCU grace
    period, from Eric Dumazet.

15) SCTP needs to clone ipv6 tx options in order to avoid use after
    free, from Eric Dumazet.

16) Missing kbuild export if ila.h, from Stephen Hemminger.

17) Missing mdiobus_alloc() return value checking in mdio-mux.c, from
    Tobias Klauser.

18) Validate protocol value range in ->create() methods, from Hannes
    Frederic Sowa.

19) Fix early socket demux races that result in illegal dst reuse, from
    Eric Dumazet.

20) Validate socket address length in pptp code, from WANG Cong.

21) skb_reorder_vlan_header() uses incorrect offset and can corrupt
    packets, from Vlad Yasevich.

22) Fix memory leaks in nl80211 registry code, from Ola Olsson.

23) Timeout loop count handing fixes in mISDN, xgbe, qlge, sfc, and
    qlcnic.  From Dan Carpenter.

24) msg.msg_iocb needs to be cleared in recvfrom() otherwise, for
    example, AF_ALG will interpret it as an async call.  From Tadeusz
    Struk.

25) inetpeer_set_addr_v4 forgets to initialize the 'vif' field, from
    Eric Dumazet.

26) rhashtable enforces the minimum table size not early enough,
    breaking how we calculate the per-cpu lock allocations.  From
    Herbert Xu.

27) Fix FCC port lockup in 82xx driver, from Martin Roth.

28) FOU sockets need to be freed using RCU, from Hannes Frederic Sowa.

29) Fix out-of-bounds access in __skb_complete_tx_timestamp() and
    sock_setsockopt() wrt.  timestamp handling.  From WANG Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (117 commits)
  net: check both type and procotol for tcp sockets
  drivers: net: xgene: fix Tx flow control
  tcp: restore fastopen with no data in SYN packet
  af_unix: Revert 'lock_interruptible' in stream receive code
  fou: clean up socket with kfree_rcu
  82xx: FCC: Fixing a bug causing to FCC port lock-up
  gianfar: Don't enable RX Filer if not supported
  net: fix warnings in 'make htmldocs' by moving macro definition out of field declaration
  rhashtable: Fix walker list corruption
  rhashtable: Enforce minimum size on initial hash table
  inet: tcp: fix inetpeer_set_addr_v4()
  ipv6: automatically enable stable privacy mode if stable_secret set
  net: fix uninitialized variable issue
  bluetooth: Validate socket address length in sco_sock_bind().
  net_sched: make qdisc_tree_decrease_qlen() work for non mq
  ser_gigaset: remove unnecessary kfree() calls from release method
  ser_gigaset: fix deallocation of platform device structure
  ser_gigaset: turn nonsense checks into WARN_ON
  ser_gigaset: fix up NULL checks
  qlcnic: fix a timeout loop
  ...
2015-12-17 14:05:22 -08:00
WANG Cong ac5cc97799 net: check both type and procotol for tcp sockets
Dmitry reported the following out-of-bound access:

Call Trace:
 [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40
mm/kasan/report.c:294
 [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880
 [<     inline     >] SYSC_setsockopt net/socket.c:1746
 [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729
 [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185

This is because we mistake a raw socket as a tcp socket.
We should check both sk->sk_type and sk->sk_protocol to ensure
it is a tcp socket.

Willem points out __skb_complete_tx_timestamp() needs to fix as well.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:46:32 -05:00
Eric Dumazet 07e100f984 tcp: restore fastopen with no data in SYN packet
Yuchung tracked a regression caused by commit 57be5bdad7 ("ip: convert
tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.

Some Fast Open users do not actually add any data in the SYN packet.

Fixes: 57be5bdad7 ("ip: convert tcp_sendmsg() to iov_iter primitives")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:37:39 -05:00
Rainer Weikusat 3822b5c2fc af_unix: Revert 'lock_interruptible' in stream receive code
With b3ca9b02b0, the AF_UNIX SOCK_STREAM
receive code was changed from using mutex_lock(&u->readlock) to
mutex_lock_interruptible(&u->readlock) to prevent signals from being
delayed for an indefinite time if a thread sleeping on the mutex
happened to be selected for handling the signal. But this was never a
problem with the stream receive code (as opposed to its datagram
counterpart) as that never went to sleep waiting for new messages with the
mutex held and thus, wouldn't cause secondary readers to block on the
mutex waiting for the sleeping primary reader. As the interruptible
locking makes the code more complicated in exchange for no benefit,
change it back to using mutex_lock.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:33:47 -05:00
Hannes Frederic Sowa 715f504b11 ipv6: add IPV6_HDRINCL option for raw sockets
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW.
The SOL_IP/IP_HDRINCL is not available for IPv6 sockets.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:12:28 -05:00
Xin Long 32bc201e19 ipv6: allow routes to be configured with expire values
Add the support for adding expire value to routes,  requested by
Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager
wants it too.

implement it by adding the new RTNETLINK attribute RTA_EXPIRES.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17 15:08:51 -05:00
Hannes Frederic Sowa 3036facbb7 fou: clean up socket with kfree_rcu
fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.

Fixes: 23461551c0 ("fou: Support for foo-over-udp RX path")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 19:03:02 -05:00
Hubert Sokolowski b3379041dd net: Pass ndm_state to route netlink FDB notifications.
Before this change applications monitoring FDB notifications
were not able to determine whether a new FDB entry is permament
or not:
bridge fdb add f1:f2:f3:f4:f5:f8 dev sw0p1 temp self
bridge fdb add f1:f2:f3:f4:f5:f9 dev sw0p1 self

bridge monitor fdb

f1:f2:f3:f4:f5:f8 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f9 dev sw0p1 self permanent

With this change ndm_state from the original netlink message
is passed to the new netlink message sent as notification.

bridge fdb add f1:f2:f3:f4:f5:f6 dev sw0p1 self
bridge fdb add f1:f2:f3:f4:f5:f7 dev sw0p1 temp self

bridge monitor fdb
f1:f2:f3:f4:f5:f6 dev sw0p1 self permanent
f1:f2:f3:f4:f5:f7 dev sw0p1 self static

Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 18:36:46 -05:00
David S. Miller 4d4f37910b Another set of fixes:
* memory leak fixes (from Ola)
  * operating mode notification spec compliance fix (from Eyal)
  * copy rfkill names in case pointer becomes invalid (myself)
  * two hardware restart fixes (myself)
  * get rid of "limiting TX power" log spam (myself)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJWcAbmAAoJEGt7eEactAAd9Z8P/igt1Xe7sFzRq5pi5+hXKMdp
 +jaDQsp0SSc2W53puXhMOqfC6zyD7zl41gRv/u7XCq/FHNInkmzDRz7LcXPQV1CR
 yUxxUTDBZ1nIk9a5uDI9nWuBDh6wlHG0FGl1Ud5bRHcZnPUntO2hk9863rWbTwbJ
 a+jgP41o41keQll1DogWQtzK7MyaH5h8CaLtKE9cklzZlKz6Arc5beCVRGBV/Iy6
 rfe8OzA/nJLjunAvnRt+XfQYxkSroffTvrqw4j2Eb4PrWr/eFMTGojzx1qaHIM2S
 vZrd3O95KDF0fapsQimJDlkiktHhC1Dyc4AP+pKVOimzFazV6IHw6dmHz1QqvQ+7
 fRR/5bvIUtiRgOwvWSxPzbrw8xEogaZP0O2lEsL5IeOeCgOl5SMdcYqqwZaBU3b6
 igAMIeMJ5fg7rhLEHregR32V7Ykk7x5cSXK4uXIq5FlNOzHKE5oY8J4PU9Uwn5/w
 v22ikTDBOBAelSYTlzNsWHggJ4yKtvGlO4vwpqcOlPG/uroPRbS/mYgjlrtLCBcX
 XWS6mksVNKrb5nmCEd0GxmOp7ZlMXt4Ut8hE8e0eL1jcghs1rJ59z2Xf9zZDfSDN
 hVmkzS2UPcQwnpwk6UtNsv3RyxgI57ekZx30wrz+U0gvcW25Nf67NnVCspApvy09
 IWu1FPVST14NX+ibsKsf
 =z3WR
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Another set of fixes:
 * memory leak fixes (from Ola)
 * operating mode notification spec compliance fix (from Eyal)
 * copy rfkill names in case pointer becomes invalid (myself)
 * two hardware restart fixes (myself)
 * get rid of "limiting TX power" log spam (myself)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 18:33:38 -05:00
David S. Miller 04ad37836c Included changes:
- change my email in MAINTAINERS and Doc files
 - create and export list of single hop neighs per interface
 - protect CRC in the BLA code by means of its own lock
 - minor fixes and code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWcRAcAAoJENpFlCjNi1MRxPQP/A8wAuHYUyMDu6ov5VRUR+24
 TYKgCVI1evuYlkOdgHkTQfTlOILgV2NHvORzKwQRNToyvtsd4VIeXZfFJ0UXB0Wo
 sMz9vrmmcndGkcYPWs+6IoVUbxMPZXjRGhme3Ig7nNTOQ4GvcSaQANbYWAMBfg61
 h6pxez67el9jzso2acHEgvOyaldOLDcbbdwt4vY7bPRgD9VXpE3nazBymaZXLLxs
 1ByXiCgJw32gXSWv8RQ/j13btjrbWCdmcEz9Ag1xO5i5u9BI1VJ8nmbLuNu63S0G
 Ftgvk5QBeMAxs3xDiTmtQD3bmiS87Jy+1d5rFb581/arM5SnYq6GZIrb81tdVTU6
 PMkMZ6/EUV2HSogq9PJ1ZDk/0oPYT5tqfLtJWcaAZplmWYt3ZUrQsPo4z5CtbBhU
 6Ar29G5slkgLslcBqn6YB00LxwOmj7elyVdPtL+wMCojtut+Ds4O+FzPdvd/145F
 hCtIe55b2ciBsfp1dDXP5P15HeEMjLiN2xWJKAPLhDCmGruJ6Pnly+ElzgV+zdKL
 7Qe1mOXwticMy3pH+ST7CP47tp5uyT0ak27eo+oOn4LI/ppM2qW1P9bNUZ6RrVwf
 dRRh0dzQruaLQauHK2Z6BWtA2Q36wY2anwmBONK34NH6VfRMB1GJMi3JF8gDgTds
 SWwU5FGWTn/cnCejtsbh
 =/Z+R
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
Included changes:
- change my email in MAINTAINERS and Doc files
- create and export list of single hop neighs per interface
- protect CRC in the BLA code by means of its own lock
- minor fixes and code cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 11:09:40 -05:00
Zhu Yanjun 566178f853 net: sctp: dynamically enable or disable pf state
As we all know, the value of pf_retrans >= max_retrans_path can
disable pf state. The variables of pf_retrans and max_retrans_path
can be changed by the userspace application.

Sometimes the user expects to disable pf state while the 2
variables are changed to enable pf state. So it is necessary to
introduce a new variable to disable pf state.

According to the suggestions from Vlad Yasevich, extra1 and extra2
are removed. The initialization of pf_enable is added.

Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-16 10:56:50 -05:00
Simon Wunderlich 5a1dd8a477 batman-adv: lock crc access in bridge loop avoidance
We have found some networks in which nodes were constantly requesting
other nodes BLA claim tables to synchronize, just to ask for that again
once completed. The reason was that the crc checksum of the asked nodes
were out of sync due to missing locking and multiple writes to the same
crc checksum when adding/removing entries. Therefore the asked nodes
constantly reported the wrong crc, which caused repeating requests.

To avoid multiple functions changing a backbone gateways crc entry at
the same time, lock it using a spinlock.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Tested-by: Alfons Name <AlfonsName@web.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 15:17:09 +08:00
Sven Eckelmann c05a57f6fb batman-adv: Fix typo 'wether' -> 'whether'
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 15:17:09 +08:00
Sven Eckelmann 01f6b5c76a batman-adv: Use chain pointer when purging fragments
The chain pointer was already created in batadv_frag_purge_orig to make the
checks more readable. Just use the chain pointer everywhere instead of
having the same dereference + array access in the most lines of this
function.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 15:17:09 +08:00
Simon Wunderlich ad7e2c466d batman-adv: unify flags access style in tt global add
This should slightly improve readability

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 15:17:08 +08:00
Simon Wunderlich c169c59dd5 batman-adv: detect local excess vlans in TT request
If the local representation of the global TT table of one originator has
more VLAN entries than the respective TT update, there is some
inconsistency present. By detecting and reporting this inconsistency,
the global table gets updated and the excess VLAN will get removed in
the process.

Reported-by: Alessandro Bolletta <alessandro@mediaspot.net>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 15:17:08 +08:00
Hannes Frederic Sowa 9b29c6962b ipv6: automatically enable stable privacy mode if stable_secret set
Bjørn reported that while we switch all interfaces to privacy stable mode
when setting the secret, we don't set this mode for new interfaces. This
does not make sense, so change this behaviour.

Fixes: 622c81d57b ("ipv6: generation of stable privacy addresses for link-local and autoconf")
Reported-by: Bjørn Mork <bjorn@mork.no>
Cc: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:37:32 -05:00
Eric Dumazet 6857a02af5 sctp: use GFP_KERNEL in sctp_init()
modules init functions being called from process context, we better
use GFP_KERNEL allocations to increase our chances to get these
high-order pages we want for SCTP hash tables.

This mostly matters if SCTP module is loaded once memory got fragmented.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:27:45 -05:00
Lorenzo Colitti c1e64e298b net: diag: Support destroying TCP sockets.
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:26:52 -05:00
Lorenzo Colitti 6eb5d2e08f net: diag: Support SOCK_DESTROY for inet sockets.
This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.

Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:26:51 -05:00
Lorenzo Colitti 64be0aed59 net: diag: Add the ability to destroy a socket.
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer.  It does not include any implementation code.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:26:51 -05:00
Lorenzo Colitti b613f56ec9 net: diag: split inet_diag_dump_one_icsk into two
Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:26:51 -05:00
Tom Herbert 7f00feaf10 ila: Add generic ILA translation facility
This patch implements an ILA tanslation table. This table can be
configured with identifier to locator mappings, and can be be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outoing), and matching locator.  The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).

The table may be used as alternative means to do do ILA tanslations
other than the lw tunnels

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:25:20 -05:00
Tom Herbert fc9e50f5a5 netlink: add a start callback for starting a netlink dump
The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:25:20 -05:00
Tom Herbert 33f11d1614 ila: Create net/ipv6/ila directory
Create ila directory in preparation for supporting other hooks in the
kernel than LWT for doing ILA. This includes:
  - Moving ila.c to ila/ila_lwt.c
  - Splitting out some common functions into ila_common.c

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 23:25:20 -05:00
Tom Herbert 6ae23ad362 net: Add driver helper functions to determine checksum offloadability
Add skb_csum_offload_chk driver helper function to determine if a
device with limited checksum offload capabilities is able to offload the
checksum for a given packet.

This patch includes:
  - The skb_csum_offload_chk function. Returns true if checksum is
    offloadable, else false. Optionally, in the case that the checksum
    is not offloable, the function can call skb_checksum_help to resolve
    the checksum. skb_csum_offload_chk also returns whether the checksum
    refers to an encapsulated checksum.
  - Definition of skb_csum_offl_spec structure that caller uses to
    indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
    whether encapsulated checksums can be offloaded, whether checksum with
    IPv6 extension headers can be offloaded).
  - Ancilary functions called skb_csum_offload_chk_help,
    skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:21 -05:00
Tom Herbert 9a49850d0a tcp: Fix conditions to determine checksum offload
In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:20 -05:00
Tom Herbert c8cd0989bd net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM
These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.

This patch also:
    - Cleans up can_checksum_protocol
    - Simplifies netdev_intersect_features

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:20 -05:00
Tom Herbert a188222b6e net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:08 -05:00
Tom Herbert 53692b1de4 sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC
The SCTP checksum is really a CRC and is very different from the
standards 1's complement checksum that serves as the checksum
for IP protocols. This offload interface is also very different.
Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
differences. The term CSUM should be reserved in the stack to refer
to the standard 1's complement IP checksum.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:49:58 -05:00
tadeusz.struk@intel.com 130ed5d105 net: fix uninitialized variable issue
msg_iocb needs to be initialized on the recv/recvfrom path.
Otherwise afalg will wrongly interpret it as an async call.

Cc: stable@vger.kernel.org
Reported-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 15:46:48 -05:00
David S. Miller 5233252fce bluetooth: Validate socket address length in sco_sock_bind().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 15:39:08 -05:00
Eric Dumazet 225734de70 net_sched: make qdisc_tree_decrease_qlen() work for non mq
Stas Nichiporovich reported a regression in his HFSC qdisc setup
on a non multi queue device.

It turns out I mistakenly added a TCQ_F_NOPARENT flag on all qdisc
allocated in qdisc_create() for non multi queue devices, which was
rather buggy. I was clearly mislead by the TCQ_F_ONETXQUEUE that is
also set here for no good reason, since it only matters for the root
qdisc.

Fixes: 4eaf3b84f2 ("net_sched: fix qdisc_tree_decrease_qlen() races")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Tested-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 13:41:52 -05:00
Ido Schimmel 6ff64f6f92 switchdev: Pass original device to port netdev driver
switchdev drivers need to know the netdev on which the switchdev op was
invoked. For example, the STP state of a VLAN interface configured on top
of a port can change while being member in a bridge. In this case, the
underlying driver should only change the STP state of that particular
VLAN and not of all the VLANs configured on the port.

However, current switchdev infrastructure only passes the port netdev down
to the driver. Solve that by passing the original device down to the
driver as part of the required switchdev object / attribute.

This doesn't entail any change in current switchdev drivers. It simply
enables those supporting stacked devices to know the originating device
and act accordingly.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 11:58:20 -05:00
Ido Schimmel 9d547833f0 switchdev: vlan: Use switchdev_port* in vlan_netdev_ops
We need to be able to propagate static FDB entries and certain bridge
port attributes (e.g. learning, flooding) down to the port netdev
driver when bridge port is a VLAN interface.

Achieve that by setting ndo_bridge* and ndo_fdb* in vlan_netdev_ops to
the corresponding switchdev_port* functions. This is consistent with
team and bond devices.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 11:58:20 -05:00
Simon Wunderlich 18165f6f65 batman-adv: rename equiv/equal or better to similar or better
Since the function applies a threshold and also slightly worse
values are accepted, ''equal or better'' does not represent the
intention of the function. ''Similar or better'' represents that better.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 00:21:42 +08:00
Marek Lindner 4ff1e2a738 batman-adv: update last seen field of single hop originators
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 00:21:42 +08:00
Marek Lindner 7587405ab9 batman-adv: export single hop neighbor list via debugfs
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 00:21:42 +08:00
Marek Lindner 8248a4c7c8 batman-adv: add bat_hardif_neigh_init algo ops call
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 00:21:41 +08:00
Marek Lindner cef63419f7 batman-adv: add list of unique single hop neighbors per hard-interface
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-12-16 00:21:41 +08:00
Florian Westphal 9c55d3b545 nfnetlink: add nfnl_dereference_protected helper
to avoid overly long line in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-15 15:14:04 +01:00
Eyal Shapira cf1e05c636 mac80211: handle width changes from opmode notification IE in beacon
An AP can send an operating channel width change in a beacon
opmode notification IE as long as there's a change in the nss as
well (See 802.11ac-2013 section 10.41).
So don't limit updating to nss only from an opmode notification IE.

Signed-off-by: Eyal Shapira <eyalx.shapira@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:16:47 +01:00
Johannes Berg a87da0cbc4 mac80211: suppress unchanged "limiting TX power" messages
When the AP is advertising limited TX power, the message can be
printed over and over again. Suppress it when the power level
isn't changing.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=106011

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:14:04 +01:00
Johannes Berg 1ea2c86480 mac80211: reprogram in interface order
During reprogramming, mac80211 currently first adds all the channel
contexts, then binds them to the vifs and then goes to reconfigure
all the interfaces. Drivers might, perhaps implicitly, rely on the
operation order for certain things that typically happen within a
single function elsewhere in mac80211. To avoid problems with that,
reorder the code in mac80211's restart/reprogramming to work fully
within the interface loop so that the order of operations is like
in normal operation.

For iwlwifi, this fixes a firmware crash when reprogramming with an
AP/GO interface active.

Reported-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:13:59 +01:00
Johannes Berg 74430f9489 mac80211: run scan completed work on reconfig failure
When reconfiguration during resume fails while a scan is pending
for completion work, that work will never run, and the scan will
be stuck forever. Factor out the code to recover this and call it
also in ieee80211_handle_reconfig_failure().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:12:50 +01:00
Ola Olsson 707554b4d1 nl80211: Fix potential memory leak in nl80211_connect
Free cached keys if the last early return path is taken.

Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:11:26 +01:00
Ola Olsson e5dbe0701a nl80211: Fix potential memory leak in nl80211_set_wowlan
Compared to cfg80211_rdev_free_wowlan in core.h,
the error goto label lacks the freeing of nd_config.
Fix that.

Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:10:05 +01:00
Ola Olsson 09d118008f nl80211: fix a few memory leaks in reg.c
The first leak occurs when entering the default case
in the switch for the initiator in set_regdom.
The second leaks a platform_device struct if the
platform registration in regulatory_init succeeds but
the sub sequent regulatory hint fails due to no memory.

Signed-off-by: Ola Olsson <ola.olsson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-12-15 13:08:02 +01:00
Vlad Yasevich f654861569 skbuff: Fix offset error in skb_reorder_vlan_header
skb_reorder_vlan_header is called after the vlan header has
been pulled.  As a result the offset of the begining of
the mac header has been incrased by 4 bytes (VLAN_HLEN).
When moving the mac addresses, include this incrase in
the offset calcualation so that the mac addresses are
copied correctly.

Fixes: a6e18ff111 (vlan: Fix untag operations of stacked vlans with REORDER_HEADER off)
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vladislav Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 00:30:41 -05:00
Eric Dumazet 5037e9ef94 net: fix IP early demux races
David Wilder reported crashes caused by dst reuse.

<quote David>
  I am seeing a crash on a distro V4.2.3 kernel caused by a double
  release of a dst_entry.  In ipv4_dst_destroy() the call to
  list_empty() finds a poisoned next pointer, indicating the dst_entry
  has already been removed from the list and freed. The crash occurs
  18 to 24 hours into a run of a network stress exerciser.
</quote>

Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.

This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.

It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels

Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 23:52:00 -05:00