1
0
Fork 0
Commit Graph

3745 Commits (aa816c1b390aacb698dd6faf5a8cbffb5123c03a)

Author SHA1 Message Date
David S. Miller 7b46a644a4 netlink: Fix bugs in nlmsg_end() conversions.
Commit 053c095a82 ("netlink: make nlmsg_end() and genlmsg_end()
void") didn't catch all of the cases where callers were breaking out
on the return value being equal to zero, which they no longer should
when zero means success.

Fix all such cases.

Reported-by: Marcel Holtmann <marcel@holtmann.org>
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 23:36:08 -05:00
Johannes Berg 053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
Roopa Prabhu 02dba4388d bridge: fix setlink/dellink notifications
problems with bridge getlink/setlink notifications today:
        - bridge setlink generates two notifications to userspace
                - one from the bridge driver
                - one from rtnetlink.c (rtnl_bridge_notify)
        - dellink generates one notification from rtnetlink.c. Which
	means bridge setlink and dellink notifications are not
	consistent

        - Looking at the code it appears,
	If both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF were set,
        the size calculation in rtnl_bridge_notify can be wrong.
        Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
        in a setlink request to rocker dev, rtnl_bridge_notify will
	allocate skb for one set of bridge attributes, but,
	both the bridge driver and rocker dev will try to add
	attributes resulting in twice the number of attributes
	being added to the skb.  (rocker dev calls ndo_dflt_bridge_getlink)

There are multiple options:
1) Generate one notification including all attributes from master and self:
   But, I don't think it will work, because both master and self may use
   the same attributes/policy. Cannot pack the same set of attributes in a
   single notification from both master and slave (duplicate attributes).

2) Generate one notification from master and the other notification from
   self (This seems to be ideal):
     For master: the master driver will send notification (bridge in this
	example)
     For self: the self driver will send notification (rocker in the above
	example. It can use helpers from rtnetlink.c to do so. Like the
	ndo_dflt_bridge_getlink api).

This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').

v1->v2 :
	- rtnl_bridge_notify is now called only for self,
	so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
	- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
	earlier. So, I have changed the notification from br_dellink to
	go as RTM_NEWLINK

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-17 23:49:51 -05:00
Eric Dumazet ac64da0b83 net: rps: fix cpu unplug
softnet_data.input_pkt_queue is protected by a spinlock that
we must hold when transferring packets from victim queue to an active
one. This is because other cpus could still be trying to enqueue packets
into victim queue.

A second problem is that when we transfert the NAPI poll_list from
victim to current cpu, we absolutely need to special case the percpu
backlog, because we do not want to add complex locking to protect
process_queue : Only owner cpu is allowed to manipulate it, unless cpu
is offline.

Based on initial patch from Prasad Sodagudi & Subash Abhinov
Kasiviswanathan.

This version is better because we do not slow down packet processing,
only make migration safer.

Reported-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16 01:02:42 -05:00
David S. Miller 3f3558bb51 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/xen-netfront.c

Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 00:53:17 -05:00
Jean-Francois Remy 4bf6980dd0 neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
When setting base_reachable_time or base_reachable_time_ms on a
specific interface through sysctl or netlink, the reachable_time
value is not updated.

This means that neighbour entries will continue to be updated using the
old value until it is recomputed in neigh_period_work (which
    recomputes the value every 300*HZ).
On systems with HZ equal to 1000 for instance, it means 5mins before
the change is effective.

This patch changes this behavior by recomputing reachable_time after
each set on base_reachable_time or base_reachable_time_ms.
The new value will become effective the next time the neighbour's timer
is triggered.

Changes are made in two places: the netlink code for set and the sysctl
handling code. For sysctl, I use a proc_handler. The ipv6 network
code does provide its own handler but it already refreshes
reachable_time correctly so it's not an issue.
Any other user of neighbour which provide its own handlers must
refresh reachable_time.

Signed-off-by: Jean-Francois Remy <jeff@melix.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14 00:28:00 -05:00
Jiri Pirko df8a39defa net: rename vlan_tx_* helpers since "tx" is misleading there
The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 17:51:08 -05:00
Pankaj Gupta 1059590254 net: allow large number of rx queues
netif_alloc_rx_queues() uses kcalloc() to allocate memory
for "struct netdev_queue *_rx" array.
If we are doing large rx queue allocation kcalloc() might
fail, so this patch does a fallback to vzalloc().
Similar implementation is done for tx queue allocation in
netif_alloc_netdev_queues().

We avoid failure of high order memory allocation
with the help of vzalloc(), this allows us to do large
rx and tx queue allocation which in turn helps us to
increase the number of queues in tun.

As vmalloc() adds overhead on a critical network path,
__GFP_REPEAT flag is used with kzalloc() to do this fallback
only when really needed.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <dgibson@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12 17:05:05 -05:00
Ed Swierk 2f4383667d ethtool: Extend ethtool plugin module eeprom API to phylib
This patch extends the ethtool plugin module eeprom API to support cards
whose phy support is delegated to a separate driver.

The handlers for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM call the
module_info and module_eeprom functions if the phy driver provides them;
otherwise the handlers call the equivalent ethtool_ops functions provided
by network drivers with built-in phy support.

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-06 17:16:55 -05:00
Daniel Borkmann ea69763999 net: tcp: add RTAX_CC_ALGO fib handling
This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided to expose this implementation detail to user space, thus
instead, we chose the netlink attribute that is being exchanged between
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead, we first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than worth it.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05 22:55:24 -05:00
Hubert Sokolowski 6cb69742da net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined
Add checking whether the call to ndo_dflt_fdb_dump is needed.
It is not expected to call ndo_dflt_fdb_dump unconditionally
by some drivers (i.e. qlcnic or macvlan) that defines
own ndo_fdb_dump. Other drivers define own ndo_fdb_dump
and don't want ndo_dflt_fdb_dump to be called at all.
At the same time it is desirable to call the default dump
function on a bridge device.
Fix attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
Add extra checking in br_fdb_dump to avoid duplicate entries
as now filter_dev can be NULL.

Following tests for filtering have been performed before
the change and after the patch was applied to make sure
they are the same and it doesn't break the filtering algorithm.

[root@localhost ~]# cd /root/iproute2-3.18.0/bridge
[root@localhost bridge]# modprobe dummy
[root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
[root@localhost bridge]# brctl addbr br0
[root@localhost bridge]# brctl addif  br0 dummy0
[root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
[root@localhost bridge]# # show all
[root@localhost bridge]# ./bridge fdb show
33:33:00:00:00:01 dev p2p1 self permanent
01:00:5e:00:00:01 dev p2p1 self permanent
33:33:ff:ac:ce:32 dev p2p1 self permanent
33:33:00:00:02:02 dev p2p1 self permanent
01:00:5e:00:00:fb dev p2p1 self permanent
33:33:00:00:00:01 dev p7p1 self permanent
01:00:5e:00:00:01 dev p7p1 self permanent
33:33:ff:79:50:53 dev p7p1 self permanent
33:33:00:00:02:02 dev p7p1 self permanent
01:00:5e:00:00:fb dev p7p1 self permanent
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by bridge
[root@localhost bridge]# ./bridge fdb show br br0
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by port
[root@localhost bridge]# ./bridge fdb show brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]# # filter by port + bridge
[root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]#

Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-05 22:52:06 -05:00
Florian Westphal e8768f9715 net: skbuff: don't zero tc members when freeing skb
Not needed, only four cases:
 - kfree_skb (or one of its aliases).
   Don't need to zero, memory will be freed.
 - kfree_skb_partial and head was stolen:  memory will be freed.
 - skb_morph:  The skb header fields (including tc ones) will be
   copied over from the 'to-be-morphed' skb right after
   skb_release_head_state returns.
 - skb_segment:  Same as before, all the skb header
   fields are copied over from the original skb right away.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-02 16:04:29 -05:00
Jesse Gross 5f35227ea3 net: Generalize ndo_gso_check to ndo_features_check
GSO isn't the only offload feature with restrictions that
potentially can't be expressed with the current features mechanism.
Checksum is another although it's a general issue that could in
theory apply to anything. Even if it may be possible to
implement these restrictions in other ways, it can result in
duplicate code or inefficient per-packet behavior.

This generalizes ndo_gso_check so that drivers can remove any
features that don't make sense for a given packet, similar to
netif_skb_features(). It also converts existing driver
restrictions to the new format, completing the work that was
done to support tunnel protocols since the issues apply to
checksums as well.

By actually removing features from the set that are used to do
offloading, it solves another problem with the existing
interface. In these cases, GSO would run with the original set
of features and not do anything because it appears that
segmentation is not required.

CC: Tom Herbert <therbert@google.com>
CC: Joe Stringer <joestringer@nicira.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by:  Tom Herbert <therbert@google.com>
Fixes: 04ffcb255f ("net: Add ndo_gso_check")
Tested-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-26 17:20:56 -05:00
Jay Vosburgh 2c26d34bbc net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:

[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
[ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
[ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360

	These are reliably reproduced in a network topology of:

container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

	When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet.  Note that the warning is
associated with the container eth0.

        The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.

	The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.

	This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.

Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-26 16:16:51 -05:00
Thomas Graf b8fb4e0648 net: Reset secmark when scrubbing packet
skb_scrub_packet() is called when a packet switches between a context
such as between underlay and overlay, between namespaces, or between
L3 subnets.

While we already scrub the packet mark, connection tracking entry,
and cached destination, the security mark/context is left intact.

It seems wrong to inherit the security context of a packet when going
from overlay to underlay or across forwarding paths.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-24 00:21:43 -05:00
Toshiaki Makita 796f2da81b net: Fix stacked vlan offload features computation
When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.

Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-24 00:08:33 -05:00
Pravin B Shelar d0edc7bf39 mpls: Fix config check for mpls.
Fixes MPLS GSO for case when mpls is compiled as kernel module.

Fixes: 0d89d2035f ("MPLS: Add limited GSO support").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:57:30 -05:00
Herbert Xu ceb8d5bf17 net: Rearrange loop in net_rx_action
This patch rearranges the loop in net_rx_action to reduce the
amount of jumping back and forth when reading the code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00
Herbert Xu 6bd373ebba net: Always poll at least one device in net_rx_action
We should only perform the softnet_break check after we have polled
at least one device in net_rx_action.  Otherwise a zero or negative
setting of netdev_budget can lock up the whole system.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00
Herbert Xu 001ce546bb net: Detect drivers that reschedule NAPI and exhaust budget
The commit d75b1ade56 (net: less
interrupt masking in NAPI) required drivers to leave poll_list
empty if the entire budget is consumed.

We have already had two broken drivers so let's add a check for
this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00
Herbert Xu 726ce70e9e net: Move napi polling code out of net_rx_action
This patch creates a new function napi_poll and moves the napi
polling code from net_rx_action into it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00
Jason Wang af6dabc9c7 net: drop the packet when fails to do software segmentation or header check
Commit cecda693a9 ("net: keep original skb
which only needs header checking during software GSO") keeps the original
skb for packets that only needs header check, but it doesn't drop the
packet if software segmentation or header check were failed.

Fixes cecda693a9 ("net: keep original skb which only needs header checking during software GSO")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:12:11 -05:00
Linus Torvalds 00c845dbfe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix NBMA tunnel mac header handling in GRE, from Timo Teräs.

 2) Fix a NAPI race in the fec driver, from Nimrod Andy.

 3) The new IFF_VNET_LE bit is outside the size of the flags member it
    is stored in (which is 16-bits), store the state locally in the
    drivers.  From Michael S Tsirkin.

 4) We are kicking the tires with the new wireless maintainership
    situation.  Bluetooth fixes via Johan Hedberg, and mac80211 fixes
    from Johannes Berg.

 5) Fix locking and leaks in geneve driver, from Jesse Gross.

 6) Make netlink TX mmap code always copy, so we don't have to be
    potentially exposed to the user changing the underlying contents
    from underneath us.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
  be2net: Fix incorrect setting of tunnel offload flag in netdev features
  bnx2x: fix typos in "configure"
  xen-netback: support frontends without feature-rx-notify again
  MAINTAINERS: changes for wireless
  cxgb4: Fix decoding QSA module for ethtool get settings
  geneve: Fix races between socket add and release.
  geneve: Remove socket and offload handlers at destruction.
  netlink: Don't reorder loads/stores before marking mmap netlink frame as available
  netlink: Always copy on mmap TX.
  Bluetooth: Fix bug with filter in service discovery optimization
  mac80211: free management frame keys when removing station
  net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
  net/mlx4: Cache line CQE/EQE stride fixes
  net: fec: Fix NAPI race
  xen-netfront: use napi_complete() correctly to prevent Rx stalling
  ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup()
  ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops()
  net: Allow FIXED_PHY to be modular.
  if_tun: drop broken IFF_VNET_LE
  macvtap: drop broken IFF_VNET_LE
  ...
2014-12-18 16:41:13 -08:00
Linus Torvalds 603ba7e41b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default ->i_fop have ->open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &...->ns
  netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
  make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
  common object embedded into various struct ....ns
2014-12-16 15:53:03 -08:00
Or Gerlitz 65891feac2 net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
The current implementations all use dev_uc_add_excl() and such whose API
doesn't support vlans, so we can't make it with NICs HW for now.

Fixes: f6f6424ba7 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-16 15:41:19 -05:00
Linus Torvalds e3aa91a7cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 - The crypto API is now documented :)
 - Disallow arbitrary module loading through crypto API.
 - Allow get request with empty driver name through crypto_user.
 - Allow speed testing of arbitrary hash functions.
 - Add caam support for ctr(aes), gcm(aes) and their derivatives.
 - nx now supports concurrent hashing properly.
 - Add sahara support for SHA1/256.
 - Add ARM64 version of CRC32.
 - Misc fixes.

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
  crypto: tcrypt - Allow speed testing of arbitrary hash functions
  crypto: af_alg - add user space interface for AEAD
  crypto: qat - fix problem with coalescing enable logic
  crypto: sahara - add support for SHA1/256
  crypto: sahara - replace tasklets with kthread
  crypto: sahara - add support for i.MX53
  crypto: sahara - fix spinlock initialization
  crypto: arm - replace memset by memzero_explicit
  crypto: powerpc - replace memset by memzero_explicit
  crypto: sha - replace memset by memzero_explicit
  crypto: sparc - replace memset by memzero_explicit
  crypto: algif_skcipher - initialize upon init request
  crypto: algif_skcipher - removed unneeded code
  crypto: algif_skcipher - Fixed blocking recvmsg
  crypto: drbg - use memzero_explicit() for clearing sensitive data
  crypto: drbg - use MODULE_ALIAS_CRYPTO
  crypto: include crypto- module prefix in template
  crypto: user - add MODULE_ALIAS
  crypto: sha-mb - remove a bogus NULL check
  crytpo: qat - Fix 64 bytes requests
  ...
2014-12-13 13:33:26 -08:00
Linus Torvalds 70e71ca0af Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

 2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers.  Thanks to Al Viro
    and Herbert Xu.

 3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

 4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

 5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

 6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

 7) Add a variable of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

 8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

 9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets.  From Alexei
    Starovoitov.

10) Support TSO/LSO in sunvnet driver, from David L Stevens.

11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

12) Remote checksum offload, from Tom Herbert.

13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

14) Add MPLS support to openvswitch, from Simon Horman.

15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet.  This tries to resolve the conflicting goals between the
    desired handling of bulk vs.  RPC-like traffic.

17) Allow userspace to ask for the CPU upon what a packet was
    received/steered, via SO_INCOMING_CPU.  From Eric Dumazet.

18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

21) Add VLAN packet scheduler action, from Jiri Pirko.

22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
  Fix race condition between vxlan_sock_add and vxlan_sock_release
  net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
  net/mlx4: Add support for A0 steering
  net/mlx4: Refactor QUERY_PORT
  net/mlx4_core: Add explicit error message when rule doesn't meet configuration
  net/mlx4: Add A0 hybrid steering
  net/mlx4: Add mlx4_bitmap zone allocator
  net/mlx4: Add a check if there are too many reserved QPs
  net/mlx4: Change QP allocation scheme
  net/mlx4_core: Use tasklet for user-space CQ completion events
  net/mlx4_core: Mask out host side virtualization features for guests
  net/mlx4_en: Set csum level for encapsulated packets
  be2net: Export tunnel offloads only when a VxLAN tunnel is created
  gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
  cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
  net: fec: only enable mdio interrupt before phy device link up
  net: fec: clear all interrupt events to support i.MX6SX
  net: fec: reset fep link status in suspend function
  net: sock: fix access via invalid file descriptor
  net: introduce helper macro for_each_cmsghdr
  ...
2014-12-11 14:27:06 -08:00
Alexei Starovoitov 198bf1b046 net: sock: fix access via invalid file descriptor
0day robot reported the following crash:
[   21.233581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007
[   21.234709] IP: [<ffffffff8156ebda>] sk_attach_bpf+0x39/0xc2

It's due to bpf_prog_get() returning ERR_PTR.
Check it properly.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 89aa075832 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 23:34:27 -05:00
Gu Zheng f95b414edb net: introduce helper macro for_each_cmsghdr
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating
cmsghdr from msghdr, just cleanup.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 22:41:55 -05:00
Al Viro 707c5960f1 Merge branch 'nsfs' into for-next 2014-12-10 21:31:59 -05:00
David S. Miller 22f10923dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/amd/xgbe/xgbe-desc.c
	drivers/net/ethernet/renesas/sh_eth.c

Overlapping changes in both conflict cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 15:48:20 -05:00
Alexander Duyck fd11a83dd3 net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb
This change pulls the core functionality out of __netdev_alloc_skb and
places them in a new function named __alloc_rx_skb.  The reason for doing
this is to make these bits accessible to a new function __napi_alloc_skb.
In addition __alloc_rx_skb now has a new flags value that is used to
determine which page frag pool to allocate from.  If the SKB_ALLOC_NAPI
flag is set then the NAPI pool is used.  The advantage of this is that we
do not have to use local_irq_save/restore when accessing the NAPI pool from
NAPI context.

In my test setup I saw at least 11ns of savings using the napi_alloc_skb
function versus the netdev_alloc_skb function, most of this being due to
the fact that we didn't have to call local_irq_save/restore.

The main use case for napi_alloc_skb would be for things such as copybreak
or page fragment based receive paths where an skb is allocated after the
data has been received instead of before.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 13:31:57 -05:00
Alexander Duyck ffde7328a3 net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache.  By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context.  The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.

I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag.  First I updated the allocation to do either a 32K
allocation or an order 0 page.  This is based on the changes in commmit
d9b2938aa where it was found that latencies could be reduced in case of
failures.  Then I also rewrote the logic to work from the end of the page to
the start.  By doing this the size value doesn't have to be used unless we
have run out of space for page fragments.  Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set.  This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 13:31:57 -05:00
David S. Miller 6e5f59aacb Merge branch 'for-davem-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
More iov_iter work for the networking from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 13:17:23 -05:00
Linus Torvalds 86c6a2fddf Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

     This instruments might_sleep() checks to catch places that nest
     blocking primitives - such as mutex usage in a wait loop.  Such
     bugs can result in hard to debug races/hangs.

     Another category of invalid nesting that this facility will detect
     is the calling of blocking functions from within schedule() ->
     sched_submit_work() -> blk_schedule_flush_plug().

     There's some potential for false positives (if secondary blocking
     primitives themselves are not ready yet for this facility), but the
     kernel will warn once about such bugs per bootup, so the warning
     isn't much of a nuisance.

     This feature comes with a number of fixes, for problems uncovered
     with it, so no messages are expected normally.

   - Another round of sched/numa optimizations and refinements, for
     CONFIG_NUMA_BALANCING=y.

   - Another round of sched/dl fixes and refinements.

  Plus various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched: Add missing rcu protection to wake_up_all_idle_cpus
  sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
  sched/numa: Init numa balancing fields of init_task
  sched/deadline: Remove unnecessary definitions in cpudeadline.h
  sched/cpupri: Remove unnecessary definitions in cpupri.h
  sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
  sched/fair: Fix stale overloaded status in the busiest group finding logic
  sched: Move p->nr_cpus_allowed check to select_task_rq()
  sched/completion: Document when to use wait_for_completion_io_*()
  sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
  sched/fair: Kill task_struct::numa_entry and numa_group::task_list
  sched: Refactor task_struct to use numa_faults instead of numa_* pointers
  sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
  sched/deadline: Reschedule from switched_from_dl() after a successful pull
  sched/deadline: Push task away if the deadline is equal to curr during wakeup
  sched/deadline: Add deadline rq status print
  sched/deadline: Fix artificial overrun introduced by yield_task_dl()
  sched/rt: Clean up check_preempt_equal_prio()
  sched/core: Use dl_bw_of() under rcu_read_lock_sched()
  sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
  ...
2014-12-09 21:21:34 -08:00
Roopa Prabhu 1d460b988d rocker: remove swdev mode
Remove use of 'swdev' mode in rocker. rocker dev offloads
can use the BRIDGE_FLAGS_SELF to indicate offload to hardware.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 18:24:47 -05:00
Li RongQing e008f3f07f net: avoid to call skb_queue_len again
the queue length of sd->input_pkt_queue has been put into qlen,
and impossible to change, since hold the lock

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 17:03:19 -05:00
Al Viro d3a9632f09 skb_copy_datagram_iovec() can die
no callers other than itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:11 -05:00
Al Viro e5a4b0bb80 switch memcpy_to_msg() and skb_copy{,_and_csum}_datagram_msg() to primitives
... making both non-draining.  That means that tcp_recvmsg() becomes
non-draining.  And _that_ would break iscsit_do_rx_data() unless we
	a) make sure tcp_recvmsg() is uniformly non-draining (it is)
	b) make sure it copes with arbitrary (including shifted)
iov_iter (it does, all it uses is iov_iter primitives)
	c) make iscsit_do_rx_data() initialize ->msg_iter only once.

Fortunately, (c) is doable with minimal work and we are rid of one
the two places where kernel send/recvmsg users would be unhappy with
non-draining behaviour.

Actually, that makes all but one of ->recvmsg() instances iov_iter-clean.
The exception is skcipher_recvmsg() and it also isn't hard to convert
to primitives (iov_iter_get_pages() is needed there).  That'll wait
a bit - there's some interplay with ->sendmsg() path for that one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:10 -05:00
Hannes Frederic Sowa dbfc4fb7d5 dst: no need to take reference on DST_NOCACHE dsts
Since commit f886497212 ("ipv4: fix dst race in sk_dst_get()")
DST_NOCACHE dst_entries get freed by RCU. So there is no need to get a
reference on them when we are in rcu protected sections.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 16:08:17 -05:00
Eric Dumazet 6ffe75eb53 net: avoid two atomic operations in fast clones
Commit ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
took the wrong way to save one atomic operation.

It is actually possible to avoid two atomic operations, if we
do not change skb->fclone values, and only rely on clone_ref
content to signal if the clone is available or not.

skb_clone() can simply use the fast clone if clone_ref is 1.

kfree_skbmem() can avoid the atomic_dec_and_test() if clone_ref is 1.

Note that because we usually free the clone before the original skb,
this particular attempt is only done for the original skb to have better
branch prediction.

SKB_FCLONE_FREE is removed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 13:40:20 -05:00
Mahesh Bandewar 395eea6ccf rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()
The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
until after ndo_uninit") tried to do this ealier but while doing so
it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
delayed call to fill_info(). So this translated into asking driver
to remove private state and then query it's private state. This
could have catastropic consequences.

This change breaks the rtmsg_ifinfo() into two parts - one takes the
precise snapshot of the device by called fill_info() before calling
the ndo_uninit() and the second part sends the notification using
collected snapshot.

It was brought to notice when last link is deleted from an ipvlan device
when it has free-ed the port and the subsequent .fill_info() call is
trying to get the info from the port.

kernel: [  255.139429] ------------[ cut here ]------------
kernel: [  255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
kernel: [  255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
kernel: [  255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
kernel: [  255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
kernel: [  255.139515]  0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
kernel: [  255.139516]  0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
kernel: [  255.139518]  00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
kernel: [  255.139520] Call Trace:
kernel: [  255.139527]  [<ffffffff815d87f4>] dump_stack+0x46/0x58
kernel: [  255.139531]  [<ffffffff8109c29c>] warn_slowpath_common+0x8c/0xc0
kernel: [  255.139540]  [<ffffffff8109c2ea>] warn_slowpath_null+0x1a/0x20
kernel: [  255.139544]  [<ffffffff8150d570>] rtmsg_ifinfo+0x100/0x110
kernel: [  255.139547]  [<ffffffff814f78b5>] rollback_registered_many+0x1d5/0x2d0
kernel: [  255.139549]  [<ffffffff814f79cf>] unregister_netdevice_many+0x1f/0xb0
kernel: [  255.139551]  [<ffffffff8150acab>] rtnl_dellink+0xbb/0x110
kernel: [  255.139553]  [<ffffffff8150da90>] rtnetlink_rcv_msg+0xa0/0x240
kernel: [  255.139557]  [<ffffffff81329283>] ? rhashtable_lookup_compare+0x43/0x80
kernel: [  255.139558]  [<ffffffff8150d9f0>] ? __rtnl_unlock+0x20/0x20
kernel: [  255.139562]  [<ffffffff8152cb11>] netlink_rcv_skb+0xb1/0xc0
kernel: [  255.139563]  [<ffffffff8150a495>] rtnetlink_rcv+0x25/0x40
kernel: [  255.139565]  [<ffffffff8152c398>] netlink_unicast+0x178/0x230
kernel: [  255.139567]  [<ffffffff8152c75f>] netlink_sendmsg+0x30f/0x420
kernel: [  255.139571]  [<ffffffff814e0b0c>] sock_sendmsg+0x9c/0xd0
kernel: [  255.139575]  [<ffffffff811d1d7f>] ? rw_copy_check_uvector+0x6f/0x130
kernel: [  255.139577]  [<ffffffff814e11c9>] ? copy_msghdr_from_user+0x139/0x1b0
kernel: [  255.139578]  [<ffffffff814e1774>] ___sys_sendmsg+0x304/0x310
kernel: [  255.139581]  [<ffffffff81198723>] ? handle_mm_fault+0xca3/0xde0
kernel: [  255.139585]  [<ffffffff811ebc4c>] ? destroy_inode+0x3c/0x70
kernel: [  255.139589]  [<ffffffff8108e6ec>] ? __do_page_fault+0x20c/0x500
kernel: [  255.139597]  [<ffffffff811e8336>] ? dput+0xb6/0x190
kernel: [  255.139606]  [<ffffffff811f05f6>] ? mntput+0x26/0x40
kernel: [  255.139611]  [<ffffffff811d2b94>] ? __fput+0x174/0x1e0
kernel: [  255.139613]  [<ffffffff814e2129>] __sys_sendmsg+0x49/0x90
kernel: [  255.139615]  [<ffffffff814e2182>] SyS_sendmsg+0x12/0x20
kernel: [  255.139617]  [<ffffffff815df092>] system_call_fastpath+0x12/0x17
kernel: [  255.139619] ---[ end trace 5e6703e87d984f6b ]---

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 13:36:57 -05:00
Eyal Perry 892311f66f ethtool: Support for configurable RSS hash function
This patch extends the set/get_rxfh ethtool-options for getting or
setting the RSS hash function.

It modifies drivers implementation of set/get_rxfh accordingly.

This change also delegates the responsibility of checking whether a
modification to a certain RX flow hash parameter is supported to the
driver implementation of set_rxfh.

User-kernel API is done through the new hfunc bitmask field in the
ethtool_rxfh struct. A bit set in the hfunc field is corresponding to an
index in the new string-set ETH_SS_RSS_HASH_FUNCS.

Got approval from most of the relevant driver maintainers that their
driver is using Toeplitz, and for the few that didn't answered, also
assumed it is Toeplitz.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Cc: Prashant Sreedharan <prashant@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: Shradha Shah <sshah@solarflare.com>
Cc: Shreyas Bhatewara <sbhatewara@vmware.com>
Cc: "VMware, Inc." <pv-drivers@vmware.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-08 21:07:10 -05:00
Alexei Starovoitov 89aa075832 net: sock: allow eBPF programs to be attached to sockets
introduce new setsockopt() command:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to multiple sockets.

User task exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-05 21:47:32 -08:00
Al Viro f77c80142e bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.

That way struct proc_ns becomes invisible outside of fs/proc/*.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:54 -05:00
Al Viro 33c429405a copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:47 -05:00
Al Viro 6344c433a4 new helpers: ns_alloc_inum/ns_free_inum
take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:36 -05:00
Al Viro 64964528b2 make proc_ns_operations work with struct ns_common * instead of void *
We can do that now.  And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:17 -05:00
Al Viro ff24870f46 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:04 -05:00
Al Viro 435d5f4bb2 common object embedded into various struct ....ns
for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:31:00 -05:00
Scott Feldman 2c3c031c8f bridge: add brport flags to dflt bridge_getlink
To allow brport device to return current brport flags set on port.  Add
returned flags to nested IFLA_PROTINFO netlink msg built in dflt getlink.
With this change, netlink msg returned for bridge_getlink contains the port's
offloaded flag settings (the port's SELF settings).

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:24 -08:00
Jiri Pirko aecbe01e74 net-sysfs: expose physical switch id for particular device
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:22 -08:00
Jiri Pirko 82f2841291 rtnl: expose physical switch id for particular device
The netdevice represents a port in a switch, it will expose
IFLA_PHYS_SWITCH_ID value via rtnl. Two netdevices with the same value
belong to one physical switch.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:21 -08:00
Jiri Pirko 02637fce3e net: rename netdev_phys_port_id to more generic name
So this can be reused for identification of other "items" as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:19 -08:00
Jiri Pirko f6f6424ba7 net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del
Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:18 -08:00
Nicolas Dichtel e0ebde0e13 rtnetlink: release net refcnt on error in do_setlink()
rtnl_link_get_net() holds a reference on the 'struct net', we need to release
it in case of error.

CC: Eric W. Biederman <ebiederm@xmission.com>
Fixes: b51642f6d7 ("net: Enable a userns root rtnl calls that are safe for unprivilged users")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-29 21:05:43 -08:00
David S. Miller 60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Thomas Graf aa68c20ff3 bridge: Sanitize IFLA_EXT_MASK for AF_BRIDGE:RTM_GETLINK
Only search for IFLA_EXT_MASK if the message actually carries a
ifinfomsg header and validate minimal length requirements for
IFLA_EXT_MASK.

Fixes: 6cbdceeb ("bridge: Dump vlan information from a bridge port")
Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-26 15:29:01 -05:00
Thomas Graf 6e8d1c5545 bridge: Validate IFLA_BRIDGE_FLAGS attribute length
Payload is currently accessed blindly and may exceed valid message
boundaries.

Fixes: 407af3299 ("bridge: Add netlink interface to configure vlans on bridge ports")
Cc: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-26 15:29:00 -05:00
Daniel Borkmann 79e886599e crypto: algif - add and use sock_kzfree_s() instead of memzero_explicit()
Commit e1bd95bf7c ("crypto: algif - zeroize IV buffer") and
2a6af25bef ("crypto: algif - zeroize message digest buffer")
added memzero_explicit() calls on buffers that are later on
passed back to sock_kfree_s().

This is a discussed follow-up that, instead, extends the sock
API and adds sock_kzfree_s(), which internally uses kzfree()
instead of kfree() for passing the buffers back to slab.

Having sock_kzfree_s() allows to keep the changes more minimal
by just having a drop-in replacement instead of adding
memzero_explicit() calls everywhere before sock_kfree_s().

In kzfree(), the compiler is not allowed to optimize the memset()
away and thus there's no need for memzero_explicit(). Both,
sock_kfree_s() and sock_kzfree_s() are wrappers for
__sock_kfree_s() and call into kfree() resp. kzfree(); here,
__sock_kfree_s() needs to be explicitly inlined as we want the
compiler to optimize the call and condition away and thus it
produces e.g. on x86_64 the _same_ assembler output for
sock_kfree_s() before and after, and thus also allows for
avoiding code duplication.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-11-25 22:50:39 +08:00
Al Viro 8feb2fb2bb switch AF_PACKET and AF_UNIX to skb_copy_datagram_from_iter()
... and kill skb_copy_datagram_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 05:16:39 -05:00
Al Viro 195e952d03 kill zerocopy_sg_from_iovec()
no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 05:16:39 -05:00
Al Viro 3a654f975b new helpers: skb_copy_datagram_from_iter() and zerocopy_sg_from_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 05:03:08 -05:00
David S. Miller 1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Eric Dumazet e7820e39b7 net: Revert "net: avoid one atomic operation in skb_clone()"
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.

By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released last reference and freed whole skb.

We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.

Reported-by: Chris Mason <clm@fb.com>
Fixes: ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 15:26:32 -05:00
Jiri Pirko 93515d53b1 net: move vlan pop/push functions into common code
So it can be used from out of openvswitch code.
Did couple of cosmetic changes on the way, namely variable naming and
adding support for 8021AD proto.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:18 -05:00
Jiri Pirko e21951212f net: move make_writable helper into common code
note that skb_make_writable already exists in net/netfilter/core.c
but does something slightly different.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko 5968250c86 vlan: introduce *vlan_hwaccel_push_inside helpers
Use them to push skb->vlan_tci into the payload and avoid code
duplication.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko 62749e2cb3 vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Al Viro 08adb7dabd fold verify_iovec() into copy_msghdr_from_user()
... and do the same on the compat side of things.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:49 -05:00
Al Viro 0844932009 {compat_,}verify_iovec(): switch to generic copying of iovecs
use {compat_,}rw_copy_check_uvector().  As the result, we are
guaranteed that all iovecs seen in ->msg_iov by ->sendmsg()
and ->recvmsg() will pass access_ok().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:16 -05:00
Markus Elfring ef87c5d6a1 net: pktgen: Deletion of an unnecessary check before the function call "proc_remove"
The proc_remove() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:20:15 -05:00
Fabian Frederick 54aeba7f06 dev_ioctl: use sizeof(x) instead of sizeof x
Also remove spaces after cast.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:27:32 -05:00
Fabian Frederick e56f735913 net/core: include linux/types.h instead of asm/types.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:26:32 -05:00
Fabian Frederick 1d2398dc7c net: fix spelling for synchronized
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:26:32 -05:00
Eric Dumazet 960fb622f8 net: provide a per host RSS key generic infrastructure
RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
RSS key.

Some drivers use a constant (and well known key), some drivers use a random
key per port, making bonding setups hard to tune. Well known keys increase
attack surface, considering that number of queues is usually a power of two.

This patch provides infrastructure to help drivers doing the right thing.

netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
even if they provide ethtool -X support to let user redefine the key later.

A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
RSS key even for drivers not providing ethtool -x support, in case some
applications want to precisely setup flows to match some RX queues.

Tested:

myhost:~# cat /proc/sys/net/core/netdev_rss_key
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0:      0     1     2     3     4     5     6     7
RSS hash key:
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-16 15:59:11 -05:00
Ingo Molnar e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Michal Kubeček fbe168ba91 net: generic dev_disable_lro() stacked device handling
Large receive offloading is known to cause problems if received packets
are passed to other host. Therefore the kernel disables it by calling
dev_disable_lro() whenever a network device is enslaved in a bridge or
forwarding is enabled for it (or globally). For virtual devices we need
to disable LRO on the underlying physical device (which is actually
receiving the packets).

Current dev_disable_lro() code handles this  propagation for a vlan
(including 802.1ad nested vlan), macvlan or a vlan on top of a macvlan.
It doesn't handle other stacked devices and their combinations, in
particular propagation from a bond to its slaves which often causes
problems in virtualization setups.

As we now have generic data structures describing the upper-lower device
relationship, dev_disable_lro() can be generalized to disable LRO also
for all lower devices (if any) once it is disabled for the device
itself.

For bonding and teaming devices, it is necessary to disable LRO not only
on current slaves at the moment when dev_disable_lro() is called but
also on any slave (port) added later.

v2: use lower device links for all devices (including vlan and macvlan)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 14:48:56 -05:00
WANG Cong d7480fd3b1 neigh: remove dynamic neigh table registration support
Currently there are only three neigh tables in the whole kernel:
arp table, ndisc table and decnet neigh table. What's more,
we don't support registering multiple tables per family.
Therefore we can just make these tables statically built-in.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 15:23:54 -05:00
Joe Perches ba7a46f16d net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_<LEVEL> uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled.  Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings".  For backward compatibility,
the sysctl is not removed, but it has no function.  The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 14:10:31 -05:00
Eric Dumazet 2c8c56e15d net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.

Then split a set of million of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.

Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.

We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.

After accept(), connect(), or even file descriptor passing around
processes, applications can use :

 int cpu;
 socklen_t len = sizeof(cpu);

 getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);

And use this information to put the socket into the right silo
for optimal performance, as all networking stack should run
on the appropriate cpu, without need to send IPI (RPS/RFS).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:00:06 -05:00
Eric Dumazet 3b47d30396 net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.

Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.

To reach big GRO packets on 10Gbe NIC, one can use :

ethtool -C eth0 rx-usecs 4 rx-frames 44

But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.

Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.

This patch uses a different strategy : Let GRO stack decides what do do,
based on traffic pattern.

Packets with Push flag wont be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.

This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanosecond.

To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().

Tested:
 Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)

Without this feature, we send back about 305,000 ACK per second.

GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)

Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)

Receiver performs less calls to upper stacks, less wakes up.
This also reduces cpu usage on the sender, as it receives less ACK
packets.

Note that reducing number of wakes up increases cpu efficiency, but can
decrease QPS, as applications wont have the chance to warmup cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00
0.00      0.50

B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00
0.00      0.50

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 12:05:59 -05:00
Herbert Xu bfe1be38fc net: Kill skb_copy_datagram_const_iovec
Now that both macvtap and tun are using skb_copy_datagram_iter, we
can kill the abomination that is skb_copy_datagram_const_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07 12:13:34 -05:00
Herbert Xu a8f820aa40 inet: Add skb_copy_datagram_iter
This patch adds skb_copy_datagram_iter, which is identical to
skb_copy_datagram_iovec except that it operates on iov_iter
instead of iovec.

Eventually all users of skb_copy_datagram_iovec should switch
over to iov_iter and then we can remove skb_copy_datagram_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07 12:13:34 -05:00
Simon Horman 25cd9ba0ab openvswitch: Add basic MPLS support to kernel
Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05 23:52:33 -08:00
Pravin B Shelar 59b93b41e7 net: Remove MPLS GSO feature.
Device can export MPLS GSO support in dev->mpls_features same way
it export vlan features in dev->vlan_features. So it is safe to
remove NETIF_F_GSO_MPLS redundant flag.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05 23:52:33 -08:00
David S. Miller 51f3d02b98 net: Add and use skb_copy_datagram_msg() helper.
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be less places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 16:46:40 -05:00
Tom Herbert e585f23636 udp: Changes to udp_offload to support remote checksum offload
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).

Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 16:30:03 -05:00
Peter Zijlstra ff960a7317 netdev, sched/wait: Fix sleeping inside wait event
rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a
wait loop. The wait loop relies on current->state to function, but so
does mutex_lock(), nesting them makes for the inner to destroy the
outer state.

Fix this using the new wait_woken() bits.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: sfeldma@cumulusnetworks.com <sfeldma@cumulusnetworks.com>
Cc: stephen hemminger <stephen@networkplumber.org>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tom Herbert <therbert@google.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:48 +01:00
Eric Dumazet d75b1ade56 net: less interrupt masking in NAPI
net_rx_action() can mask irqs a single time to transfert sd->poll_list
into a private list, for a very short duration.

Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irq again in slow path.

This patch removes 2 couples of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.

Note this also allows to give control back to caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be wakeup earlier under pressure.

This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.

Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:25:09 -05:00
David S. Miller 55b42b5ca2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/marvell.c

Simple overlapping changes in drivers/net/phy/marvell.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-01 14:53:27 -04:00
Guenter Roeck e0fb6fb6d5 net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with lengh 0
If a driver supports reading EEPROM but no EEPROM is installed in the system,
the driver's get_eeprom_len function returns 0. ethtool will subsequently
try to read that zero-length EEPROM anyway. If the driver does not support
EEPROM access at all, this operation will return -EOPNOTSUPP. If the driver
does support EEPROM access but no EEPROM is installed, the operation will
return -EINVAL. Return -EOPNOTSUPP in both cases for consistency.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31 16:12:34 -04:00
Nicolas Dichtel 75fbfd3323 neigh: optimize neigh_parms_release()
In neigh_parms_release() we loop over all entries to find the entry given in
argument and being able to remove it from the list. By using a double linked
list, we can avoid this loop.

Here are some numbers with 30 000 dummy interfaces configured:

Before the patch:
$ time rmmod dummy
real	2m0.118s
user	0m0.000s
sys	1m50.048s

After the patch:
$ time rmmod dummy
real	1m9.970s
user	0m0.000s
sys	0m47.976s

Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 16:11:50 -04:00
Eric Dumazet bc9ad166e3 net: introduce napi_schedule_irqoff()
napi_schedule() can be called from any context and has to mask hard
irqs.

Add a variant that can only be called from hard interrupts handlers
or when irqs are already masked.

Many NIC drivers can use it from their hard IRQ handler instead of
generic variant.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 16:07:27 -04:00
Toshiaki Makita 432c856fcf net: skb_segment() should preserve backpressure
This patch generalizes commit d6a4a10411 ("tcp: GSO should be TSQ
friendly") to protocols using skb_set_owner_w()

TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
as explained in commit 6ff50cd555 ("tcp: gso: do not generate out of
order packets")

This allows UDP sockets using UFO to get proper backpressure,
thus avoiding qdisc drops and excessive cpu usage.

Here are performance test results (macvlan on vlan):

- Before
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      144096 1224195    1258.56
212992           60.00          51              0.45

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.23      0.00     25.26      0.08      0.00     74.43

- After
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      109593      0     957.20
212992           60.00      109593            957.20

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.18      0.00      8.38      0.02      0.00     91.43

[edumazet] Rewrote patch and changelog.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 14:47:19 -04:00
Eric Dumazet 93a35f59f1 net: napi_reuse_skb() should check pfmemalloc
Do not reuse skb if it was pfmemalloc tainted, otherwise
future frame might be dropped anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-26 22:47:23 -04:00
Karl Beldan a63ba13eec net: tso: fix unaligned access to crafted TCP header in helper API
The crafted header start address is from a driver supplied buffer, which
one can reasonably expect to be aligned on a 4-bytes boundary.
However ATM the TSO helper API is only used by ethernet drivers and
the tcp header will then be aligned to a 2-bytes only boundary from the
header start address.

Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-22 12:52:55 -04:00
Florian Westphal f993bc25e5 net: core: handle encapsulation offloads when computing segment lengths
if ->encapsulation is set we have to use inner_tcp_hdrlen and add the
size of the inner network headers too.

This is 'mostly harmless'; tbf might send skb that is slightly over
quota or drop skb even if it would have fit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-20 12:38:13 -04:00
Linus Torvalds 2e923b0251 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Include fixes for netrom and dsa (Fabian Frederick and Florian
    Fainelli)

 2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.

 3) Several SKB use after free fixes (vxlan, openvswitch, vxlan,
    ip_tunnel, fou), from Li ROngQing.

 4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.

 5) Use after free in virtio_net, from Michael S Tsirkin.

 6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
    Shelar.

 7) ISDN gigaset and capi bug fixes from Tilman Schmidt.

 8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.

 9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.

10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
    Wang and Eric Dumazet.

11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
    chunks, from Daniel Borkmann.

12) Don't call sock_kfree_s() with NULL pointers, this function also has
    the side effect of adjusting the socket memory usage.  From Cong Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  bna: fix skb->truesize underestimation
  net: dsa: add includes for ethtool and phy_fixed definitions
  openvswitch: Set flow-key members.
  netrom: use linux/uaccess.h
  dsa: Fix conversion from host device to mii bus
  tipc: fix bug in bundled buffer reception
  ipv6: introduce tcp_v6_iif()
  sfc: add support for skb->xmit_more
  r8152: return -EBUSY for runtime suspend
  ipv4: fix a potential use after free in fou.c
  ipv4: fix a potential use after free in ip_tunnel_core.c
  hyperv: Add handling of IP header with option field in netvsc_set_hash()
  openvswitch: Create right mask with disabled megaflows
  vxlan: fix a free after use
  openvswitch: fix a use after free
  ipv4: dst_entry leak in ip_send_unicast_reply()
  ipv4: clean up cookie_v4_check()
  ipv4: share tcp_v4_save_options() with cookie_v4_check()
  ipv4: call __ip_options_echo() in cookie_v4_check()
  atm: simplify lanai.c by using module_pci_driver
  ...
2014-10-18 09:31:37 -07:00
Tom Herbert 04ffcb255f net: Add ndo_gso_check
Add ndo_gso_check which a device can define to indicate whether is
is capable of doing GSO on a packet. This funciton would be called from
the stack to determine whether software GSO is needed to be done. A
driver should populate this function if it advertises GSO types for
which there are combinations that it wouldn't be able to handle. For
instance a device that performs UDP tunneling might only implement
support for transparent Ethernet bridging type of inner packets
or might have limitations on lengths of inner headers.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 12:11:00 -04:00