1
0
Fork 0
Commit Graph

416464 Commits (fd27e0d44a893b45832df3cb8e021bd1d773a73f)

Author SHA1 Message Date
Daniel Borkmann fd27e0d44a net: vxlan: do not use vxlan_net before checking event type
Jesse Brandeburg reported that commit acaf4e7099 caused a panic
when adding a network namespace while vxlan module was present in
the system:

[<ffffffff814d0865>] vxlan_lowerdev_event+0xf5/0x100
[<ffffffff816e9e5d>] notifier_call_chain+0x4d/0x70
[<ffffffff810912be>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810912d6>] raw_notifier_call_chain+0x16/0x20
[<ffffffff815d9610>] call_netdevice_notifiers_info+0x40/0x70
[<ffffffff815d9656>] call_netdevice_notifiers+0x16/0x20
[<ffffffff815e1bce>] register_netdevice+0x1be/0x3a0
[<ffffffff815e1dce>] register_netdev+0x1e/0x30
[<ffffffff814cb94a>] loopback_net_init+0x4a/0xb0
[<ffffffffa016ed6e>] ? lockd_init_net+0x6e/0xb0 [lockd]
[<ffffffff815d6bac>] ops_init+0x4c/0x150
[<ffffffff815d6d23>] setup_net+0x73/0x110
[<ffffffff815d725b>] copy_net_ns+0x7b/0x100
[<ffffffff81090e11>] create_new_namespaces+0x101/0x1b0
[<ffffffff81090f45>] copy_namespaces+0x85/0xb0
[<ffffffff810693d5>] copy_process.part.26+0x935/0x1500
[<ffffffff811d5186>] ? mntput+0x26/0x40
[<ffffffff8106a15c>] do_fork+0xbc/0x2e0
[<ffffffff811b7f2e>] ? ____fput+0xe/0x10
[<ffffffff81089c5c>] ? task_work_run+0xac/0xe0
[<ffffffff8106a406>] SyS_clone+0x16/0x20
[<ffffffff816ee689>] stub_clone+0x69/0x90
[<ffffffff816ee329>] ? system_call_fastpath+0x16/0x1b

Apparently loopback device is being registered first and thus we
receive an event notification when vxlan_net is not ready. Hence,
when we call net_generic() and request vxlan_net_id, we seem to
access garbage at that point in time. In setup_net() where we set
up a newly allocated network namespace, we traverse the list of
pernet ops ...

list_for_each_entry(ops, &pernet_list, list) {
	error = ops_init(ops, net);
	if (error < 0)
		goto out_undo;
}

... and loopback_net_init() is invoked first here, so in the middle
of setup_net() we get this notification in vxlan. As currently we
only care about devices that unregister, move access through
net_generic() there. Fix is based on Cong Wang's proposal, but
only changes what is needed here. It sucks a bit as we only work
around the actual cure: right now it seems the only way to check if
a netns actually finished traversing all init ops would be to check
if it's part of net_namespace_list. But that I find quite expensive
each time we go through a notifier callback. Anyway, did a couple
of tests and it seems good for now.

Fixes: acaf4e7099 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:49:18 -08:00
David S. Miller 7dd40c197c Merge branch 'ixgbe'
Aaron Brown says:

====================
Intel Wired LAN Driver Updates

This series contains updates to ixgbe Ethan Zhao.  The first one replaces
the magic number "63" with a macro, IXGBE_MAX_VFS_DRV_LIMIT, the second
moves the call to set driver_max_VFS to before SRIOV is enabled.

The code of these patches match the v3 (1/2) and v2 (2/2) versions sent
to the e1000-devel and netdev mailing lists.  The intermediate versions
(v4, v5) are from sorting out style issues, mostly tabs to spaces and
split lines probably introduced via mailer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:38:13 -08:00
ethan.zhao 31ac910e10 ixgbe: set driver_max_VFs should be done before enabling SRIOV
commit 43dc4e01 Limit number of reported VFs to device
 specific value It doesn't work and always returns -EBUSY because VFs are
 already enabled.

ixgbe_enable_sriov()
        pci_enable_sriov()
                sriov_enable()
                {
                ... ..
                iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
                pci_cfg_access_lock(dev);
                ... ...
                }

pci_sriov_set_totalvfs()
{
... ...
if (dev->sriov->ctrl & PCI_SRIOV_CTRL_VFE)
                return -EBUSY;
...
}

So should set driver_max_VFs with pci_sriov_set_totalvfs() before
enable VFs with ixgbe_enable_sriov().

V2: revised for net-next tree.

Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:37:31 -08:00
ethan.zhao dcc23e3ab6 ixgbe: define IXGBE_MAX_VFS_DRV_LIMIT macro and cleanup const 63
Because ixgbe driver limit the max number of VF
 functions could be enabled to 63, so define one macro IXGBE_MAX_VFS_DRV_LIMIT
 and cleanup the const 63 in code.

v3: revised for net-next tree.

Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:37:31 -08:00
Eric Dumazet 6c7e7610ff ipv4: fix a dst leak in tunnels
This patch :

1) Remove a dst leak if DST_NOCACHE was set on dst
   Fix this by holding a reference only if dst really cached.

2) Remove a lockdep warning in __tunnel_dst_set()
    This was reported by Cong Wang.

3) Remove usage of a spinlock where xchg() is enough

4) Remove some spurious inline keywords.
   Let compiler decide for us.

Fixes: 7d442fab0a ("ipv4: Cache dst in tunnels")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:36:39 -08:00
Simon Horman db893473d3 sh_eth: Add support for r7s72100
The r7s72100 SoC includes a fast ethernet controller.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:13:58 -08:00
Simon Horman 504c8ca55c sh_eth: Use bool as return type of sh_eth_is_gether()
Return a boolean from sh_eth_is_gether() and refactor it as a one-liner.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:13:58 -08:00
Flavio Leitner 6a7cc41872 ipv6: send Change Status Report after DAD is completed
The RFC 3810 defines two type of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.

On the other hand, any change should be reported immediately
using "State Change Report" messages. Since it's an event
triggered by a change and that it can be affected by packet
loss, the rfc states it should be retransmitted [RobVar] times
to make sure routers will receive timely.

Currently, we are sending "Current State Reports" after
DAD is completed.  Before that, we send messages using
unspecified address (::) which should be silently discarded
by routers.

This patch changes to send "State Change Report" messages
after DAD is completed fixing the behavior to be RFC compliant
and also to pass TAHI IPv6 testsuite.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:12:29 -08:00
stephen hemminger c3bc40e28b qlcnic: remove unused code
Remove function  qlcnic_enable_eswitch which was defined
but never used in current code.

Compile tested only.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:04:19 -08:00
stephen hemminger 2104140043 qlcnic: make local functions static
Functions only used in one file should be static.
Found by running make namespacecheck

Compile tested only.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 18:04:19 -08:00
Florent Fourcot 1d13a96c74 ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT
This patch is following the commit b903d324be (ipv6: tcp: fix TCLASS
value in ACK messages sent from TIME_WAIT).

For the same reason than tclass, we have to store the flow label in the
inet_timewait_sock to provide consistency of flow label on the last ACK.

Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 17:56:33 -08:00
David S. Miller d037c4d70f Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
Please pull this batch of updates for the 3.14 stream!

For the mac80211 bits, Johannes says:

"This time I have uAPSD fixes since I was working on that, hwsim
improvements to make dynamic radios possible for the test suite, the
evidently long-overdue channel_change_time removal and a few other small
collected fix and improvements."

For the iwlwifi bits, Emmanuel says:

"Besides a few trivial patches, I have an important workaround for a HW
issue that has kept me busy for a long time. Along with it, a fix that
prevents an error from being printed.
Eyal fixes our behavior against SISO APs and Ilan fixes an issue with
multiple interface scenarios.
Eliad fixes an error path in our init flow.
We also have a few 'static analyzers' fix."

For the NFC bits, Samuel says:

"It includes:

* A new NFC driver for Marvell's 8897, and a few NCI fixes and
  improvements needed to support this chipset.

* An LLCP fix for how we were setting the default MIU on a p2p link. If
  there is no explicit MIU extension announced at connection time, we
  must use the default one and not the one announced at LLCP link
  establishement time.

* A pn544 EEPROM config update. Some of the currently EEPROM configured
  values are overwriting the firmware ones while other should not be set
  by the driver itself.

* Some NFC digital stack fixes and improvements. Asynchronous functions
  are better documented, RF technologies and CRC functions are set upon
  PSL_REQ reception, and a few minor bugs are fixed.

* Minor and miscelaneous pn533, mei_phy and port100 fixes."

For the ath bits, Kalle says:

"Janusz added Kconfig option for DFS. The DFS code was there already, but
after fixes to mac80211 we can now enable it.

Bartosz added a runtime firmware feature flag to disable P2P. Our 10.1
firmware branch doesn't support P2P and ath10k can now disable that. He
also added a limit for how many clients can connect to ath10k AP.

Michal fixed WEP shared authentication, in case someone still uses it.
And I added firmware debug log to help the firmware engineers."

Along with that is a small batch of ath9k updates and a few other bits
here and there.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17 17:30:55 -08:00
John W. Linville 7916a07557 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2014-01-17 14:43:17 -05:00
David S. Miller cf84eb0b09 Merge branch 'virtio_rx_merging'
Michael Dalton says:

====================
virtio-net: mergeable rx buffer size auto-tuning

The virtio-net device currently uses aligned MTU-sized mergeable receive
packet buffers. Network throughput for workloads with large average
packet size can be improved by posting larger receive packet buffers.
However, due to SKB truesize effects, posting large (e.g, PAGE_SIZE)
buffers reduces the throughput of workloads that do not benefit from GRO
and have no large inbound packets.

This patchset introduces virtio-net mergeable buffer size auto-tuning,
with buffer sizes ranging from aligned MTU-size to PAGE_SIZE. Packet
buffer size is chosen based on a per-receive queue EWMA of incoming
packet size.

To unify mergeable receive buffer memory allocation and improve
SKB frag coalescing, all mergeable buffer memory allocation is
migrated to per-receive queue page frag allocators.

The per-receive queue mergeable packet buffer size is exported via
sysfs, and the network device sysfs layer has been extended to add
support for device-specific per-receive queue sysfs attribute groups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:17 -08:00
Michael Dalton fbf28d78f5 virtio-net: initial rx sysfs support, export mergeable rx buffer size
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:07 -08:00
Michael Dalton 03144b5869 lib: Ensure EWMA does not store wrong intermediate values
To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Michael Dalton a953be53ce net-sysfs: add support for device-specific rx queue sysfs attributes
Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Michael Dalton ab7db91705 virtio-net: auto-tune mergeable rx buffer size for improved performance
Commit 2613af0ed1 ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba27524103 ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed1 (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Michael Dalton fb51879dbc virtio-net: use per-receive queue page frag alloc for mergeable bufs
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Michael Dalton 097b4f19e5 net: allow > 0 order atomic page alloc in skb_page_frag_refill
skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 23:46:06 -08:00
Wei Yongjun 722e47d792 net_sched: fix error return code in fw_change_attrs()
The error code was not set if change indev fail, so the error
condition wasn't reflected in the return value. Fix to return a
negative error code from this error handling case instead of 0.

Fixes: 2519a602c2 ('net_sched: optimize tcf_match_indev()')
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:12:03 -08:00
David S. Miller 8b88a11e44 Merge branch 'tipc'
Ying Xue says:

====================
tipc: align TIPC behaviours of waiting for events with other stacks

Comparing the current implementations of waiting for events in TIPC
socket layer with other stacks, TIPC's behaviour is very different
because wait_event_interruptible_timeout()/wait_event_interruptible()
are always used by TIPC to wait for events while relevant socket or
port variables are fed to them as their arguments. As socket lock has
to be released temporarily before the two routines of waiting for
events are called, their arguments associated with socket or port
structures are out of socket lock protection. This might cause
serious issues where the process of calling socket syscall such as
sendsmg(), connect(), accept(), and recvmsg(), cannot be waken up
at all even if proper event arrives or improperly be woken up
although the condition of waking up the process is not satisfied
in practice.

Therefore, aligning its behaviours with similar functions implemented
in other stacks, for instance, sk_stream_wait_connect() and
inet_csk_wait_for_connect() etc, can avoid above risks for us.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:11:22 -08:00
Ying Xue 9bbb4ecc68 tipc: standardize recvmsg routine
Standardize the behaviour of waiting for events in TIPC recvmsg()
so that all variables of socket or port structures are protected
within socket lock, allowing the process of calling recvmsg() to
be woken up at appropriate time.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:10:34 -08:00
Ying Xue 391a6dd1da tipc: standardize sendmsg routine of connected socket
Standardize the behaviour of waiting for events in TIPC send_packet()
so that all variables of socket or port structures are protected within
socket lock, allowing the process of calling sendmsg() to be woken up
at appropriate time.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:10:34 -08:00
Ying Xue 3f40504f7e tipc: standardize sendmsg routine of connectionless socket
Comparing the behaviour of how to wait for events in TIPC sendmsg()
with other stacks, the TIPC implementation might be perceived as
different, and sometimes even incorrect. For instance, sk_sleep()
and tport->congested variables associated with socket are exposed
without socket lock protection while wait_event_interruptible_timeout()
accesses them. So standardizing it with similar implementation
in other stacks can help us correct these errors which the process
of calling sendmsg() cannot be woken up event if an expected event
arrive at socket or improperly woken up although the wake condition
doesn't match.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:10:34 -08:00
Ying Xue 6398e23cdb tipc: standardize accept routine
Comparing the behaviour of how to wait for events in TIPC accept()
with other stacks, the TIPC implementation might be perceived as
different, and sometimes even incorrect. As sk_sleep() and
sk->sk_receive_queue variables associated with socket are not
protected by socket lock, the process of calling accept() may be
woken up improperly or sometimes cannot be woken up at all. After
standardizing it with inet_csk_wait_for_connect routine, we can
get benefits including: avoiding 'thundering herd' phenomenon,
adding a timeout mechanism for accept(), coping with a pending
signal, and having sk_sleep() and sk->sk_receive_queue being
always protected within socket lock scope and so on.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:10:34 -08:00
Ying Xue 78eb3a5379 tipc: standardize connect routine
Comparing the behaviour of how to wait for events in TIPC connect()
with other stacks, the TIPC implementation might be perceived as
different, and sometimes even incorrect. For instance, as both
sock->state and sk_sleep() are directly fed to
wait_event_interruptible_timeout() as its arguments, and socket lock
has to be released before we call wait_event_interruptible_timeout(),
the two variables associated with socket are exposed out of socket
lock protection, thereby probably getting stale values so that the
process of calling connect() cannot be woken up exactly even if
correct event arrives or it is woken up improperly even if the wake
condition is not satisfied in practice. Therefore, standardizing its
behaviour with sk_stream_wait_connect routine can avoid these risks.

Additionally the implementation of connect routine is simplified as a
whole, allowing it to return correct values in all different cases.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 19:10:34 -08:00
wangweidong abfce3ef58 sctp: remove the unnecessary assignment
When go the right path, the status is 0, no need to assign it again.
So just remove the assignment.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:31:49 -08:00
Jason Wang be121f46af virtio-net: drop rq->max and rq->num
It looks like there's no need for those two fields:

- Unless there's a failure for the first refill try, rq->max should be always
  equal to the vring size.
- rq->num is only used to determine the condition that we need to do the refill,
  we could check vq->num_free instead.
- rq->num was required to be increased or decreased explicitly after each
  get/put which results a bad API.

So this patch removes them both to make the code simpler.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:30:42 -08:00
Lad, Prabhakar 9b05f462b7 net: davinci_mdio: Fix sparse warning
This patch fixes following sparse warning
davinci_mdio.c:85:27: warning: symbol 'default_pdata' was not declared. Should it be static?
Also makes the default_pdata as a constant.

Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:29:53 -08:00
Veaceslav Falico 3ec775b9fb bonding: handle slave's name change with primary_slave logic
Currently, if a slave's name change, we just pass it by. However, if the
slave is a current primary_slave, then we end up with using a slave, whose
name != params.primary, for primary_slave. And vice-versa, if we don't have
a primary_slave but have params.primary set - we will not detected a new
primary_slave.

Fix this by catching the NETDEV_CHANGENAME event and setting primary_slave
accordingly. Also, if the primary_slave was changed, issue a reselection of
the active slave, cause the priorities have changed.

Reported-by: Ding Tianhong <dingtianhong@huawei.com>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:26:47 -08:00
WANG Cong 6c80563c2f net_sched: act: pick a different type for act_xt
In tcf_register_action() we check either ->type or ->kind to see if
there is an existing action registered, but ipt action registers two
actions with same type but different kinds. They should have different
types too.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:24:11 -08:00
David S. Miller 7dff08bbda Included change:
- properly format already existing kerneldoc
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJS1xdYAAoJEEKTMo6mOh1VNv0P/R73udErvBaHTUIAVSJ3pH8/
 rQ673H+/ZDi6zJQNlNe5UHLu5/IrK/ARCz5Ldl71lNpNHjRNGloSmjIUsdOZY01q
 D8ZR9YdWxpVcTCNSKxGnDrhVyShM3bN70h2KuYWv01/IsTGJKvkkhnG0SXqJqKYm
 XBRrEyVnyn52mQM7BhWSkCrX+wSb8OyEtie4Gdn8LjKJ0i2LYeQZ7vnXSR4aORXt
 pqYe7O+cwWZoj7Vpqea8Ye7srt6BEATEGwuezcQYH+FvaIEepORHWsqqZm6z/o91
 wyjjoPcFVyKoEUASdrkC0Hq9eqoaRBVo3yZ/AWlFJ06O/eVhTNKIBN6HykmPJ+vW
 9QESaHjZ8Gackz6ehU86h7mu6Mtrbfk73BZfG23SVL6cZltXazE539v0Pv8Ge+P/
 2NKdDjRp1vCHViRLNCwX20VmZFg7bHVDHgYV0gBH/VgZAihhRhSQk9LEEhk1fbij
 UQjBMtRnWOoxNHi311Khsxqm+0X82/mBN2mw5VdV5S9ZnBXbbCNRAuxKlZWiZNGV
 MJ2oPUrzs4RIlhKuCc1Ckjr2n7xtbU/j05ASZ4iJgbDkulkGEniDBL+K1bOFIPJl
 Kb0T4F/0ucz2IcbeDijG8Xe+5XZmaFNbDQ0ooBiHrOAEVM9vqsiQ76P5uetWPnWr
 Gop6D4HE4fCGjWcONp1u
 =1KKs
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Included change:
- properly format already existing kerneldoc

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:22:58 -08:00
WANG Cong fb1d598d48 net_sched: act: use tcf_hash_release() in net/sched/act_police.c
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:21:55 -08:00
Shannon Nelson 0aebd2d9ab i40e: updates to AdminQ interface
Refinements to cloud support in the Firmware API.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:20:51 -08:00
Shannon Nelson 68bf94aae1 i40e: check desc pointer before printing
Check that the descriptors were allocated before trying to dump
them to the logfile.  While we're there, de-trick-ify the code
so as to be easier to read and not abusing the types and unions.

Change-ID: I22898f4b22cecda3582d4d9e4018da9cd540f177
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:20:37 -08:00
Veaceslav Falico b01f236c66 team: block mtu change before it happens via NETDEV_PRECHANGEMTU
Now it catches the NETDEV_CHANGEMTU notification, which is signaled after
the actual change happened on the device, and returns NOTIFY_BAD, so that
the change on the device is reverted.

This might be quite costly and messy, so use the new NETDEV_PRECHANGEMTU to
catch the MTU change before the actual change happens and signal that it's
forbidden to do it.

CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:15:42 -08:00
Veaceslav Falico 1d486bfb66 net: add NETDEV_PRECHANGEMTU to notify before mtu change happens
Currently, if a device changes its mtu, first the change happens (invloving
all the side effects), and after that the NETDEV_CHANGEMTU is sent so that
other devices can catch up with the new mtu. However, if they return
NOTIFY_BAD, then the change is reverted and error returned.

This is a really long and costy operation (sometimes). To fix this, add
NETDEV_PRECHANGEMTU notification which is called prior to any change
actually happening, and if any callee returns NOTIFY_BAD - the change is
aborted. This way we're skipping all the playing with apply/revert the mtu.

CC: "David S. Miller" <davem@davemloft.net>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Eric Dumazet <edumazet@google.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 17:15:41 -08:00
Florian Fainelli 31cf344caf r6040: use ETH_ZLEN instead of MISR for SKB length checking
Ever since this driver was merged the following code was included:

if (skb->len < MISR)
	skb->len = MISR;

MISR is defined to 0x3C which is also equivalent to ETH_ZLEN, but use
ETH_ZLEN directly which is exactly what we want to be checking for.

Reported-by: Marc Volovic <marcv@ezchip.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
Florian Fainelli 4f8d9f3ce0 r6040: add delays in MDIO read/write polling loops
On newer and faster machines (Vortex X86DX) using the r6040 driver, it
was noticed that the driver was returning an error during probing traced
down to being the MDIO bus probing and the inability to complete a MDIO
read operation in time. It turns out that the MDIO operations on these
faster machines usually complete after ~2140 iterations which is bigger
than 2048 (MAC_DEF_TIMEOUT) and results in spurious timeouts depending
on the system load.

Update r6040_phy_read() and r6040_phy_write() to include a 1
micro second delay in each busy-looping iteration of the loop which is a
much safer operation than incrementing MAC_DEF_TIMEOUT.

Reported-by: Nils Koehler <nils.koehler@ibt-interfaces.de>
Reported-by: Daniel Goertzen <daniel.goertzen@gmail.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
Paul Durrant 2c0057dec9 xen-netfront: add support for IPv6 offloads
This patch adds support for IPv6 checksum offload and GSO when those
features are available in the backend.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
Tom Herbert 0b4cec8c2e net: Check skb->rxhash in gro_receive
When initializing a gro_list for a packet, first check the rxhash of
the incoming skb against that of the skb's in the list. This should be
a very strong inidicator of whether the flow is going to be matched,
and potentially allows a lot of other checks to be short circuited.
Use skb_hash_raw so that we don't force the hash to be calculated.

Tested by running netperf 200 TCP_STREAMs between two machines with
GRO, HW rxhash, and 1G. Saw no performance degration, slight reduction
of time in dev_gro_receive.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
Tom Herbert 57bdf7f42b net: Add skb_get_hash_raw
Function to just return skb->rxhash without checking to see if it needs
to be recomputed.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
stephen hemminger e40c10fc89 vxge: make local functions static
Remove unused function vxge_hw_vpath_vid_get

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:54 -08:00
stephen hemminger 2fd888a5e7 bnad: code cleanup
Use 'make namespacecheck' to code that could be declared static.
After that remove code that is not being used.

Compile tested only.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:53 -08:00
Barry Song 5b22721d1f dm9000: fix a lot of checkpatch issues
recently, dm9000 codes have many checkpatch errors and warnings:

WARNING: please, no space before tabs
3: FILE: dm9000.c:3:
+ * ^ICopyright (C) 1997  Sten Wang$

WARNING: please, no space before tabs
5: FILE: dm9000.c:5:
+ * ^IThis program is free software; you can redistribute it and/or$

WARNING: please, no space before tabs
6: FILE: dm9000.c:6:
+ * ^Imodify it under the terms of the GNU General Public License$

WARNING: please, no space before tabs
7: FILE: dm9000.c:7:
+ * ^Ias published by the Free Software Foundation; either version 2$

WARNING: please, no space before tabs
8: FILE: dm9000.c:8:
+ * ^Iof the License, or (at your option) any later version.$

WARNING: please, no space before tabs
10: FILE: dm9000.c:10:
+ * ^IThis program is distributed in the hope that it will be useful,$

WARNING: please, no space before tabs
11: FILE: dm9000.c:11:
+ * ^Ibut WITHOUT ANY WARRANTY; without even the implied warranty of$

WARNING: please, no space before tabs
12: FILE: dm9000.c:12:
+ * ^IMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the$

WARNING: please, no space before tabs
13: FILE: dm9000.c:13:
+ * ^IGNU General Public License for more details.$

WARNING: do not add new typedefs
97: FILE: dm9000.c:97:
+typedef struct board_info {

ERROR: spaces prohibited around that ':' (ctx:WxV)
113: FILE: dm9000.c:113:
+	unsigned int	in_suspend :1;
 	            	           ^

ERROR: spaces prohibited around that ':' (ctx:WxV)
114: FILE: dm9000.c:114:
+	unsigned int	wake_supported :1;
 	            	               ^

This patch fixes important errors in it.

Signed-off-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:22:53 -08:00
Daniel Borkmann b013840810 packet: use percpu mmap tx frame pending refcount
In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
and one atomic_dec() call in skb destructor and use a percpu
reference count instead in order to determine if packets are
still pending to be sent out. Micro-benchmark with [1] that has
been slightly modified (that is, protcol = 0 in socket(2) and
bind(2)), example on a rather crappy testing machine; I expect
it to scale and have even better results on bigger machines:

./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

With patch:    4,022,015 cyc
Without patch: 4,812,994 cyc

time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

With patch:
  real         1m32.241s
  user         0m0.287s
  sys          1m29.316s

Without patch:
  real         1m38.386s
  user         0m0.265s
  sys          1m35.572s

In function tpacket_snd(), it is okay to use packet_read_pending()
since in fast-path we short-circuit the condition already with
ph != NULL, since we have next frames to process. In case we have
MSG_DONTWAIT, we also do not execute this path as need_wait is
false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
okay to call a packet_read_pending(), because when we ever reach
that path, we're done processing outgoing frames anyway and only
look if there are skbs still outstanding to be orphaned. We can
stay lockless in this percpu counter since it's acceptable when we
reach this path for the sum to be imprecise first, but we'll level
out at 0 after all pending frames have reached the skb destructor
eventually through tx reclaim. When people pin a tx process to
particular CPUs, we expect overflows to happen in the reference
counter as on one CPU we expect heavy increase; and distributed
through ksoftirqd on all CPUs a decrease, for example. As
David Laight points out, since the C language doesn't define the
result of signed int overflow (i.e. rather than wrap, it is
allowed to saturate as a possible outcome), we have to use
unsigned int as reference count. The sum over all CPUs when tx
is complete will result in 0 again.

The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
can _only_ be set from inside tpacket_snd() path and we made sure
to increase tx_ring.pending in any case before we called po->xmit(skb).
So testing for tx_ring.pending == 0 is not too useful. Instead, it
would rather have been useful to test if lower layers didn't orphan
the skb so that we're missing ring slots being put back to
TP_STATUS_AVAILABLE. But such a bug will be caught in user space
already as we end up realizing that we do not have any
TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

Btw, in case of RX_RING path, we do not make use of the pending
member, therefore we also don't need to use up any percpu memory
here. Also note that __alloc_percpu() already returns a zero-filled
percpu area, so initialization is done already.

  [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:17:12 -08:00
Daniel Borkmann 87a2fd286a packet: don't unconditionally schedule() in case of MSG_DONTWAIT
In tpacket_snd(), when we've discovered a first frame that is
not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
we exit the send routine in case of MSG_DONTWAIT, since we've
finished traversing the mmaped send ring buffer and don't care
about pending frames.

While doing so, we still unconditionally call an expensive
schedule() in the packet_current_frame() "error" path, which
is unnecessary in this case since it's enough to just quit
the function.

Also, in case MSG_DONTWAIT is not set, we should rather test
for need_resched() first and do schedule() only if necessary
since meanwhile pending frames could already have finished
processing and called skb destructor.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:17:11 -08:00
Daniel Borkmann 902fefb82e packet: improve socket create/bind latency in some cases
Most people acquire PF_PACKET sockets with a protocol argument in
the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
all its sockets. Most likely, at some point in time a subsequent
bind() call will follow, e.g. in libpcap with ...

  memset(&sll, 0, sizeof(sll));
  sll.sll_family          = AF_PACKET;
  sll.sll_ifindex         = ifindex;
  sll.sll_protocol        = htons(ETH_P_ALL);

... as arguments. What happens in the kernel is that already
in socket() syscall, we install a proto hook via register_prot_hook()
if our protocol argument is != 0. Yet, in bind() we're almost
doing the same work by doing a unregister_prot_hook() with an
expensive synchronize_net() call in case during socket() the proto
was != 0, plus follow-up register_prot_hook() with a bound device
to it this time, in order to limit traffic we get.

In the case when the protocol and user supplied device index (== 0)
does not change from socket() to bind(), we can spare us doing
the same work twice. Similarly for re-binding to the same device
and protocol. For these scenarios, we can decrease create/bind
latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
with this patch.

Alternatively, for the first case, if people care, they should
simply create their sockets with proto == 0 argument and define
the protocol during bind() as this saves a call to synchronize_net()
as well (sock-bind-3 case).

In all other cases, we're tied to user space behaviour we must not
change, also since a bind() is not strictly required. Thus, we need
the synchronize_net() to make sure no asynchronous packet processing
paths still refer to the previous elements of po->prot_hook.

In case of mmap()ed sockets, the workflow that includes bind() is
socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
{__unregister, register}_prot_hook is being called from setsockopt()
in order to install the new protocol receive handler. Thus, when
we call bind and can skip a re-hook, we have already previously
installed the new handler. For fanout, this is handled different
entirely, so we should be good.

Timings on an i7-3520M machine:

  * sock-bind-1:   89 us
  * sock-bind-2: 7447 us
  * sock-bind-3:   75 us

sock-bind-1:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-2:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-3:
  socket(PF_PACKET, SOCK_RAW, 0) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:17:11 -08:00
David S. Miller ec48a7879e i40e: Remove autogenerated Module.symvers file.
Fixes: 9d8bf54 ("i40e: associate VMDq queue with VM type")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-16 16:12:45 -08:00