Commit graph

3733 commits

Author SHA1 Message Date
Al Viro 3a654f975b new helpers: skb_copy_datagram_from_iter() and zerocopy_sg_from_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24 05:03:08 -05:00
David S. Miller 1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Eric Dumazet e7820e39b7 net: Revert "net: avoid one atomic operation in skb_clone()"
Not sure what I was thinking, but doing anything after
releasing a refcount is suicidal or/and embarrassing.

By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu
could have released the last reference and freed the whole skb.

We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set.

Reported-by: Chris Mason <clm@fb.com>
Fixes: ce1a4ea3f1 ("net: avoid one atomic operation in skb_clone()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 15:26:32 -05:00
Jiri Pirko 93515d53b1 net: move vlan pop/push functions into common code
So it can be used from outside of the openvswitch code.
Made a couple of cosmetic changes along the way, namely variable naming and
adding support for the 8021AD proto.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:18 -05:00
Jiri Pirko e21951212f net: move make_writable helper into common code
Note that skb_make_writable already exists in net/netfilter/core.c
but does something slightly different.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko 5968250c86 vlan: introduce *vlan_hwaccel_push_inside helpers
Use them to push skb->vlan_tci into the payload and avoid code
duplication.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko 62749e2cb3 vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto
The name fits better. Plus, __vlan_insert_tag is going to be
introduced later on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Al Viro 08adb7dabd fold verify_iovec() into copy_msghdr_from_user()
... and do the same on the compat side of things.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:49 -05:00
Al Viro 0844932009 {compat_,}verify_iovec(): switch to generic copying of iovecs
use {compat_,}rw_copy_check_uvector().  As a result, we are
guaranteed that all iovecs seen in ->msg_iov by ->sendmsg()
and ->recvmsg() will pass access_ok().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 16:23:16 -05:00
Markus Elfring ef87c5d6a1 net: pktgen: Deletion of an unnecessary check before the function call "proc_remove"
The proc_remove() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:20:15 -05:00
Fabian Frederick 54aeba7f06 dev_ioctl: use sizeof(x) instead of sizeof x
Also remove spaces after cast.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:27:32 -05:00
Fabian Frederick e56f735913 net/core: include linux/types.h instead of asm/types.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:26:32 -05:00
Fabian Frederick 1d2398dc7c net: fix spelling for synchronized
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:26:32 -05:00
Eric Dumazet 960fb622f8 net: provide a per host RSS key generic infrastructure
RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40- or 52-byte
RSS key.

Some drivers use a constant (and well known) key, some drivers use a random
key per port, making bonding setups hard to tune. Well known keys increase
attack surface, considering that the number of queues is usually a power of two.

This patch provides infrastructure to help drivers do the right thing.

netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
even if they provide ethtool -X support to let the user redefine the key later.

A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
RSS key even for drivers not providing ethtool -x support, in case some
applications want to precisely set up flows to match some RX queues.

Tested:

myhost:~# cat /proc/sys/net/core/netdev_rss_key
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
    0:      0     1     2     3     4     5     6     7
RSS hash key:
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-16 15:59:11 -05:00
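The Toeplitz hash named in the RSS key commit above can be sketched in user space as follows. This is only an illustration and not the kernel's or any driver's code: the names `toeplitz_hash` and `ipv4_tcp_input` are hypothetical, the input layout (source address, destination address, source port, destination port) is the conventional RSS layout and is assumed here, and the key is the sample value printed in the commit message.

```python
import socket
import struct

# Sample host key copied from the commit message above (52 bytes).
RSS_KEY = bytes.fromhex((
    "11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28"
    ":9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72"
).replace(":", ""))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    key_bits = len(key) * 8
    data_bits = len(data) * 8
    # Each input bit needs a 32-bit window of the key starting at its index.
    if data_bits and key_bits < data_bits + 31:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(data_bits):
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in key bits [i, i + 32), counted from the most significant bit.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def ipv4_tcp_input(src: str, dst: str, sport: int, dport: int) -> bytes:
    """Build the assumed 12-byte hash input for an IPv4/TCP flow."""
    return socket.inet_aton(src) + socket.inet_aton(dst) + struct.pack("!HH", sport, dport)


flow_hash = toeplitz_hash(RSS_KEY, ipv4_tcp_input("192.0.2.1", "192.0.2.2", 12345, 80))
```

Because the hash is a pure XOR of key windows, anyone who knows the key can predict which flows hash to the same value, which is why the commit argues against well known keys.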
Ingo Molnar e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00
Michal Kubeček fbe168ba91 net: generic dev_disable_lro() stacked device handling
Large receive offloading is known to cause problems if received packets
are passed to another host. Therefore the kernel disables it by calling
dev_disable_lro() whenever a network device is enslaved in a bridge or
forwarding is enabled for it (or globally). For virtual devices we need
to disable LRO on the underlying physical device (which is actually
receiving the packets).

Current dev_disable_lro() code handles this propagation for a vlan
(including 802.1ad nested vlan), macvlan or a vlan on top of a macvlan.
It doesn't handle other stacked devices and their combinations, in
particular propagation from a bond to its slaves which often causes
problems in virtualization setups.

As we now have generic data structures describing the upper-lower device
relationship, dev_disable_lro() can be generalized to disable LRO also
for all lower devices (if any) once it is disabled for the device
itself.

For bonding and teaming devices, it is necessary to disable LRO not only
on current slaves at the moment when dev_disable_lro() is called but
also on any slave (port) added later.

v2: use lower device links for all devices (including vlan and macvlan)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 14:48:56 -05:00
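The propagation described in the dev_disable_lro() commit above can be modelled with a small recursive walk over lower devices. This is a toy model under stated assumptions, not the kernel code: `NetDevice`, `disable_lro` and `add_lower` are hypothetical names, and the "slave added later" case is reduced to a check at attach time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NetDevice:
    name: str
    lro: bool = True
    lower: List["NetDevice"] = field(default_factory=list)


def disable_lro(dev: NetDevice) -> None:
    """Disable LRO on dev, then on every device stacked below it."""
    dev.lro = False
    for lower_dev in dev.lower:
        disable_lro(lower_dev)


def add_lower(upper: NetDevice, lower: NetDevice) -> None:
    """Attach lower under upper; a late-added port inherits the disabled state."""
    upper.lower.append(lower)
    if not upper.lro:
        disable_lro(lower)


eth0, eth1 = NetDevice("eth0"), NetDevice("eth1")
bond0 = NetDevice("bond0", lower=[eth0])
vlan10 = NetDevice("bond0.10", lower=[bond0])

disable_lro(vlan10)      # reaches bond0 and eth0 through the lower links
add_lower(bond0, eth1)   # eth1 joins later and gets LRO disabled as well
```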
WANG Cong d7480fd3b1 neigh: remove dynamic neigh table registration support
Currently there are only three neigh tables in the whole kernel:
arp table, ndisc table and decnet neigh table. What's more,
we don't support registering multiple tables per family.
Therefore we can just make these tables statically built-in.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 15:23:54 -05:00
Joe Perches ba7a46f16d net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_<LEVEL> uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not enabled at all unless
DEBUG is defined or dynamic_debug is enabled.  Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings".  For backward compatibility,
the sysctl is not removed, but it has no function.  The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 14:10:31 -05:00
Eric Dumazet 2c8c56e15d net: introduce SO_INCOMING_CPU
An alternative to RPS/RFS is to use hardware support for multiple
queues.

Then split a set of millions of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.

Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.

We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering in the socket structure which cpu delivered the last
packet is enough to solve the problem.

After accept(), connect(), or even file descriptor passing between
processes, applications can use:

 int cpu;
 socklen_t len = sizeof(cpu);

 getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);

And use this information to put the socket into the right silo
for optimal performance, as the whole networking stack should run
on the appropriate cpu, without the need to send IPIs (RPS/RFS).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 13:00:06 -05:00
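The getsockopt() call from the SO_INCOMING_CPU commit above, together with the "right silo" step, might look like this from a scripting language. It is a hedged sketch: `incoming_cpu` and `pick_worker` are hypothetical helpers, the fallback value 49 is the constant used by most Linux architectures and is an assumption for interpreters that do not export `socket.SO_INCOMING_CPU`, and the modulo mapping is just one possible policy.

```python
import socket

# Assumption: 49 is the asm-generic value; some architectures differ.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def incoming_cpu(sock):
    """Return the cpu recorded for the socket, or None if unsupported."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
    except OSError:
        return None


def pick_worker(cpu, n_workers):
    """Map a cpu number to a worker silo; fall back to worker 0."""
    if cpu is None or cpu < 0:
        return 0
    return cpu % n_workers
```

A server would call `incoming_cpu()` on each accepted connection and hand the socket to the worker returned by `pick_worker()`, with each worker pinned to the cpu that services the matching RX queue.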
Eric Dumazet 3b47d30396 net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.

Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.

To reach big GRO packets on 10Gbe NIC, one can use :

ethtool -C eth0 rx-usecs 4 rx-frames 44

But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.

Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.

This patch uses a different strategy: let the GRO stack decide what to do,
based on traffic pattern.

Packets with Push flag won't be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.

This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.

To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().

Tested:
 Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)

Without this feature, we send back about 305,000 ACKs per second.

GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)

Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)

Receiver performs fewer calls to upper stacks, fewer wakeups.
This also reduces cpu usage on the sender, as it receives fewer ACK
packets.

Note that reducing the number of wakeups increases cpu efficiency, but can
decrease QPS, as applications won't have the chance to warm up cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50

B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 12:05:59 -05:00
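The aggregation ratios quoted in the GRO flush timer commit above (811/305 and 811/19.2) can be reproduced from the two sar samples. The sketch assumes the standard `sar -n DEV` column order, where the first two numbers after the interface name are received and transmitted packets per second, and it treats ACKs sent as a proxy for GRO packets delivered; `segments_per_ack` is a hypothetical helper name.

```python
def segments_per_ack(rx_pkts_per_sec: float, tx_pkts_per_sec: float) -> float:
    """Approximate GRO aggregation as received segments per ACK sent."""
    return rx_pkts_per_sec / tx_pkts_per_sec


# Values taken from the two sar samples in the commit message.
before = segments_per_ack(811269.80, 305732.30)  # no flush timer
after = segments_per_ack(811577.30, 19230.80)    # gro_flush_timeout = 2000
ack_reduction = 305732.30 / 19230.80

print(f"before: {before:.2f} segments per ACK")
print(f"after:  {after:.2f} segments per ACK")
print(f"ACK rate reduced by a factor of {ack_reduction:.1f}")
```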
Herbert Xu bfe1be38fc net: Kill skb_copy_datagram_const_iovec
Now that both macvtap and tun are using skb_copy_datagram_iter, we
can kill the abomination that is skb_copy_datagram_const_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07 12:13:34 -05:00
Herbert Xu a8f820aa40 inet: Add skb_copy_datagram_iter
This patch adds skb_copy_datagram_iter, which is identical to
skb_copy_datagram_iovec except that it operates on iov_iter
instead of iovec.

Eventually all users of skb_copy_datagram_iovec should switch
over to iov_iter and then we can remove skb_copy_datagram_iovec.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07 12:13:34 -05:00
Simon Horman 25cd9ba0ab openvswitch: Add basic MPLS support to kernel
Allow datapath to recognize and extract MPLS labels into flow keys
and execute actions which push, pop, and set labels on packets.

Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

Cc: Ravi K <rkerur@gmail.com>
Cc: Leo Alterman <lalterman@nicira.com>
Cc: Isaku Yamahata <yamahata@valinux.co.jp>
Cc: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05 23:52:33 -08:00
Pravin B Shelar 59b93b41e7 net: Remove MPLS GSO feature.
A device can export MPLS GSO support in dev->mpls_features the same way
it exports vlan features in dev->vlan_features. So it is safe to
remove the redundant NETIF_F_GSO_MPLS flag.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05 23:52:33 -08:00
David S. Miller 51f3d02b98 net: Add and use skb_copy_datagram_msg() helper.
This encapsulates all of the skb_copy_datagram_iovec() callers
with call argument signature "skb, offset, msghdr->msg_iov, length".

When we move to iov_iters in the networking, the iov_iter object will
sit in the msghdr.

Having a helper like this means there will be fewer places to touch
during that transformation.

Based upon descriptions and patch from Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 16:46:40 -05:00
Tom Herbert e585f23636 udp: Changes to udp_offload to support remote checksum offload
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote
checksum offload being done (in this case inner checksum must not
be offloaded to the NIC).

Added logic in __skb_udp_tunnel_segment to handle remote checksum
offload case.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 16:30:03 -05:00
Peter Zijlstra ff960a7317 netdev, sched/wait: Fix sleeping inside wait event
rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a
wait loop. The wait loop relies on current->state to function, but so
does mutex_lock(), nesting them makes for the inner to destroy the
outer state.

Fix this using the new wait_woken() bits.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: sfeldma@cumulusnetworks.com <sfeldma@cumulusnetworks.com>
Cc: stephen hemminger <stephen@networkplumber.org>
Cc: Tom Gundersen <teg@jklm.no>
Cc: Tom Herbert <therbert@google.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-04 07:17:48 +01:00
Eric Dumazet d75b1ade56 net: less interrupt masking in NAPI
net_rx_action() can mask irqs a single time to transfer sd->poll_list
into a private list, for a very short duration.

Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irq again in slow path.

This patch removes 2 pairs of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.

Note this also allows giving control back to the caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be woken up earlier under pressure.

This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.

Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-03 12:25:09 -05:00
David S. Miller 55b42b5ca2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/marvell.c

Simple overlapping changes in drivers/net/phy/marvell.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-01 14:53:27 -04:00
Guenter Roeck e0fb6fb6d5 net: ethtool: Return -EOPNOTSUPP if user space tries to read EEPROM with length 0
If a driver supports reading EEPROM but no EEPROM is installed in the system,
the driver's get_eeprom_len function returns 0. ethtool will subsequently
try to read that zero-length EEPROM anyway. If the driver does not support
EEPROM access at all, this operation will return -EOPNOTSUPP. If the driver
does support EEPROM access but no EEPROM is installed, the operation will
return -EINVAL. Return -EOPNOTSUPP in both cases for consistency.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31 16:12:34 -04:00
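The rule introduced by the ethtool EEPROM commit above can be written as a single check. This is a user-space model and not the kernel function: `eeprom_read_precheck` is a hypothetical name, and the two callbacks stand in for the driver's get_eeprom and get_eeprom_len operations.

```python
import errno


def eeprom_read_precheck(get_eeprom, get_eeprom_len) -> int:
    """Return 0 if an EEPROM read may proceed, else a negative errno."""
    # No EEPROM support in the driver at all.
    if get_eeprom is None or get_eeprom_len is None:
        return -errno.EOPNOTSUPP
    # Driver supports EEPROM access but reports a zero-length EEPROM:
    # report the same error instead of -EINVAL, for consistency.
    if get_eeprom_len() == 0:
        return -errno.EOPNOTSUPP
    return 0
```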
Nicolas Dichtel 75fbfd3323 neigh: optimize neigh_parms_release()
In neigh_parms_release() we loop over all entries to find the entry given as
an argument so that we can remove it from the list. By using a doubly linked
list, we can avoid this loop.

Here are some numbers with 30 000 dummy interfaces configured:

Before the patch:
$ time rmmod dummy
real	2m0.118s
user	0m0.000s
sys	1m50.048s

After the patch:
$ time rmmod dummy
real	1m9.970s
user	0m0.000s
sys	0m47.976s

Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 16:11:50 -04:00
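The idea behind the neigh_parms_release() commit above is that a doubly linked entry can be unlinked without searching for its predecessor. The following sketch illustrates that with a circular list and a sentinel head; `Node`, `list_add`, `list_del` and `names` are hypothetical names for this model, not the kernel's list API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        # A lone node points at itself in both directions.
        self.prev = self
        self.next = self


def list_add(new, head):
    """Insert new right after head (circular list with a sentinel head)."""
    new.prev, new.next = head, head.next
    head.next.prev = new
    head.next = new


def list_del(entry):
    """Unlink entry in O(1): only its two neighbours are touched."""
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = entry.next = entry


def names(head):
    """Walk the list from the head and collect entry names."""
    out, node = [], head.next
    while node is not head:
        out.append(node.name)
        node = node.next
    return out


head = Node("head")
entries = [Node(f"dummy{i}") for i in range(4)]
for entry in entries:
    list_add(entry, head)

list_del(entries[1])  # no scan over the other entries is needed
```

With a singly linked list, each removal has to walk from the head to find the previous entry, so tearing down N entries costs on the order of N squared steps, which matches the long rmmod time measured before the patch.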
Eric Dumazet bc9ad166e3 net: introduce napi_schedule_irqoff()
napi_schedule() can be called from any context and has to mask hard
irqs.

Add a variant that can only be called from hard interrupt handlers
or when irqs are already masked.

Many NIC drivers can use it from their hard IRQ handler instead of
generic variant.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 16:07:27 -04:00
Toshiaki Makita 432c856fcf net: skb_segment() should preserve backpressure
This patch generalizes commit d6a4a10411 ("tcp: GSO should be TSQ
friendly") to protocols using skb_set_owner_w()

TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
as explained in commit 6ff50cd555 ("tcp: gso: do not generate out of
order packets")

This allows UDP sockets using UFO to get proper backpressure,
thus avoiding qdisc drops and excessive cpu usage.

Here are performance test results (macvlan on vlan):

- Before
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      144096 1224195    1258.56
212992           60.00          51              0.45

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.23      0.00     25.26      0.08      0.00     74.43

- After
# netperf -t UDP_STREAM ...
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00      109593      0     957.20
212992           60.00      109593            957.20

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.18      0.00      8.38      0.02      0.00     91.43

[edumazet] Rewrote patch and changelog.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 14:47:19 -04:00
Eric Dumazet 93a35f59f1 net: napi_reuse_skb() should check pfmemalloc
Do not reuse skb if it was pfmemalloc tainted, otherwise
future frame might be dropped anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-26 22:47:23 -04:00
Karl Beldan a63ba13eec net: tso: fix unaligned access to crafted TCP header in helper API
The crafted header start address is from a driver supplied buffer, which
one can reasonably expect to be aligned on a 4-byte boundary.
However ATM the TSO helper API is only used by ethernet drivers and
the tcp header will then be aligned to only a 2-byte boundary from the
header start address.

Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-22 12:52:55 -04:00
Florian Westphal f993bc25e5 net: core: handle encapsulation offloads when computing segment lengths
If ->encapsulation is set we have to use inner_tcp_hdrlen and add the
size of the inner network headers too.

This is 'mostly harmless'; tbf might send skb that is slightly over
quota or drop skb even if it would have fit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-20 12:38:13 -04:00
Linus Torvalds 2e923b0251 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Include fixes for netrom and dsa (Fabian Frederick and Florian
    Fainelli)

 2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.

 3) Several SKB use after free fixes (vxlan, openvswitch,
    ip_tunnel, fou), from Li RongQing.

 4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.

 5) Use after free in virtio_net, from Michael S Tsirkin.

 6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
    Shelar.

 7) ISDN gigaset and capi bug fixes from Tilman Schmidt.

 8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.

 9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.

10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
    Wang and Eric Dumazet.

11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
    chunks, from Daniel Borkmann.

12) Don't call sock_kfree_s() with NULL pointers, this function also has
    the side effect of adjusting the socket memory usage.  From Cong Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  bna: fix skb->truesize underestimation
  net: dsa: add includes for ethtool and phy_fixed definitions
  openvswitch: Set flow-key members.
  netrom: use linux/uaccess.h
  dsa: Fix conversion from host device to mii bus
  tipc: fix bug in bundled buffer reception
  ipv6: introduce tcp_v6_iif()
  sfc: add support for skb->xmit_more
  r8152: return -EBUSY for runtime suspend
  ipv4: fix a potential use after free in fou.c
  ipv4: fix a potential use after free in ip_tunnel_core.c
  hyperv: Add handling of IP header with option field in netvsc_set_hash()
  openvswitch: Create right mask with disabled megaflows
  vxlan: fix a free after use
  openvswitch: fix a use after free
  ipv4: dst_entry leak in ip_send_unicast_reply()
  ipv4: clean up cookie_v4_check()
  ipv4: share tcp_v4_save_options() with cookie_v4_check()
  ipv4: call __ip_options_echo() in cookie_v4_check()
  atm: simplify lanai.c by using module_pci_driver
  ...
2014-10-18 09:31:37 -07:00
Tom Herbert 04ffcb255f net: Add ndo_gso_check
Add ndo_gso_check which a device can define to indicate whether it
is capable of doing GSO on a packet. This function would be called from
the stack to determine whether software GSO is needed to be done. A
driver should populate this function if it advertises GSO types for
which there are combinations that it wouldn't be able to handle. For
instance a device that performs UDP tunneling might only implement
support for transparent Ethernet bridging type of inner packets
or might have limitations on lengths of inner headers.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 12:11:00 -04:00
Linus Torvalds 0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
David S. Miller e53da5fbfc net: Trap attempts to call sock_kfree_s() with a NULL pointer.
Unlike normal kfree() it is never right to call sock_kfree_s() with
a NULL pointer, because sock_kfree_s() also has the side effect of
discharging the memory from the sockets quota.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-14 17:02:37 -04:00
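The reasoning in the sock_kfree_s() commit above is that the free path also adjusts the socket's memory accounting, so a NULL pointer is never a harmless no-op. The sketch below models that with a counter. All names are hypothetical stand-ins, and raising an exception is only this model's way of trapping the bad call; how the kernel reports it is not stated in the commit message.

```python
class Sock:
    def __init__(self):
        # Bytes of option memory currently charged to this socket.
        self.omem_alloc = 0


def sock_kmalloc(sk: Sock, size: int) -> bytearray:
    """Allocate memory and charge it to the socket's quota."""
    sk.omem_alloc += size
    return bytearray(size)


def sock_kfree_s(sk: Sock, mem, size: int) -> None:
    """Free memory and discharge it from the socket's quota."""
    if mem is None:
        # Trap the bad call before it can corrupt the accounting.
        raise ValueError("sock_kfree_s() called with a NULL pointer")
    sk.omem_alloc -= size


sk = Sock()
buf = sock_kmalloc(sk, 64)
sock_kfree_s(sk, buf, 64)  # quota returns to zero
```

Without the check, a call with a NULL pointer and a nonzero size would still subtract from the quota, leaving the counter lower than the memory actually charged.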
Eric Dumazet 4c450583d9 net: fix races in page->_count manipulation
It is illegal to use atomic_set(&page->_count, ...) even if we 'own'
the page. Other entities in the kernel need to use get_page_unless_zero()
to get a reference to the page before testing page properties, so we could
lose a refcount increment.

The only case where it is valid is when page->_count is 0.

Fixes: 540eb7bf0b ("net: Update alloc frag to reduce get/put page usage and recycle pages")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10 15:37:29 -04:00
Alexander Duyck 5af7fb6e3e flow-dissector: Fix alignment issue in __skb_flow_get_ports
This patch addresses a kernel unaligned access bug seen on a sparc64 system
with an igb adapter.  Specifically the __skb_flow_get_ports was returning a
be32 pointer which was then having the value directly returned.

In order to prevent this it is actually easier to simply not populate the
ports or address values when an skb is not present.  In this case the
assumption is that the data isn't needed and rather than slow down the
faster aligned accesses by making them have to assume the unaligned path on
architectures that don't support efficient unaligned access it makes more
sense to simply switch off the bits that were copying the source and
destination address/port for the case where we only care about the protocol
types and lengths which are normally 16 bit fields anyway.

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10 15:33:47 -04:00
Li RongQing 8ea6e345a6 net: filter: fix the comments
1. sk_run_filter has been renamed, sk_filter() is using SK_RUN_FILTER.
2. Remove wrong comments about storing intermediate value.
3. Replace sk_run_filter with __bpf_prog_run for check_load_and_stores's
comments

Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10 15:11:51 -04:00
Masanari Iida de3f0d0eff net: Missing @ before descriptions cause make xmldocs warning
This patch fixes the following warnings.
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'header_len'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'data_len'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'max_page_order'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'errcode'
Warning(.//net/core/skbuff.c:4142): No description found for parameter 'gfp_mask'

Actually the descriptions exist, but the "@" in front is missing.

This problem started to happen when the following commit was merged
into Linus's tree during the 3.18-rc1 merge period.
commit 2e4e441071
net: add alloc_skb_with_frags() helper

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-09 18:57:14 -04:00
Linus Torvalds 35a9ad8af0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Most notable changes in here:

   1) By far the biggest accomplishment, thanks to a large range of
      contributors, is the addition of multi-send for transmit.  This is
      the result of discussions back in Chicago, and the hard work of
      several individuals.

      Now, when the ->ndo_start_xmit() method of a driver sees
      skb->xmit_more as true, it can choose to defer the doorbell
      telling the driver to start processing the new TX queue entries.

      skb->xmit_more means that the generic networking is guaranteed to
      call the driver immediately with another SKB to send.

      There is logic added to the qdisc layer to dequeue multiple
      packets at a time, and the handling of mis-predicted offloads in
      software is now done with no locks held.

      Finally, pktgen is extended to have a "burst" parameter that can
      be used to test a multi-send implementation.

      Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
      virtio_net

      Adding support is almost trivial, so expect more drivers to
      support this optimization soon.

      I want to thank, in no particular or implied order, Jesper
      Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
      Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
      David Tat, Hannes Frederic Sowa, and Rusty Russell.

   2) PTP and timestamping support in bnx2x, from Michal Kalderon.

   3) Allow adjusting the rx_copybreak threshold for a driver via
      ethtool, and add rx_copybreak support to enic driver.  From
      Govindarajulu Varadarajan.

   4) Significant enhancements to the generic PHY layer and the bcm7xxx
      driver in particular (EEE support, auto power down, etc.) from
      Florian Fainelli.

   5) Allow raw buffers to be used for flow dissection, allowing drivers
      to determine the optimal "linear pull" size for devices that DMA
      into pools of pages.  The objective is to get exactly the
      necessary amount of headers into the linear SKB area pre-pulled,
      but no more.  The new interface drivers use is eth_get_headlen().
      From WANG Cong, with driver conversions (several had their own
      by-hand duplicated implementations) by Alexander Duyck and Eric
      Dumazet.

   6) Support checksumming more smoothly and efficiently for
      encapsulations, and add "foo over UDP" facility.  From Tom
      Herbert.

   7) Add Broadcom SF2 switch driver to DSA layer, from Florian
      Fainelli.

   8) eBPF now can load programs via a system call and has an extensive
      testsuite.  Alexei Starovoitov and Daniel Borkmann.

   9) Major overhaul of the packet scheduler to use RCU in several major
      areas such as the classifiers and rate estimators.  From John
      Fastabend.

  10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
      Duyck.

  11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
      Dumazet.

  12) Add Datacenter TCP congestion control algorithm support, from
      Florian Westphal.

  13) Reorganize sk_buff so that __copy_skb_header() is significantly
      faster.  From Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
  netlabel: directly return netlbl_unlabel_genl_init()
  net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
  net: description of dma_cookie cause make xmldocs warning
  cxgb4: clean up a type issue
  cxgb4: potential shift wrapping bug
  i40e: skb->xmit_more support
  net: fs_enet: Add NAPI TX
  net: fs_enet: Remove non NAPI RX
  r8169:add support for RTL8168EP
  net_sched: copy exts->type in tcf_exts_change()
  wimax: convert printk to pr_foo()
  af_unix: remove 0 assignment on static
  ipv6: Do not warn for informational ICMP messages, regardless of type.
  Update Intel Ethernet Driver maintainers list
  bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
  tipc: fix bug in multicast congestion handling
  net: better IFF_XMIT_DST_RELEASE support
  net/mlx4_en: remove NETDEV_TX_BUSY
  3c59x: fix bad split of cpu_to_le32(pci_map_single())
  net: bcmgenet: fix Tx ring priority programming
  ...
2014-10-08 21:40:54 -04:00
Linus Torvalds d0cd84817c dmaengine-3.17
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
    maintainer update"
 
 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
    7787380336 "net_dma: mark broken"), without reports of performance
    regression.
 
 3/ Miscellaneous fixes

Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine updates from Dan Williams:
 "Even though this has fixes marked for -stable, given the size and the
  needed conflict resolutions this is 3.18-rc1/merge-window material.

  These patches have been languishing in my tree for a long while.  The
  fact that I do not have the time to do proper/prompt maintenance of
  this tree is a primary factor in the decision to step down as
  dmaengine maintainer.  That and the fact that the bulk of drivers/dma/
  activity is going through Vinod these days.

  The net_dma removal has not been in -next.  It has developed simple
  conflicts against mainline and net-next (for-3.18).

  Continuing thanks to Vinod for staying on top of drivers/dma/.

  Summary:

   1/ Step down as dmaengine maintainer see commit 08223d80df
      "dmaengine maintainer update"

   2/ Removal of net_dma, as it has been marked 'broken' since 3.13
      (commit 7787380336 "net_dma: mark broken"), without reports of
      performance regression.

   3/ Miscellaneous fixes"

* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  net: make tcp_cleanup_rbuf private
  net_dma: revert 'copied_early'
  net_dma: simple removal
  dmaengine maintainer update
  dmatest: prevent memory leakage on error path in thread
  ioat: Use time_before_jiffies()
  dmaengine: fix xor sources continuation
  dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
  dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
  dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
  ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
  drivers: dma: Include appropriate header file in dca.c
  drivers: dma: Mark functions as static in dma_v3.c
  dma: mv_xor: Add DMA API error checks
  ioat/dca: Use dev_is_pci() to check whether it is pci device
2014-10-07 20:39:25 -04:00
Eric Dumazet 0287587884 net: better IFF_XMIT_DST_RELEASE support
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.

Current handling of IFF_XMIT_DST_RELEASE is not optimal.

Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y

The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.

As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.

This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.

Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :

	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
	netif_keep_dst(dev);

Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.

The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.
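
The two-bit scheme described above can be sketched in a small user-space model. This is only one plausible reading of the commit message, not the kernel code: the flag names, bit values, struct toy_dev, and both helpers are hypothetical stand-ins for the real priv_flags bits and for netif_keep_dst().

```c
#include <stddef.h>

/* Hypothetical bit values and names, chosen only for this illustration. */
#define TOY_DST_RELEASE      (1u << 0)	/* may be rebuilt by a stacking driver */
#define TOY_DST_RELEASE_PERM (1u << 1)	/* once cleared, never set again */

struct toy_dev {
	unsigned int priv_flags;
};

/* Model of a "keep dst" helper: the driver needs skb_dst() at transmit time. */
static void toy_keep_dst(struct toy_dev *dev)
{
	dev->priv_flags &= ~(TOY_DST_RELEASE | TOY_DST_RELEASE_PERM);
}

/* Model of a bonding/team-style master recomputing its own release bit. */
static void toy_rebuild_master(struct toy_dev *master,
			       const struct toy_dev *slaves, size_t n)
{
	unsigned int both = TOY_DST_RELEASE | TOY_DST_RELEASE_PERM;
	unsigned int common = both;

	/* Keep only the bits that every lower device still has set. */
	for (size_t i = 0; i < n; i++)
		common &= slaves[i].priv_flags;

	/* The rebuildable bit is set only if no lower device vetoed it. */
	master->priv_flags &= ~TOY_DST_RELEASE;
	if (common == both)
		master->priv_flags |= TOY_DST_RELEASE;
}
```

In this model a single lower device calling toy_keep_dst() is enough to stop the master from releasing the dst early, which is the blocking behaviour the message attributes to the permanent bit.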

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07 13:22:11 -04:00
WANG Cong 02c0fc1b8f net_sched: fix unused variables in __gnet_stats_copy_basic_cpu()
Probably not a big deal, but we'd better just use the
one we get in retry loop.

Fixes: commit 22e0f8b932 ("net: sched: make bstats per cpu and estimator RCU safe")
Reported-by: Joe Perches <joe@perches.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07 00:10:49 -04:00
Eric Dumazet 1ff0dc9499 net: validate_xmit_vlan() is static
Marking this as static allows compiler to inline it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06 18:17:17 -04:00
Eric Dumazet fcbeb976d7 net: introduce netdevice gso_min_segs attribute
Some TSO engines might have a too heavy setup cost, that impacts
performance on hosts sending small bursts (2 MSS per packet).

This patch adds a device gso_min_segs, allowing drivers to set
a minimum number of segments for TSO packets, according to the NIC
performance.

Tested on a mlx4 NIC, this allows to get a ~110% increase of
throughput when sending 2 MSS per packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06 17:56:28 -04:00
Eric Dumazet 1255a50554 ethtool: Ethtool parameter to dynamically change tx_copybreak
Use new ethtool [sg]et_tunable() to set tx_copybreak (inline threshold)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06 01:04:16 -04:00
Eric Dumazet bec3cfdca3 net: skb_segment() provides list head and tail
It's unfortunate we have to walk the skb list again to find the tail
after segmentation, even if data is probably hot in cpu caches.

skb_segment() can store the tail of the list into segs->prev,
and validate_xmit_skb_list() can immediately get the tail.
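
The idea of stashing the tail in the head's prev pointer can be sketched as follows. This is a toy model under stated assumptions: struct toy_skb, toy_segment() and toy_list_tail() are hypothetical and only mimic the next/prev linkage, not the real sk_buff or skb_segment().

```c
#include <stddef.h>

/* Hypothetical minimal packet node; the real sk_buff has many more fields. */
struct toy_skb {
	struct toy_skb *next;
	struct toy_skb *prev;
};

/* Link n nodes into a NULL-terminated chain and stash the tail in head->prev. */
static struct toy_skb *toy_segment(struct toy_skb *nodes, size_t n)
{
	if (n == 0)
		return NULL;

	for (size_t i = 0; i < n; i++) {
		nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
		nodes[i].prev = NULL;
	}

	/* The producer already knows the tail here, so recording it is free. */
	nodes[0].prev = &nodes[n - 1];
	return &nodes[0];
}

/* O(1) tail lookup for the consumer, instead of walking ->next to the end. */
static struct toy_skb *toy_list_tail(struct toy_skb *segs)
{
	return segs->prev;
}
```

The design choice is that the producer pays nothing extra, while the consumer avoids a second pass over a list that may hold dozens of segments.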

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06 00:37:30 -04:00
David S. Miller 61b37d2f54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains another batch with Netfilter/IPVS updates
for net-next, they are:

1) Add abstracted ICMP codes to the nf_tables reject expression. We
   introduce four reasons to reject using ICMP that overlap in IPv4
   and IPv6 from the semantic point of view. This should simplify the
   maintenance of dual stack rule-sets through the inet table.

2) Move nf_send_reset() functions from header files to per-family
   nf_reject modules, suggested by Patrick McHardy.

3) We have to use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) everywhere in the
   code now that br_netfilter can be modularized. Convert remaining spots
   in the network stack code.

4) Use rcu_barrier() in the nf_tables module removal path to ensure that
   we don't leave objects that are still pending to be released via
   call_rcu (that may likely result in a crash).

5) Remove incomplete arch 32/64 compat from nft_compat. The original (bad)
   idea was to probe the word size based on the xtables match/target info
   size, but this assumption is wrong when you have to dump the information
   back to userspace.

6) Allow to filter from prerouting and postrouting in the nf_tables bridge.
   In order to emulate the ebtables NAT chains (which are actually simple
   filter chains with no special semantics), we have to support filtering from
   these hooks too.

7) Add explicit module dependency between xt_physdev and br_netfilter.
   This provides a way to detect if the user needs br_netfilter from
   the configuration path. This should reduce the breakage of the
   br_netfilter modularization.

8) Cleanup coding style in ip_vs.h, from Simon Horman.

9) Fix crash in the recently added nf_tables masq expression. We have
   to register/unregister the notifiers to clean up the conntrack table
   entries from the module init/exit path, not from the rule addition /
   deletion path. From Arturo Borrero.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-05 21:32:37 -04:00
Vijay Subramanian c8753d55af net: Cleanup skb cloning by adding SKB_FCLONE_FREE
SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb.
1: If skb is allocated from head_cache, it indicates fclone is not available.
2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates
it is available to be used.

To avoid confusion for case 2 above, this patch replaces
SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
companion skbs, this indicates it is free for use.

SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
cannot / will not have a companion fclone.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-04 20:34:25 -04:00
Eric Dumazet 01291202ed net: do not export skb_gro_receive()
skb_gro_receive() is only called from tcp_gro_receive() which is
not in a module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 15:54:30 -07:00
Eric Dumazet 55a93b3ea7 qdisc: validate skb without holding lock
Validation of skb can be pretty expensive :

GSO segmentation and/or checksum computations.

We can do this without holding qdisc lock, so that other cpus
can queue additional packets.

Trick is that requeued packets were already validated, so we carry
a boolean so that sch_direct_xmit() can validate a fresh skb list,
or directly use an old one.

Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
host.

Turning TSO on or off had no effect on throughput, only few more cpu
cycles. Lock contention on qdisc lock disappeared.

Same if disabling TX checksum offload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03 15:36:11 -07:00
David S. Miller 739e4a758e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	net/netfilter/nfnetlink.c

Both r8152 and nfnetlink conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-02 11:25:43 -07:00
Pablo Neira Ayuso 1109a90c01 netfilter: use IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
In 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
the bridge netfilter code has been modularized.

Use IS_ENABLED instead of ifdef to cover the module case.
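
The difference between the two tests can be sketched in plain preprocessor terms. Kconfig defines CONFIG_FOO for a built-in option and CONFIG_FOO_MODULE for a modular one; the option name CONFIG_TOY_FEATURE and the TOY_* macros below are hypothetical, and the second test only approximates what IS_ENABLED() evaluates.

```c
/* Simulate a Kconfig option built as a module (=m); Kbuild would define this. */
#define CONFIG_TOY_FEATURE_MODULE 1

/* A plain #ifdef only sees the built-in (=y) spelling of the option. */
#ifdef CONFIG_TOY_FEATURE
#define TOY_IFDEF_SEES_FEATURE 1
#else
#define TOY_IFDEF_SEES_FEATURE 0
#endif

/* An IS_ENABLED()-style test accepts both the built-in and modular spellings. */
#if defined(CONFIG_TOY_FEATURE) || defined(CONFIG_TOY_FEATURE_MODULE)
#define TOY_ENABLED_SEES_FEATURE 1
#else
#define TOY_ENABLED_SEES_FEATURE 0
#endif
```

With the option set to =m, code guarded by the plain #ifdef silently disappears from the core, which is the case the commit covers.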

Fixes: 34666d4 ("netfilter: bridge: move br_netfilter out of the core")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-10-02 18:30:54 +02:00
Alexei Starovoitov 38b2cf2982 net: pktgen: packet bursting via skb->xmit_more
This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)

burst=1 is the default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
        2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
        3rd copy with xmit_more=false

Performance with ixgbe (usec 30):
burst=1  tx:9.2 Mpps
burst=2  tx:13.5 Mpps
burst=3  tx:14.5 Mpps full 10G line rate
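
The burst semantics listed above can be modeled with a toy transmit ring. This is a sketch under stated assumptions, not pktgen or ixgbe code: struct toy_tx_ring, toy_start_xmit() and toy_send_burst() are hypothetical, and the "doorbell" counter stands in for the hardware tail pointer write.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a NIC TX ring; not the kernel's data structure. */
struct toy_tx_ring {
	unsigned int queued;	/* descriptors written since the last doorbell */
	unsigned int doorbells;	/* number of (expensive) tail pointer writes */
	unsigned int sent;	/* descriptors handed over to the hardware */
};

/* Model of a driver transmit routine that honors an xmit_more hint. */
static void toy_start_xmit(struct toy_tx_ring *ring, bool xmit_more)
{
	ring->queued++;
	if (!xmit_more) {
		/* Last packet of the burst: one doorbell covers all of them. */
		ring->doorbells++;
		ring->sent += ring->queued;
		ring->queued = 0;
	}
}

/* Send `burst` copies: all but the last one carry xmit_more = true. */
static void toy_send_burst(struct toy_tx_ring *ring, unsigned int burst)
{
	for (unsigned int i = 0; i < burst; i++)
		toy_start_xmit(ring, i + 1 < burst);
}
```

In this model burst=3 costs one doorbell for three packets, which is the saving the throughput numbers above are meant to demonstrate.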

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:08:12 -04:00
Eric Dumazet ce1a4ea3f1 net: avoid one atomic operation in skb_clone()
Fast clone cloning can actually avoid an atomic_inc(), if we
guarantee prior clone_ref value is 1.

This requires a change to kfree_skbmem(), to perform the
atomic_dec_and_test() on clone_ref before setting fclone to
SKB_FCLONE_UNAVAILABLE.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:27:23 -04:00
Eric Dumazet d0bf4a9e92 net: cleanup and document skb fclone layout
Let's use a proper structure to clearly document and implement
skb fast clones.

Then, we might experiment more easily with alternative layouts.

This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
to stop leaking of implementation details.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 16:34:25 -04:00
John Fastabend b0ab6f9275 net: sched: enable per cpu qstats
After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
John Fastabend 6401585366 net: sched: restrict use of qstats qlen
This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().

The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passing to user space. By
handling it explicitly we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.

It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
John Fastabend 22e0f8b932 net: sched: make bstats per cpu and estimator RCU safe
In order to run qdiscs without locking, statistics and estimators
need to be handled correctly.

To resolve bstats, make the statistics per cpu. Because this is
only needed for qdiscs that run without locks, which will not be
the case for most qdiscs in the near future, only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.
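
The per-cpu approach can be sketched as one counter slot per CPU that a reader sums on demand. This is a simplified model: TOY_NR_CPUS, struct toy_bstats and both helpers are hypothetical, and the sketch omits the synchronization a real reader needs to get a consistent snapshot.

```c
#include <stdint.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count for the sketch */

struct toy_bstats {
	uint64_t bytes;
	uint32_t packets;
};

struct toy_qdisc {
	struct toy_bstats cpu_bstats[TOY_NR_CPUS];
};

/* Writer side: each CPU touches only its own slot, so no shared lock. */
static void toy_bstats_update(struct toy_qdisc *q, unsigned int cpu,
			      uint32_t len)
{
	q->cpu_bstats[cpu].bytes += len;
	q->cpu_bstats[cpu].packets++;
}

/* Reader side (dump or estimator): fold all per-cpu slots into one total. */
static struct toy_bstats toy_bstats_read(const struct toy_qdisc *q)
{
	struct toy_bstats total = { 0, 0 };

	for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		total.bytes += q->cpu_bstats[cpu].bytes;
		total.packets += q->cpu_bstats[cpu].packets;
	}
	return total;
}
```

The cost moves from the hot path (every packet) to the rare reader, which is why the estimator code paths have to change along with the counters.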

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30 01:02:26 -04:00
Eric Dumazet 73d3fe6d1c gro: fix aggregation for skb using frag_list
In commit 8a29111c7c ("net: gro: allow to build full sized skb")
I added a regression for linear skb that traditionally force GRO
to use the frag_list fallback.

Erez Shitrit found that at most two segments were aggregated and
the "if (skb_gro_len(p) != pinfo->gso_size)" test was failing.

This is because pinfo at this spot still points to the last skb in the
chain, instead of the first one, where we find the correct gso_size
information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 8a29111c7c ("net: gro: allow to build full sized skb")
Reported-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29 15:17:59 -04:00
Eric Dumazet b193722731 net: reorganize sk_buff for faster __copy_skb_header()
With proliferation of bit fields in sk_buff, __copy_skb_header() became
quite expensive, showing as the most expensive function in a GSO
workload.

__copy_skb_header() performance is also critical for non GSO TCP
operations, as it is used from skb_clone()

This patch carefully moves all the fields that were not copied in a
separate zone : cloned, nohdr, fclone, peeked, head_frag, xmit_more

Then I moved all other fields and all other copied fields in a section
delimited by headers_start[0]/headers_end[0] section so that we
can use a single memcpy() call, inlined by compiler using long
word load/stores.
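
The marker technique can be sketched on a toy structure. The field selection below is hypothetical and far smaller than sk_buff; the zero-length arrays are a GNU C extension (accepted by gcc and clang), and toy_copy_header() is an illustrative stand-in, not the kernel's __copy_skb_header().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the layout; the real sk_buff is much larger. */
struct toy_skb {
	void *next;			/* not copied: list linkage */
	unsigned char cloned;		/* not copied: per-instance state */

	unsigned int headers_start[0];	/* zero-size marker (GNU C extension) */
	unsigned int mark;
	unsigned short protocol;
	unsigned short queue_mapping;
	unsigned int headers_end[0];	/* zero-size marker (GNU C extension) */

	void *data;			/* not copied here */
};

/* Copy every field between the two markers with a single memcpy(). */
static void toy_copy_header(struct toy_skb *dst, const struct toy_skb *src)
{
	const size_t start = offsetof(struct toy_skb, headers_start);
	const size_t len = offsetof(struct toy_skb, headers_end) - start;

	memcpy((char *)dst + start, (const char *)src + start, len);
}
```

Because the length is a compile-time constant, the compiler is free to expand the memcpy() into a handful of word-sized loads and stores, and a field added between the markers is copied without touching the helper.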

I also tried to make all copies in the natural orders of sk_buff,
to help hardware prefetching.

I made sure sk_buff size did not change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29 12:27:20 -04:00
Dan Williams 7bced39751 net_dma: simple removal
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.

This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.

Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28 07:05:16 -07:00
Eric Dumazet ff04a771ad net : optimize skb_release_data()
Cache skb_shinfo(skb) in a variable to avoid computing it multiple
times.

Reorganize the tests to remove one indentation level.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 16:53:49 -04:00
LEROY Christophe 58e3cac561 net: optimise inet_proto_csum_replace4()
csum_partial() is a generic function which is not optimised for small fixed
length calculations, and its use requires to store "from" and "to" values in
memory while we already have them available in registers. This also has impact,
especially on RISC processors. In the same spirit as the change done by
Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4()
taking into account RFC1624.

I spotted during a NATted tcp transfer that csum_partial() is one of top 5
consuming functions (around 8%), and the second user of csum_partial() is
inet_proto_csum_replace4().
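
The RFC 1624 update rule, HC' = ~(~HC + ~m + m'), can be sketched for a 32-bit field as below. This is a minimal user-space illustration, not the kernel implementation: fold16(), csum_full() and csum_replace32() are hypothetical helpers that work on host-order 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: full Internet checksum over an array of 16-bit words. */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~fold16(sum);
}

/* RFC 1624 eqn. 3 applied to one 32-bit field: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_replace32(uint16_t check, uint32_t from, uint32_t to)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~(from >> 16);		/* ~m, high word */
	sum += (uint16_t)~(from & 0xffff);	/* ~m, low word */
	sum += to >> 16;			/* m', high word */
	sum += to & 0xffff;			/* m', low word */
	return (uint16_t)~fold16(sum);
}
```

Only five additions and a fold are needed regardless of packet length, and the old and new values stay in registers, which is the saving the message describes for NAT rewrites.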

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 16:14:17 -04:00
Eric Dumazet f4a775d144 net: introduce __skb_header_release()
While profiling TCP stack, I noticed one useless atomic operation
in tcp_sendmsg(), caused by skb_header_release().

It turns out all current skb_header_release() users have a fresh skb,
that no other user can see, so we can avoid one atomic operation.

Introduce __skb_header_release() to clearly document this.

This gave me a 1.5 % improvement on TCP_RR workload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:40:06 -04:00
Joe Perches 6ea754eb76 net: Change netdev_<level> logging functions to return void
No caller or macro uses the return value so make all
the functions return void.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:17:17 -04:00
Tom Herbert 53e5039896 net: Remove gso_send_check as an offload callback
The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 00:22:47 -04:00
David S. Miller 1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
Jason Wang cecda693a9 net: keep original skb which only needs header checking during software GSO
Commit ce93718fb7 ("net: Don't keep
around original SKB when we software segment GSO frames") frees the
original skb after software GSO even for dodgy gso skbs. This breaks
the stream throughput from untrusted sources, since only header
checking was done during software GSO instead of a true
segmentation. This patch fixes this by freeing the original gso skb
only when it was really segmented by software.

Fixes ce93718fb7 ("net: Don't keep
around original SKB when we software segment GSO frames.")

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-22 14:57:08 -04:00
Eric Dumazet 2e4e441071 net: add alloc_skb_with_frags() helper
Extract from sock_alloc_send_pskb() code building skb with frags,
so that we can reuse this in other contexts.

Intent is to use it from tcp_send_rcvq(), tcp_collapse(), ...

We also want to replace some skb_linearize() calls to a more reliable
strategy in pathological cases where we need to reduce number of frags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-19 16:25:23 -04:00
Eric Dumazet e93a0435f8 tcp: allow segment with FIN in tcp_try_coalesce()
We can allow a segment with FIN to be aggregated,
if we take care to add tcp flags,
and if skb_try_coalesce() takes care of zero sized skbs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15 14:41:07 -04:00
Alexander Y. Fomichev 7ce64c79c4 net: fix creation adjacent device symlinks
__netdev_adjacent_dev_insert may add an adjacent device from a different
net namespace; without a proper check this leads to broken sysfs links
from/to devices in another namespace.
Fix: rewrite netdev_adjacent_is_neigh_list macro as a function,
     move net_eq check into netdev_adjacent_is_neigh_list.
     (thanks David)
     related to: 4c75431ac3

Signed-off-by: Alexander Fomichev <git.user@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-15 14:24:53 -04:00
Sasha Levin c0d1379a19 net: bpf: correctly handle errors in sk_attach_filter()
Commit "net: bpf: make eBPF interpreter images read-only" has changed bpf_prog
to be vmalloc()ed but never handled some of the error paths of the old code.

On error within sk_attach_filter (which userspace can easily trigger), we'd
kfree() the vmalloc()ed memory, and leak the internal bpf_work_struct.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 17:37:49 -04:00
Hannes Frederic Sowa 233577a220 net: filter: constify detection of pkt_type_offset
Currently we have 2 pkt_type_offset functions doing the same thing and
spread across the architecture files. Remove those and replace them
with a PKT_TYPE_OFFSET macro helper which gets the constant value from a
zero sized sk_buff member right in front of the bitfield with offsetof.
This new offset marker does not change size of struct sk_buff.
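
Since offsetof() cannot be applied to a bitfield, the marker trick can be sketched like this. The struct layout, the marker name and TOY_PKT_TYPE_OFFSET are hypothetical; the zero-length array is a GNU C extension, and which bits of the byte hold pkt_type remains compiler and endianness dependent.

```c
#include <stddef.h>

/* Hypothetical miniature layout; only the marker technique matters here. */
struct toy_skb {
	unsigned short protocol;
	unsigned char pkt_type_marker[0];	/* zero-size marker (GNU C extension) */
	unsigned char pkt_type:3;
	unsigned char fclone:2;
	unsigned char spare:3;
};

/* Compile-time constant byte offset of the bitfield that follows the marker. */
#define TOY_PKT_TYPE_OFFSET offsetof(struct toy_skb, pkt_type_marker)

/* What a JIT would emit as a single byte load at a constant offset. */
static unsigned char toy_pkt_type_byte(const struct toy_skb *skb)
{
	return *((const unsigned char *)skb + TOY_PKT_TYPE_OFFSET);
}
```

The marker occupies no storage, so the struct size is unchanged, and every architecture can share the same constant instead of probing for the offset at run time.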

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 17:07:21 -04:00
WANG Cong 6c555490e0 ipv6: drop useless rcu_read_lock() in anycast
This code is now protected by rtnl lock, rcu read lock
is useless now.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 16:38:42 -04:00
John Fastabend 46e5da40ae net: qdisc: use rcu prefix and silence sparse warnings
Add __rcu notation to qdisc handling; by doing this we can make
smatch output more legible. In any case, some of the cases should
be using rcu_dereference(); see qdisc_all_tx_empty(),
qdisc_tx_changing(), and so on.

Also *wake_queue() API is commonly called from driver timer routines
without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
around netif_wake_subqueue and netif_tx_wake_queue.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13 12:30:25 -04:00
Alexander Duyck bf7fa551e0 mac80211: Resolve sk_refcnt/sk_wmem_alloc issue in wifi ack path
There is a possible issue with the use, or lack thereof of sk_refcnt and
sk_wmem_alloc in the wifi ack status functionality.

Specifically if a socket were to request acknowledgements, and the socket
were to have sk_refcnt drop to 0 resulting in it waiting on sk_wmem_alloc
to reach 0 it would be possible to have sock_queue_err_skb orphan the last
buffer, resulting in __sk_free being called on the socket.  After this the
buffer is enqueued on sk_error_queue, however the queue has already been
flushed resulting in at least a memory leak, if not a data corruption.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-12 17:51:25 -04:00
Alexander Duyck cab41c47d9 skb: Add documentation for skb_clone_sk
This change adds some documentation to the call skb_clone_sk.  This is
meant to help clarify the purpose of the function for other developers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-12 17:51:24 -04:00
Joe Perches 294a0b7f31 pktgen: Convert pr_warning to pr_warn
Use the more common pr_warn.
Realign arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-10 12:40:10 -07:00
Eric Dumazet 416c51e17b netns: remove one sparse warning
net/core/net_namespace.c:227:18: warning: incorrect type in argument 1
(different address spaces)
net/core/net_namespace.c:227:18:    expected void const *<noident>
net/core/net_namespace.c:227:18:    got struct net_generic [noderef]
<asn:4>*gen

We can use rcu_access_pointer() here as read-side access to the pointer
was removed at least one grace period ago.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 20:10:45 -07:00
Daniel Borkmann 286aad3c40 net: bpf: be friendly to kmemcheck
Reported by Mikulas Patocka, kmemcheck currently barks out a
false positive since we don't have special kmemcheck annotation
for bitfields used in bpf_prog structure.

We currently have jited:1, len:31 and thus when accessing len
while CONFIG_KMEMCHECK enabled, kmemcheck throws a warning that
we're reading uninitialized memory.

As we don't need the whole bit universe for pages member, we
can just split it to u16 and use a bool flag for jited instead
of a bitfield.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 16:58:56 -07:00
Eric Dumazet 82d5e2b8b4 net: fix skb_page_frag_refill() kerneldoc
In commit d9b2938aab ("net: attempt a single high order allocation")
I forgot to update kerneldoc, as @prio parameter was renamed to @gfp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-08 14:12:14 -07:00
David S. Miller eb84d6b604 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
2014-09-07 21:41:53 -07:00
Alexander Duyck 56193d1bce net: Add function for parsing the header length out of linear ethernet frames
This patch updates some of the flow_dissector api so that it can be used to
parse the length of ethernet buffers stored in fragments.  Most of the
changes needed were to __skb_get_poff as it needed to be updated to support
sending a linear buffer instead of a skb.

I have split __skb_get_poff into two functions, the first is skb_get_poff
and it retains the functionality of the original __skb_get_poff.  The other
function is __skb_get_poff which now works much like __skb_flow_dissect in
relation to skb_flow_dissect in that it provides the same functionality but
works with just a data buffer and hlen instead of needing an skb.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:47:02 -07:00
Alexander Duyck 82eabd9eb2 net: merge cases where sock_efree and sock_edemux are the same function
Since sock_efree and sock_edemux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots.  As a result we
can avoid a bit of unnecessary code or code duplication.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:43:45 -07:00
Alexander Duyck 62bccb8cdb net-timestamp: Make the clone operation stand-alone from phy timestamping
The phy timestamping takes a different path than the regular timestamping
does in that it will create a clone first so that the packets needing to be
timestamped can be placed in a queue, or the context block could be used.

In order to support these use cases I am pulling the core of the code out
so it can be used in other drivers beyond just phy devices.

In addition I have added a destructor named sock_efree which is meant to
provide a simple way for dropping the reference to skb exceptions that
aren't part of either the receive or send windows for the socket, and I
have removed some duplication in spots where this destructor could be used
in place of sock_edemux.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:43:45 -07:00
Alexander Duyck 37846ef018 net-timestamp: Merge shared code between phy and regular timestamping
This change merges the shared bits that exist between skb_tx_tstamp and
skb_complete_tx_timestamp.  By doing this we can avoid the two diverging as
there were already changes pushed into skb_tx_tstamp that hadn't made it
into the other function.

In addition this resolves issues with the fact that
skb_complete_tx_timestamp was included in linux/skbuff.h even though it was
only compiled in if phy timestamping was enabled.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:43:45 -07:00
Masanari Iida e793c0f70e net: treewide: Fix typo found in DocBook/networking.xml
This patch fixes spelling typos found in DocBook/networking.xml.
Because networking.xml is generated from comments in the source,
the typos have to be fixed in those source comments.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:35:28 -07:00
Govindarajulu Varadarajan f0db9b0734 ethtool: Add generic options for tunables
This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
tunable values from driver.

Add get_tunable and set_tunable to ethtool_ops. Driver implements these
functions for getting/setting tunable value.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 12:12:20 -07:00
Daniel Borkmann e020836d95 dev_ioctl: remove dev_load() CAP_SYS_MODULE message
Marcel reported seeing the following message when autoloading
is being triggered when adding nlmon device:

  Loading kernel module for a network device with
  CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias
  netdev-nlmon instead.

This false-positive happens despite having correct
capabilities set, e.g. through issuing `ip link del dev nlmon`
more than once on a valid device with name nlmon, but Marcel
has also seen it on creation time when no nlmon module is
previously compiled-in or loaded as module and the device
name equals a link type name (e.g. nlmon, vxlan, team).

Stephen says:

  The netdev module alias is a hold over from the past. For
  normal devices, people used to create an alias eth0 and
  point it to the type of network device used, that was back
  in the bad old ISA days before real discovery.

  Also, the tunnels create module alias for the control device
  and ip used to use this to autoload the tunnel device.

  The message is bogus and should just be removed, I also see
  it in a couple of other cases where tap devices are renamed
  for other uses.

As mentioned in 8909c9ad8f ("net: don't allow CAP_NET_ADMIN
to load non-netdev kernel modules"), we nevertheless still
might want to leave the old autoloading behaviour in place
as it could break old scripts, so for now, let's just remove
the log message as Stephen suggests.

Reference: http://thread.gmane.org/gmane.linux.kernel/1105168
Reported-by: Marcel Holtmann <marcel@holtmann.org>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 12:04:40 -07:00
Daniel Borkmann 60a3b2253c net: bpf: make eBPF interpreter images read-only
With eBPF being extended further and exposure to user space on its way,
hardening the memory range the interpreter uses to steer its command flow
seems appropriate.  This patch moves the to be interpreted bytecode to
read-only pages.

In case we execute a corrupted BPF interpreter image for some reason e.g.
caused by an attacker which got past a verifier stage, it would not only
provide arbitrary read/write memory access but arbitrary function calls
as well. After setting up the BPF interpreter image, its contents do not
change until destruction time, thus we can set up the image on pages
made immutable in order to mitigate modifications to that code. The idea
is derived from commit 314beb9bca ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks").

This is possible because bpf_prog is not part of sk_filter anymore.
After setup bpf_prog cannot be altered during its life-time. This prevents
any modifications to the entire bpf_prog structure (incl. function/JIT
image pointer).

Every eBPF program (including classic BPF programs that are migrated) has to call
bpf_prog_select_runtime() to select either interpreter or a JIT image
as a last setup step, and they all are being freed via bpf_prog_free(),
including non-JIT. Therefore, we can easily integrate this into the
eBPF life-time, plus since we directly allocate a bpf_prog, we have no
performance penalty.

Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
inspection of kernel_page_tables.  Brad Spengler proposed the same idea
via Twitter during development of this patch.

Joint work with Hannes Frederic Sowa.

Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 12:02:48 -07:00
Jesper Dangaard Brouer 1f59533f9c qdisc: validate frames going through the direct_xmit path
In commit 50cbe9ab5f ("net: Validate xmit SKBs right when we
pull them out of the qdisc") the validation code was moved out of
dev_hard_start_xmit and into dequeue_skb.

However this overlooked the fact that we do not always enqueue
the skb onto a qdisc. The first situation is when the qdisc has the flag
TCQ_F_CAN_BYPASS and the qdisc is empty.  The second situation is when
there is no qdisc on the device, which is a common case for
software devices.

Originally spotted and initial patch by Alexander Duyck.
As a result Alex was seeing issues trying to connect to a
vhost_net interface after commit 50cbe9ab5f was applied.

Added a call to validate_xmit_skb() in __dev_xmit_skb(), in the
code path for qdiscs with TCQ_F_CAN_BYPASS flag, and in
__dev_queue_xmit() when no qdisc.

Also handle the error situation where dev_hard_start_xmit() could
return a skb list, and does not return dev_xmit_complete(rc) and
falls through to the kfree_skb(), in that situation it should
call kfree_skb_list().

Fixes:  50cbe9ab5f ("net: Validate xmit SKBs right when we pull them out of the qdisc")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-03 20:41:42 -07:00
Nicolas Dichtel ba9989069f rtnl/do_setlink(): notify when a netdev is modified
Depending on which parameters were updated, the changes were not propagated via
the notifier chain and netlink.

The new flag has been set only when the change did not cause a call to the
notifier chain and/or to the netlink notification functions.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 12:57:04 -07:00
Nicolas Dichtel 90c325e3bf rtnl/do_setlink(): last arg is now a set of flags
There are no functional changes in this commit; it only prepares the next one.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 12:57:04 -07:00
Nicolas Dichtel 1889b0e7ef rtnl/do_setlink(): set modified when IFLA_LINKMODE is updated
The only effect of this patch is to print a warning if IFLA_LINKMODE is updated
and a following change fails.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 12:57:04 -07:00
Nicolas Dichtel 5d1180fcac rtnl/do_setlink(): set modified when IFLA_TXQLEN is updated
The only effect of this patch is to print a warning if IFLA_TXQLEN is updated
and a following change fails.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-02 12:57:04 -07:00
Willem de Bruijn 364a9e9324 sock: deduplicate errqueue dequeue
sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.

Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).

This moves signal generation for the next packet forward, which should
be harmless.

It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.

For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).

This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable in between releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 21:49:08 -07:00
Tom Herbert 5a21232983 net: Support for csum_bad in skbuff
This flag indicates that an invalid checksum was detected in the
packet. __skb_mark_checksum_bad helper function was added to set this.

Checksums can be marked bad from a driver or the GRO path (the latter
is implemented in this patch). csum_bad is checked in
__skb_checksum_validate_complete (i.e. calling that when ip_summed ==
CHECKSUM_NONE).

csum_bad works in conjunction with ip_summed value. In the case that
ip_summed is CHECKSUM_NONE and csum_bad is set, this implies that the
first (or next) checksum encountered in the packet is bad. When
ip_summed is CHECKSUM_UNNECESSARY, the first checksum after the last
one validated is bad. For example, if ip_summed == CHECKSUM_UNNECESSARY,
csum_level == 1, and csum_bad is set-- then the third checksum in the
packet is bad. In the normal path, the packet will be dropped when
processing the protocol layer of the bad checksum:
__skb_decr_checksum_unnecessary called twice for the good checksums
changing ip_summed to CHECKSUM_NONE so that
__skb_checksum_validate_complete is called to validate the third
checksum and that will fail since csum_bad is set.
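
The walk-through above can be modelled in plain userspace C. This is only
a sketch: struct csum_state and first_rejected_layer() are hypothetical
stand-ins for the sk_buff fields and the per-protocol receive path, and
software validation is assumed to pass whenever csum_bad is clear.

```c
#include <stdbool.h>

/* Only the two ip_summed states discussed in the text are modelled. */
enum { CHECKSUM_NONE, CHECKSUM_UNNECESSARY };

/* Hypothetical stand-in for the relevant sk_buff fields. */
struct csum_state {
	int ip_summed;
	unsigned int csum_level;
	bool csum_bad;
};

/* Consume one checksum that the device already validated. */
static inline void decr_checksum_unnecessary(struct csum_state *s)
{
	if (s->ip_summed == CHECKSUM_UNNECESSARY) {
		if (s->csum_level == 0)
			s->ip_summed = CHECKSUM_NONE;
		else
			s->csum_level--;
	}
}

/* One protocol layer checking its checksum. */
static inline bool layer_checksum_ok(struct csum_state *s)
{
	if (s->ip_summed == CHECKSUM_UNNECESSARY) {
		decr_checksum_unnecessary(s);
		return true;
	}
	/* Software validation: assumed to pass unless csum_bad is set. */
	return !s->csum_bad;
}

/* 1-based index of the first layer that rejects the packet, 0 if none. */
static inline int first_rejected_layer(int ip_summed, unsigned int csum_level,
				       bool csum_bad, int layers)
{
	struct csum_state s = { ip_summed, csum_level, csum_bad };
	int i;

	for (i = 1; i <= layers; i++)
		if (!layer_checksum_ok(&s))
			return i;
	return 0;
}
```

With ip_summed == CHECKSUM_UNNECESSARY, csum_level == 1 and csum_bad set,
the model rejects the packet at the third layer, matching the example.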

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 21:36:27 -07:00
Jesper Dangaard Brouer afb84b6261 pktgen: add flag NO_TIMESTAMP to disable timestamping
When testing the TX limits of the stack, it is useful to
be able to disable the do_gettimeofday() timestamping on every packet.

This implements a pktgen flag NO_TIMESTAMP which will disable this
call to do_gettimeofday().

On my system (E5-2695) with skb_clone=0, TX performance goes
from 2,423,751 pps to 2,567,165 pps with flag NO_TIMESTAMP. Thus,
the per-packet cost of do_gettimeofday(), i.e. the saving, is approx 23 nanosec.
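
The 23 nanosec figure follows from converting both packet rates to a
per-packet time and subtracting. A small sketch of that arithmetic (the
helper names are made up for illustration and are not part of pktgen):

```c
/* Time spent per packet, in nanoseconds, at a given packets-per-second rate. */
static inline double ns_per_packet(double pps)
{
	return 1e9 / pps;
}

/* Per-packet time saved when the rate rises from pps_before to pps_after. */
static inline double ns_saved_per_packet(double pps_before, double pps_after)
{
	return ns_per_packet(pps_before) - ns_per_packet(pps_after);
}
```

ns_saved_per_packet(2423751.0, 2567165.0) is roughly 412.6 ns minus
389.5 ns, i.e. about 23 ns.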

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 18:06:59 -07:00
David S. Miller 8dcda22a5d net: xmit_list() becomes dev_hard_start_xmit().
Now fundamentally we can process lists of SKBs as cheaply
as single packets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:56 -07:00
David S. Miller ce93718fb7 net: Don't keep around original SKB when we software segment GSO frames.
Just maintain the list properly by returning the head of the remaining
SKB list from dev_hard_start_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:56 -07:00
David S. Miller 50cbe9ab5f net: Validate xmit SKBs right when we pull them out of the qdisc.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:56 -07:00
David S. Miller eae3f88ee4 net: Separate out SKB validation logic from transmit path.
dev_hard_start_xmit() does two things, it first validates and
canonicalizes the SKB, then it actually sends it.

Make a set of helper functions for doing the first part.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
David S. Miller 95f6b3dda2 net: Have xmit_list() signal more==true when appropriate.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
David S. Miller fa2dbdc253 net: Pass a "more" indication down into netdev_start_xmit() code paths.
For now it will always be false.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
David S. Miller 7f2e870f2a net: Move main gso loop out of dev_hard_start_xmit() into helper.
There is a slight policy change happening here as well.

The previous code would drop the entire rest of the GSO skb if any of
them got, for example, a congestion notification.

That makes no sense, anything NET_XMIT_MASK and below is something
like congestion or policing.  And in the congestion case it doesn't
even mean the packet was actually dropped.

Just continue until dev_xmit_complete() evaluates to false.
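
A rough userspace model of the new policy. The return-code values below
are written from memory of include/linux/netdevice.h and should be
treated as assumptions; xmit_segments() is a hypothetical stand-in for
the loop over the segment list, fed with pre-recorded return codes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, mirroring the qdisc and driver transmit return codes. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01
#define NET_XMIT_CN		0x02
#define NET_XMIT_MASK		0x0f
#define NETDEV_TX_BUSY		0x10

/* True when the skb was consumed, even if congestion or policing hit it. */
static inline bool dev_xmit_complete(int rc)
{
	return rc < NET_XMIT_MASK;
}

/*
 * Walk the per-segment return codes and stop only at the first segment
 * that was not consumed. Returns the number of segments handed off.
 */
static inline size_t xmit_segments(const int *rcs, size_t n)
{
	size_t sent = 0;

	while (sent < n && dev_xmit_complete(rcs[sent]))
		sent++;
	return sent;
}
```

In this model a NET_XMIT_CN in the middle of the list no longer ends the
walk; only a code such as NETDEV_TX_BUSY does.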

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
David S. Miller 2ea2551375 net: Create xmit_one() helper for dev_hard_start_xmit()
Hopefully making the code a bit easier to read and digest.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
David S. Miller 10b3ad8c21 net: Do txq_trans_update() in netdev_start_xmit()
That way we don't have to audit every call site to make sure it is
doing this properly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 17:39:55 -07:00
Tom Herbert 662880f442 net: Allow GRO to use and set levels of checksum unnecessary
Allow GRO path to "consume" checksums provided in CHECKSUM_UNNECESSARY
and to report new checksums verified for use in fallback to normal
path.

Change GRO checksum path to track csum_level using a csum_cnt field
in NAPI_GRO_CB. On GRO initialization, if ip_summed is
CHECKSUM_UNNECESSARY set NAPI_GRO_CB(skb)->csum_cnt to
skb->csum_level + 1. For each checksum verified, decrement
NAPI_GRO_CB(skb)->csum_cnt while it's greater than zero. If a checksum
is verified and NAPI_GRO_CB(skb)->csum_cnt == 0, we have verified a
deeper checksum than originally indicated in skbuff so increment
csum_level (or initialize to CHECKSUM_UNNECESSARY if ip_summed is
CHECKSUM_NONE or CHECKSUM_COMPLETE).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29 20:41:11 -07:00
Eric Dumazet d9b2938aab net: attempt a single high order allocation
In commit ed98df3361 ("net: use __GFP_NORETRY for high order
allocations") we tried to address one issue caused by order-3
allocations.

We still observe high latencies and system overhead in situations where
compaction is not successful.

Instead of trying order-3, order-2, and order-1, do a single order-3
best effort and immediately fallback to plain order-0.
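
The strategy can be sketched in userspace as follows. Everything here is
hypothetical: page_alloc_fn stands in for the page allocator, fake_alloc()
simulates fragmentation by failing above a given order, and the may_retry
argument only marks which attempt is the best-effort one.

```c
#include <stdbool.h>
#include <stddef.h>

#define HIGH_ORDER 3

/* Hypothetical allocator hook; returns NULL on failure. */
typedef void *(*page_alloc_fn)(unsigned int order, bool may_retry, void *ctx);

struct frag {
	void *page;
	unsigned int order;
};

/* One best-effort high-order attempt, then straight to order-0. */
static inline bool frag_refill(struct frag *f, page_alloc_fn alloc, void *ctx)
{
	f->page = alloc(HIGH_ORDER, false, ctx);
	if (f->page) {
		f->order = HIGH_ORDER;
		return true;
	}
	/* No order-2 or order-1 attempts in between. */
	f->page = alloc(0, true, ctx);
	f->order = 0;
	return f->page != NULL;
}

/* Fake allocator for illustration: fails every order above *ctx. */
static inline void *fake_alloc(unsigned int order, bool may_retry, void *ctx)
{
	static char backing[1];
	unsigned int max_order = *(unsigned int *)ctx;

	(void)may_retry;
	return order <= max_order ? backing : NULL;
}

/* Order obtained when nothing above max_order can be allocated, or -1. */
static inline int demo_refill_order(unsigned int max_order)
{
	struct frag f;

	if (!frag_refill(&f, fake_alloc, &max_order))
		return -1;
	return (int)f.order;
}
```

When order-3 is unavailable but order-2 would succeed, the model still
ends up with an order-0 page, which is the trade-off described above.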

This mimics slub strategy to fallback to slab min order if the high
order allocation used for performance failed.

Order-3 allocations give a performance boost only if they can be done
without recurring and expensive memory scan.

Quoting David :

The page allocator relies on synchronous (sync light) memory compaction
after direct reclaim for allocations that don't retry and deferred
compaction doesn't work with this strategy because the allocation order
is always decreasing from the previous failed attempt.

This means sync light compaction will always be encountered if memory
cannot be defragmented or reclaimed several times during the
skb_page_frag_refill() iteration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29 20:28:23 -07:00
Daniel Borkmann 10c51b5623 net: add skb_get_tx_queue() helper
Replace occurrences of skb_get_queue_mapping() and follow-up
netdev_get_tx_queue() with an actual helper function.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29 20:02:07 -07:00
Christoph Lameter 903ceff7ca net: Replace get_cpu_var through this_cpu_ptr
Replace uses of get_cpu_var for address calculation through this_cpu_ptr.

Cc: netdev@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:47 -04:00
Michal Kubeček db115037bb net: fix checksum features handling in netif_skb_features()
This is follow-up to

  da08143b85 ("vlan: more careful checksum features handling")

which introduced more careful feature intersection in vlan code,
taking into account that HW_CSUM should be considered superset
of IP_CSUM/IPV6_CSUM. The same is needed in netif_skb_features()
in order to avoid offloading mismatch warning when vlan is
created on top of a bond consisting of slaves supporting IP/IPv6
checksumming but not vlan Tx offloading.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-25 17:23:03 -07:00
WANG Cong 453a940ea7 net: make skb an optional parameter for__skb_flow_dissect()
Fixes: commit 690e36e726 (net: Allow raw buffers to be passed into the flow dissector)
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-25 17:21:26 -07:00
WANG Cong 6451b3f59a net: fix comments for __skb_flow_get_ports()
Fixes: commit 690e36e726 (net: Allow raw buffers to be passed into the flow dissector)
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-25 17:21:26 -07:00
Alexander Y. Fomichev 4c75431ac3 net: prevent of emerging cross-namespace symlinks
Code manipulating sysfs symlinks on adjacent net_devices(s)
currently doesn't take into account that devices potentially
belong to different namespaces.

This patch tries to fix the issue as follows:
- check for net_ns before creating / deleting symlink.
  for now only netdev_adjacent_rename_links and
  __netdev_adjacent_dev_remove are affected, afaics
  __netdev_adjacent_dev_insert implies both net_devs
  belong to the same namespace.
- Drop all existing symlinks to / from all adj_devs before
  switching namespace and recreate them just after.

Signed-off-by: Alexander Y. Fomichev <git.user@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-25 15:17:43 -07:00
David S. Miller 4798248e4e net: Add ops->ndo_xmit_flush()
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 23:02:45 -07:00
Tom Herbert 573e8fca25 net: skb_gro_checksum_* functions
Add skb_gro_checksum_validate, skb_gro_checksum_validate_zero_check,
and skb_gro_checksum_simple_validate, and __skb_gro_checksum_complete.
These are the cognates of the normal checksum functions but are used
in the gro_receive path and operate on GRO related fields in sk_buffs.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-24 18:09:23 -07:00
Daniel Borkmann 8fc54f6891 net: use reciprocal_scale() helper
Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
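
For reference, a userspace rendering of the expression being replaced.
The body follows the open-coded form quoted above; pick_queue() is a
hypothetical caller added only for illustration.

```c
#include <stdint.h>

/*
 * Scale a 32-bit value into the range [0, ep_ro) with a multiply and a
 * shift instead of a modulo.
 */
static inline uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Hypothetical caller: map a flow hash onto one of n_queues queues. */
static inline uint32_t pick_queue(uint32_t hash, uint32_t n_queues)
{
	return reciprocal_scale(hash, n_queues);
}
```

The result is evenly spread only if val itself is well distributed over
the full 32-bit range, which is why it suits hash values.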

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 12:21:21 -07:00
David S. Miller 690e36e726 net: Allow raw buffers to be passed into the flow dissector.
Drivers, and perhaps other entities we have not yet considered,
sometimes want to know how deep the protocol headers go before
deciding how large of an SKB to allocate and how much of the packet to
place into the linear SKB area.

For example, consider a driver which has a device which DMAs into
pools of pages and then tells the driver where the data went in the
DMA descriptor(s).  The driver can then build an SKB and reference
most of the data via SKB fragments (which are page/offset/length
triplets).

However at least some of the front of the packet should be placed into
the linear SKB area, which comes before the fragments, so that packet
processing can get at the headers efficiently.  The first thing each
protocol layer is going to do is a "pskb_may_pull()" so we might as
well aggregate as much of this as possible while we're building the
SKB in the driver.

Part of supporting this is that we don't have an SKB yet, so we want
to be able to let the flow dissector operate on a raw buffer in order
to compute the offset of the end of the headers.

So now we have a __skb_flow_dissect() which takes an explicit data
pointer and length.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23 12:13:41 -07:00
Eric Dumazet 884cf705c7 net: remove dead code after sk_data_ready change
As a followup to commit 676d23690f ("net: Fix use after free by
removing length arg from sk_data_ready callbacks"), we can remove
some useless code in sock_queue_rcv_skb() and rxrpc_queue_rcv_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 21:08:50 -07:00
Eric Dumazet d2de875c6d net: use ktime_get_ns() and ktime_get_real_ns() helpers
ktime_get_ns() replaces ktime_to_ns(ktime_get())

ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 19:57:23 -07:00
Linus Torvalds f0094b28f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several networking final fixes and tidies for the merge window:

   1) Changes during the merge window unintentionally took away the
      ability to build bluetooth modular, fix from Geert Uytterhoeven.

   2) Several phy_node reference count bug fixes from Uwe Kleine-König.

   3) Fix ucc_geth build failures, also from Uwe Kleine-König.

   4) Fix klog false positives when netlink messages go to network
      taps, by properly resetting the network header.  Fix from Daniel
      Borkmann.

   5) Sizing estimate of VF netlink messages is too small, from Jiri
      Benc.

   6) New APM X-Gene SoC ethernet driver, from Iyappan Subramanian.

   7) VLAN untagging is erroneously dependent upon whether the VLAN
      module is loaded or not, but there are generic dependencies that
      matter wrt what can be expected as the SKB enters the stack.
      Make the basic untagging generic code, and do it unconditionally.
      From Vlad Yasevich.

   8) xen-netfront only has so many slots in its transmit queue so
      linearize packets that have too many frags.  From Zoltan Kiss.

   9) Fix suspend/resume PHY handling in bcmgenet driver, from Florian
      Fainelli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (55 commits)
  net: bcmgenet: correctly resume adapter from Wake-on-LAN
  net: bcmgenet: update UMAC_CMD only when link is detected
  net: bcmgenet: correctly suspend and resume PHY device
  net: bcmgenet: request and enable main clock earlier
  net: ethernet: myricom: myri10ge: myri10ge.c: Cleaning up missing null-terminate after strncpy call
  xen-netfront: Fix handling packets on compound pages with skb_linearize
  net: fec: Support phys probed from devicetree and fixed-link
  smsc: replace WARN_ON() with WARN_ON_SMP()
  xen-netback: Don't deschedule NAPI when carrier off
  net: ethernet: qlogic: qlcnic: Remove duplicate object file from Makefile
  wan: wanxl: Remove typedefs from struct names
  m68k/atari: EtherNEC - ethernet support (ne)
  net: ethernet: ti: cpmac.c: Cleaning up missing null-terminate after strncpy call
  hdlc: Remove typedefs from struct names
  airo_cs: Remove typedef local_info_t
  atmel: Remove typedef atmel_priv_ioctl
  com20020_cs: Remove typedef com20020_dev_t
  ethernet: amd: Remove typedef local_info_t
  net: Always untag vlan-tagged traffic on input.
  drivers: net: Add APM X-Gene SoC ethernet driver support.
  ...
2014-08-13 18:27:40 -06:00
Vlad Yasevich 0d5501c1c8 net: Always untag vlan-tagged traffic on input.
Currently the functionality to untag traffic on input resides
as part of the vlan module and is built only when VLAN support
is enabled in the kernel.  When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets.  This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.

There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver.  These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb.  When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.

The following is a packet trace taken on the receiver where a
sender is a VM with a VLAN configured.  The host the VM is running on
does not have VLAN support and the outgoing interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
    10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0

This connection finally times out.

I've only access to the TG3 hardware in this configuration thus have
only tested this with TG3 driver.  There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffer from a similar issue.

The patch attempts to fix this another way.  It moves the vlan header
stripping code out of the vlan module and always builds it into the
kernel network core.  This way, even if vlan is not supported on
a virtualization host, the virtual machines running on top of such
a host will still work with VLANs enabled.

CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-11 12:16:51 -07:00
Linus Torvalds 77e40aae76 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a bunch of small changes built against 3.16-rc6.  The most
  significant change for users is the first patch which makes setns
  dramatically faster by removing unneeded rcu handling.

  The next chunk of changes are so that "mount -o remount,.." will not
  allow the user namespace root to drop flags on a mount set by the
  system wide root.  Aks this forces read-only mounts to stay read-only,
  no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
  mounts to stay no exec and it prevents unprivileged users from messing
  with a mount's atime settings.  I have included my test case as the
  last patch in this series so people performing backports can verify
  this change works correctly.

  The next change fixes a bug in NFS that was discovered while auditing
  nsproxy users for the first optimization.  Today you can oops the
  kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
  with pid namespaces.  I rebased and fixed the build of the
  !CONFIG_NFS_FS case yesterday when a build bot caught my typo.  Given
  that no one to my knowledge bases anything on my tree fixing the typo
  in place seems more responsible than requiring a typo-fix to be
  backported as well.

  The last change is a small semantic cleanup introducing
  /proc/thread-self and pointing /proc/mounts and /proc/net at it.  This
  prevents several kinds of problematic corner cases.  It is a
  user-visible change so it has a minute chance of causing regressions
  so the change to /proc/mounts and /proc/net are individual one line
  commits that can be trivially reverted.  Unfortunately I lost and
  could not find the email of the original reporter so he is not
  credited.  From at least one perspective this change to /proc/net is a
  regression fix to allow pthread /proc/net uses that were broken by
  the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
  proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
  proc: Implement /proc/thread-self to point at the directory of the current thread
  proc: Have net show up under /proc/<tgid>/task/<tid>
  NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
  mnt: Add tests for unprivileged remount cases that have found to be faulty
  mnt: Change the default remount atime from relatime to the existing value
  mnt: Correct permission checks in do_remount
  mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  mnt: Only change user settable mount flags in remount
  namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-09 17:10:41 -07:00
Jiri Benc 945a36761f rtnetlink: fix VF info size
Commit 1d8faf48c7 ("net/core: Add VF link state control") added a new
attribute to IFLA_VF_INFO group in rtnl_fill_ifinfo but did not adjust size
of the allocated memory in if_nlmsg_size/rtnl_vfinfo_size. As a result, we
may trigger warnings in rtnl_getlink and similar functions when many VF
links are enabled, as the information does not fit into the allocated skb.

Fixes: 1d8faf48c7 ("net/core: Add VF link state control")
Reported-by: Yulong Pei <ypei@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-08 10:28:09 -07:00
Linus Torvalds ae045e2455 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Steady transitioning of the BPF infrastructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with initial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
2014-08-06 09:38:14 -07:00
David S. Miller d247b6ab3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/Makefile
	net/ipv6/sysctl_net_ipv6.c

Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.

In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 18:46:26 -07:00
Willem de Bruijn 4ed2d765df net-timestamp: TCP timestamping
TCP timestamping extends SO_TIMESTAMPING to bytestreams.

Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.

The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.

This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.

This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.

If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.

The implementation supports a single timestamp per packet. It silently
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.

Implementation details:

- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.

- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).
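
The relative offset described above can be sketched in userspace. This
is only a model under stated assumptions: the helper names are
hypothetical, the skb key is taken to be the sequence number of the
last byte of the send() buffer, and the socket key is taken to be a
base sequence number captured when the option was enabled. It is not
the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: key stored with the data for one send() that
 * covers bytes [write_seq, write_seq + len). It names the last byte.
 */
static uint32_t model_tskey_for_send(uint32_t write_seq, uint32_t len)
{
	assert(len > 0);              /* an empty send has no last byte */
	return write_seq + len - 1;   /* unsigned arithmetic wraps mod 2^32 */
}

/*
 * Hypothetical model: value reported in ee_data. Subtracting the
 * socket's base key hides the absolute sequence number from userspace.
 */
static uint32_t model_relative_offset(uint32_t skb_tskey, uint32_t sk_tskey)
{
	return skb_tskey - sk_tskey;  /* stays correct across seqno wraparound */
}
```

Because both values are unsigned 32 bit, the difference stays a plain
byte offset even when the sequence space wraps between the base and
the timestamped byte.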

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn e7fd288538 net-timestamp: SCHED timestamp on entering packet scheduler
Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn 09c2d251b7 net-timestamp: add key to disambiguate concurrent datagrams
Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to uniquely
correlate looped data with the original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.
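
The counter behaviour can be illustrated with a small userspace model.
All names below are hypothetical and the model assumes that every
send() has software tx timestamp generation enabled; it only mirrors
the rules stated above, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model, not the kernel's struct sock. */
struct model_sock {
	bool     opt_id;   /* models SOF_TIMESTAMPING_OPT_ID being set */
	uint32_t next_id;  /* per-socket counter behind ee_data */
};

/* Models setsockopt(): the counter restarts when the flag goes off -> on. */
static void model_set_opt_id(struct model_sock *sk, bool enable)
{
	assert(sk != NULL);
	if (enable && !sk->opt_id)
		sk->next_id = 0;
	sk->opt_id = enable;
}

/* Models one send(); returns the value that would appear in ee_data. */
static uint32_t model_send(struct model_sock *sk)
{
	assert(sk != NULL);
	if (!sk->opt_id)
		return 0;          /* existing applications keep seeing 0 */
	return sk->next_id++;      /* ID is fixed at send time, then bumped */
}
```

An ID handed out before the flag is toggled keeps its value, which is
why IDs can appear out of order if the timestamp stream is not
quiesced first.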

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn b9f40e21ef net-timestamp: move timestamp flags out of sk_flags
sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.

Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:54 -07:00
Willem de Bruijn f24b9be595 net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].

1) define struct scm_timestamping explicitly and

2) add support for new tstamp types. On tx, scm_timestamping always
   accompanies a sock_extended_err. Define previously unused field
   ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
   define the existing behavior.

The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.
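
As a rough illustration, the explicit struct and the ee_info check can
be modelled as below. The model_ names are hypothetical stand-ins, and
the numeric value chosen for the send type is an assumption made for
this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for struct scm_timestamping: the former implicit timespec[3]. */
struct model_scm_timestamping {
	struct timespec ts[3];
};

/* Stand-in for SCM_TSTAMP_SND; the value 0 is an assumption here. */
enum { MODEL_TSTAMP_SND = 0 };

/*
 * Returns ts[0] when ee_info (from the accompanying sock_extended_err)
 * marks it as a send timestamp, or NULL for any type this sketch does
 * not know about.
 */
static const struct timespec *
model_snd_tstamp(const struct model_scm_timestamping *tss, uint32_t ee_info)
{
	assert(tss != NULL);
	if (ee_info != MODEL_TSTAMP_SND)
		return NULL;
	return &tss->ts[0];
}
```

Checking the type before reading ts[0] lets an application ignore
tstamp types it was not written for instead of misreading them.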

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 16:35:53 -07:00
Linus Torvalds 47dfe4037e Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Mostly changes to get the v2 interface ready.  The core features are
  mostly ready now and I think it's reasonable to expect to drop the
  devel mask in one or two devel cycles at least for a subset of
  controllers.

   - cgroup added a controller dependency mechanism so that block cgroup
     can depend on memory cgroup.  This will be used to finally support
     IO provisioning on the writeback traffic, which is currently being
     implemented.

   - The v2 interface now uses a separate table so that the interface
     files for the new interface are explicitly declared in one place.
     Each controller will explicitly review and add the files for the
     new interface.

   - cpuset is getting ready for the hierarchical behavior which is in
     the similar style with other controllers so that an ancestor's
     configuration change doesn't change the descendants' configurations
     irreversibly and processes aren't silently migrated when a CPU or
     node goes down.

  All the changes are to the new interface and no behavior changed for
  the multiple hierarchies"

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
  cpuset: fix the WARN_ON() in update_nodemasks_hier()
  cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
  cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
  cgroup: distinguish the default and legacy hierarchies when handling cftypes
  cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
  cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
  cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
  cpuset: export effective masks to userspace
  cpuset: allow writing offlined masks to cpuset.cpus/mems
  cpuset: enable onlined cpu/node in effective masks
  cpuset: refactor cpuset_hotplug_update_tasks()
  cpuset: make cs->{cpus, mems}_allowed as user-configured masks
  cpuset: apply cs->effective_{cpus,mems}
  cpuset: initialize top_cpuset's configured masks at mount
  cpuset: use effective cpumask to build sched domains
  cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
  cpuset: update cs->effective_{cpus, mems} when config changes
  cpuset: update cpuset->effective_{cpus,mems} at hotplug
  cpuset: add cs->effective_cpus and cs->effective_mems
  cgroup: clean up sane_behavior handling
  ...
2014-08-04 10:11:28 -07:00
Alexei Starovoitov 7ae457c1e5 net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
struct sk_filter {
	atomic_t        refcnt;
	struct rcu_head rcu;
	struct bpf_prog *prog;
};
and
struct bpf_prog {
        u32                     jited:1,
                                len:31;
        struct sock_fprog_kern  *orig_prog;
        unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                            const struct bpf_insn *filter);
        union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
        };
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
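
The layering of the two run macros can be sketched with a userspace
model. Everything prefixed model_ or MODEL_ is a hypothetical stand-in
(the instruction type and the toy program in particular), so this only
shows the shape of the split, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct model_insn { int imm; };          /* stand-in for struct bpf_insn */

/* Pure BPF part: usable without any socket. */
struct model_bpf_prog {
	unsigned int len;
	unsigned int (*bpf_func)(const void *ctx,
				 const struct model_insn *insns);
	const struct model_insn *insns;
};

/* Socket part: bookkeeping plus a pointer to the program. */
struct model_sk_filter {
	int refcnt;
	struct model_bpf_prog *prog;
};

/* Run a bare program, as the 'unattached' users would. */
#define MODEL_BPF_PROG_RUN(p, ctx)  ((p)->bpf_func((ctx), (p)->insns))
/* Run the program behind a socket filter; it simply forwards. */
#define MODEL_SK_RUN_FILTER(f, ctx) MODEL_BPF_PROG_RUN((f)->prog, (ctx))

/* Toy program: returns the immediate of its first instruction. */
static unsigned int model_return_imm(const void *ctx,
				     const struct model_insn *insns)
{
	(void)ctx;
	assert(insns != NULL);
	return (unsigned int)insns[0].imm;
}
```

Both macros end up calling the same function pointer; the socket
variant only adds one pointer hop through the wrapper.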

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:03:58 -07:00
Alexei Starovoitov 8fb575ca39 net: filter: rename sk_convert_filter() -> bpf_convert_filter()
to indicate that this function is converting classic BPF into eBPF
and not related to sockets

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 4df95ff488 net: filter: rename sk_chk_filter() -> bpf_check_classic()
trivial rename to indicate that this function performs classic BPF checking

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 009937e78a net: filter: rename sk_filter_proglen -> bpf_classic_proglen
trivial rename to better match semantics of macro

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Alexei Starovoitov 278571baca net: filter: simplify socket charging
Attaching a bpf program to a socket involves several socket memory
arithmetic steps, since the size of 'sk_filter' changes when classic BPF
is converted to eBPF. Also, the common path of program creation has to
deal with two ways of freeing the memory.

Simplify the code by delaying socket charging until program is ready and
its size is known

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:37 -07:00
Vlad Yasevich fcdfe3a7fa net: Correctly set segment mac_len in skb_segment().
When performing segmentation, the mac_len value is copied right
out of the original skb.  However, this value is not always set correctly
(like when the packet is VLAN-tagged) and we'll end up copying a bad
value.

One way to demonstrate this is to configure a VM which tags
packets internally and turn off VLAN acceleration on the forwarding
bridge port.  The packets show up corrupt like this:
16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
(0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
        0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
        0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
        0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833 )....q.........3
        0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
        0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
        0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
        0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
        ...

This also leads to awful throughput as GSO packets are dropped and
cause retransmissions.

The solution is to set the mac_len using the values already available
in the new skb.  We've already adjusted all of the header offsets, so we
might as well correctly figure out the mac_len using skb_reset_mac_len().
After this change, packets are segmented correctly and performance
is restored.
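
A minimal userspace model of the idea, assuming mac_len is simply the
distance between the MAC header and network header offsets. The struct
and helper names are hypothetical stand-ins for the skb fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the three skb fields involved. */
struct model_skb {
	uint16_t mac_header;      /* offset of the L2 header */
	uint16_t network_header;  /* offset of the L3 header */
	uint16_t mac_len;         /* cached L2 header length */
};

/* Recompute mac_len from this skb's own offsets instead of copying it. */
static void model_reset_mac_len(struct model_skb *skb)
{
	assert(skb != NULL);
	assert(skb->network_header >= skb->mac_header);
	skb->mac_len = (uint16_t)(skb->network_header - skb->mac_header);
}
```

For a VLAN-tagged frame the two offsets are 18 bytes apart (14 bytes of
Ethernet header plus a 4 byte tag), so a stale copied value of 14 is
replaced by 18.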

CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31 22:28:39 -07:00
Pablo Neira 34c5bd66e5 net: filter: don't release unattached filter through call_rcu()
sk_unattached_filter_destroy() does not always need to release the
filter object via rcu. Since this filter is never attached to the
socket, the caller should be responsible for releasing the filter
in a safe way, which may not necessarily imply rcu.

This is a short summary of clients of this function:

1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules, these rules
   are removed from the packet path before the filter is released. Thus,
   the framework makes sure the filter is safely removed.

2) In the ppp driver, the ppp_lock ensures serialization between the
   xmit and filter attachment/detachment path. This doesn't use rcu
   so deferred release via rcu makes no sense.

3) In the isdn/ppp driver, it is called from isdn_ppp_release() and
   isdn_ppp_ioctl(). This driver uses mutex and spinlocks, no rcu.
   Thus, deferred rcu makes no sense to me either, the deferred releases
   may be just masking the effects of wrong locking strategy, which
   should be fixed in the driver itself.

4) In the team driver, this is the only place where the rcu
   synchronization with unattached filter is used. Therefore, this
   patch introduces synchronize_rcu() which is called from the
   genetlink path to make sure the filter doesn't go away while packets
   are still walking over it. I think we can revisit this once struct
   bpf_prog (that only wraps specific bpf code bits) is in place, then
   add some specific struct rcu_head in the scope of the team driver if
   Jiri thinks this is needed.

Deferred rcu release for unattached filters was originally introduced
in 302d663 ("filter: Allow to create sk-unattached filters").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 19:56:27 -07:00
Thomas Graf 80019d310f net: Remove unlikely() for WARN_ON() conditions
No need for the unlikely(), WARN_ON() and BUG_ON() internally use
unlikely() on the condition.
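
The redundancy can be seen in a simplified userspace model. The
model_ names are hypothetical and the warning is reduced to a counter;
the point is only that the branch hint already lives inside the
helper.

```c
#include <assert.h>

#if defined(__GNUC__)
#define model_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define model_unlikely(x) (!!(x))
#endif

static int model_warn_count;  /* stands in for printing a warning */

/* Model of a WARN_ON-style helper: the hint is applied internally. */
static inline int model_warn_on(int condition)
{
	int ret = !!condition;

	if (model_unlikely(ret))
		model_warn_count++;
	return model_unlikely(ret);
}

static int model_check(int bad)
{
	/* Writing model_unlikely(model_warn_on(bad)) here adds nothing. */
	if (model_warn_on(bad))
		return -1;
	return 0;
}
```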

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 17:41:47 -07:00
David S. Miller f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
Eric W. Biederman 728dba3a39 namespaces: Use task_lock and not rcu to protect nsproxy
The synchronous synchronize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.

Upon inspection, nsproxy no longer needs rcu protection for remote reads.
Remote reads are rare.  So optimize for same-process reads and writes
by switching to task_lock instead.

This yields a simpler to understand lock, and a faster setns system call.

In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.

This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d Make access to task's nsproxy lighter
from 2007.  The race this originally fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-29 18:08:50 -07:00
Andrey Ryabinin 40eea803c6 net: sendmsg: fix NULL pointer dereference
Sasha's report:
	> While fuzzing with trinity inside a KVM tools guest running the latest -next
	> kernel with the KASAN patchset, I've stumbled on the following spew:
	>
	> [ 4448.949424] ==================================================================
	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
	> [ 4448.952988] Read of size 2 by thread T19638:
	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
	> [ 4448.961266] Call Trace:
	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
	> [ 4448.988929] ==================================================================

This report means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.

After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains a valid socket address structure.

This bug was introduced in f3d3342602
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
	 affect sendto as it would bail out earlier while trying to copy-in the
	 address."
But in fact this affects sendto when address 0 is mapped and contains
a socket address structure. In that case the copy-in of the address will
succeed, and verify_iovec() will exit successfully with
msg->msg_namelen > 0 and msg->msg_name == NULL.

This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
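
The invariant being restored can be sketched as follows. The struct
and helper are hypothetical userspace stand-ins that keep only the two
fields discussed here; where exactly the kernel applies the check is
not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two msghdr fields involved. */
struct model_msghdr {
	void *msg_name;     /* address pointer, may be NULL */
	int   msg_namelen;  /* length of the address */
};

/* A NULL name must never be paired with a non-zero length. */
static void model_fix_namelen(struct model_msghdr *m)
{
	assert(m != NULL);
	if (m->msg_name == NULL)
		m->msg_namelen = 0;
}
```

With the length forced to 0, later code that only tests msg_namelen
has no reason to dereference the NULL msg_name.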

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 12:20:22 -07:00