Commit graph

232 commits

Author SHA1 Message Date
Eric Dumazet 4a7e7c2ad5 netlink: fix races after skb queueing
As soon as an skb is queued into socket receive_queue, another thread
can consume it, so we are not allowed to reference skb anymore, or risk
use after free.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-06 04:21:06 -04:00
Pablo Neira Ayuso 7175c88307 netlink: allow to pass data pointer to netlink_dump_start() callback
This patch allows you to pass a data pointer that can be
accessed from the dump callback.

Netfilter is going to use this patch to provide filtered dumps
to user-space. This is specifically interesting in ctnetlink that
may handle lots of conntrack entries. We can save precious
cycles by skipping the conversion to TLV format of conntrack
entries that are not interesting for user-space.

More specifically, ctnetlink will include one operation to allow
to filter the dumping of conntrack entries by ctmark values.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-26 14:10:44 -05:00
Pablo Neira Ayuso 80d326fab5 netlink: add netlink_dump_control structure for netlink_dump_start()
Davem considers that the argument list of this interface is getting
out of control. This patch tries to address this issue following
his proposal:

struct netlink_dump_control c = { .dump = dump, .done = done, ... };

netlink_dump_start(..., &c);

Suggested by David S. Miller.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-26 14:10:06 -05:00
Denys Vlasenko a46621a3a8 net: Deinline __nlmsg_put and genlmsg_put. -7k code on i386 defconfig.
text	   data	    bss	    dec	    hex	filename
8455963	 532732	1810804	10799499 a4c98b	vmlinux.o.before
8448899	 532732	1810804	10792435 a4adf3	vmlinux.o

This change also removes commented-out copy of __nlmsg_put
which was last touched in 2005 with "Enable once all users
have been converted" comment on top.

Changes in v2: rediffed against net-next.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-30 15:22:06 -05:00
David S. Miller 035c4c16be netlink: Undo const marker in netlink_is_kernel().
We can't do this without propagating the const to nlk_sk()
too, otherwise:

net/netlink/af_netlink.c: In function ‘netlink_is_kernel’:
net/netlink/af_netlink.c:103:2: warning: passing argument 1 of ‘nlk_sk’ discards ‘const’ qualifier from pointer target type [enabled by default]
net/netlink/af_netlink.c:96:36: note: expected ‘struct sock *’ but argument is of type ‘const struct sock *’

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 17:33:03 -05:00
stephen hemminger 2c64580046 netlink: wake up netlink listeners sooner (v2)
This patch changes it to yield sooner at halfway instead. Still not a cure-all
for listener overrun if listner is slow, but works much reliably.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22 22:37:19 -05:00
stephen hemminger b57ef81ff8 netlink: af_netlink cleanup (v2)
Don't inline functions that cover several lines, and do inline
the trivial ones. Also make some arguments const.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22 22:37:19 -05:00
Eric Dumazet 16e5726269 af_unix: dont send SCM_CREDENTIALS by default
Since commit 7361c36c52 (af_unix: Allow credentials to work across
user and pid namespaces) af_unix performance dropped a lot.

This is because we now take a reference on pid and cred in each write(),
and release them in read(), usually done from another process,
eventually from another cpu. This triggers false sharing.

# Events: 154K cycles
#
# Overhead  Command       Shared Object        Symbol
# ........  .......  ..................  .........................
#
    10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
     8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
     7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
     6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
     4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
     4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
     4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
     2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
     2.24%  hackbench  [kernel.kallsyms]   [k] sub_preempt_count
     1.75%  hackbench  [kernel.kallsyms]   [k] fget_light
     1.51%  hackbench  [kernel.kallsyms]   [k]
__mutex_lock_interruptible_slowpath
     1.42%  hackbench  [kernel.kallsyms]   [k] sock_alloc_send_pskb

This patch includes SCM_CREDENTIALS information in a af_unix message/skb
only if requested by the sender, [man 7 unix for details how to include
ancillary data using sendmsg() system call]

Note: This might break buggy applications that expected SCM_CREDENTIAL
from an unaware write() system call, and receiver not using SO_PASSCRED
socket option.

If SOCK_PASSCRED is set on source or destination socket, we still
include credentials for mere write() syscalls.

Performance boost in hackbench : more than 50% gain on a 16 thread
machine (2 quad-core cpus, 2 threads per core)

hackbench 20 thread 2000

4.228 sec instead of 9.102 sec

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-28 13:29:50 -04:00
Eric Dumazet 33d480ce6d net: cleanup some rcu_dereference_raw
RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-12 02:55:28 -07:00
John W. Linville 36099365c7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem
Conflicts:
	drivers/net/wireless/rtlwifi/pci.c
	include/linux/netlink.h
2011-06-24 15:25:51 -04:00
Johannes Berg 670dc2833d netlink: advertise incomplete dumps
Consider the following situation:
 * a dump that would show 8 entries, four in the first
   round, and four in the second
 * between the first and second rounds, 6 entries are
   removed
 * now the second round will not show any entry, and
   even if there is a sequence/generation counter the
   application will not know

To solve this problem, add a new flag NLM_F_DUMP_INTR
to the netlink header that indicates the dump wasn't
consistent, this flag can also be set on the MSG_DONE
message that terminates the dump, and as such above
situation can be detected.

To achieve this, add a sequence counter to the netlink
callback struct. Of course, netlink code still needs
to use this new functionality. The correct way to do
that is to always set cb->seq when a dumpit callback
is invoked and call nl_dump_check_consistent() for
each new message. The core code will also call this
function for the final MSG_DONE message.

To make it usable with generic netlink, a new function
genlmsg_nlhdr() is needed to obtain the netlink header
from the genetlink user header.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2011-06-22 16:09:45 -04:00
Dan Carpenter c63d6ea306 rtnetlink: unlock on error path in netlink_dump()
In c7ac8679be "rtnetlink: Compute and store minimum ifinfo dump
size", we moved the allocation under the lock so we need to unlock
on error path.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-16 23:51:35 -04:00
Greg Rose c7ac8679be rtnetlink: Compute and store minimum ifinfo dump size
The message size allocated for rtnl ifinfo dumps was limited to
a single page.  This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-06-09 20:38:07 -07:00
Dan Rosenberg 71338aa7d0 net: convert %p usage to %pK
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 01:13:12 -04:00
Lai Jiangshan 37b6b935e9 net,rcu: convert call_rcu(listeners_free_rcu) to kfree_rcu()
The rcu callback listeners_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(listeners_free_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:51 -07:00
David S. Miller 0a0e9ae1bd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x.h
2011-03-03 21:27:42 -08:00
Patrick McHardy 01a16b21d6 netlink: kill eff_cap from struct netlink_skb_parms
Netlink message processing in the kernel is synchronous these days,
capabilities can be checked directly in security_netlink_recv() from
the current process.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Reviewed-by: James Morris <jmorris@namei.org>
[chrisw: update to include pohmelfs and uvesafb]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03 13:32:07 -08:00
Patrick McHardy c53fa1ed92 netlink: kill loginuid/sessionid/sid members from struct netlink_skb_parms
Netlink message processing in the kernel is synchronous these days, the
session information can be collected when needed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03 10:55:40 -08:00
Andrey Vagin b44d211e16 netlink: handle errors from netlink_dump()
netlink_dump() may failed, but nobody handle its error.
It generates output data, when a previous portion has been returned to
user space. This mechanism works when all data isn't go in skb. If we
enter in netlink_recvmsg() and skb is absent in the recv queue, the
netlink_dump() will not been executed. So if netlink_dump() is failed
one time, the new data never appear and the reader will sleep forever.

netlink_dump() is called from two places:

1. from netlink_sendmsg->...->netlink_dump_start().
   In this place we can report error directly and it will be returned
   by sendmsg().

2. from netlink_recvmsg
   There we can't report error directly, because we have a portion of
   valid output data and call netlink_dump() for prepare the next portion.
   If netlink_dump() is failed, the socket will be mark as error and the
   next recvmsg will be failed.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-28 12:18:12 -08:00
Eric Dumazet 5c398dc8f5 netlink: fix netlink_change_ngroups()
commit 6c04bb18dd (netlink: use call_rcu for netlink_change_ngroups)
used a somewhat convoluted and racy way to perform call_rcu().

The old block of memory is freed after a grace period, but the rcu_head
used to track it is located in new block.

This can clash if we call two times or more netlink_change_ngroups(),
and a block is freed before another. call_rcu() called on different cpus
makes no guarantee in order of callbacks.

Fix this using a more standard way of handling this : Each block of
memory contains its own rcu_head, so that no 'use after free' can
happens.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Johannes Berg <johannes@sipsolutions.net>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-24 16:25:39 -07:00
David S. Miller b963ea89f0 netlink: Make NETLINK_USERSOCK work again.
Once we started enforcing the a nl_table[] entry exist for
a protocol, NETLINK_USERSOCK stopped working.  Add a dummy
table entry so that it works again.

Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-31 09:51:37 -07:00
Johannes Berg 68d6ac6d27 netlink: fix compat recvmsg
Since
commit 1dacc76d00
Author: Johannes Berg <johannes@sipsolutions.net>
Date:   Wed Jul 1 11:26:02 2009 +0000

    net/compat/wext: send different messages to compat tasks

we had a race condition when setting and then
restoring frag_list. Eric attempted to fix it,
but the fix created even worse problems.

However, the original motivation I had when I
added the code that turned out to be racy is
no longer clear to me, since we only copy up
to skb->len to userspace, which doesn't include
the frag_list length. As a result, not doing
any frag_list clearing and restoring avoids
the race condition, while not introducing any
other problems.

Additionally, while preparing this patch I found
that since none of the remaining netlink code is
really aware of the frag_list, we need to use the
original skb's information for packet information
and credentials. This fixes, for example, the
group information received by compat tasks.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org [2.6.31+, for 2.6.35 revert 1235f504aa]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-18 23:35:58 -07:00
David S. Miller daa3766e70 Revert "netlink: netlink_recvmsg() fix"
This reverts commit 1235f504aa.

It causes regressions worse than the problem it was trying
to fix.  Eric will try to solve the problem another way.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-15 23:21:50 -07:00
Eric Dumazet 1235f504aa netlink: netlink_recvmsg() fix
commit 1dacc76d00
(net/compat/wext: send different messages to compat tasks)
introduced a race condition on netlink, in case MSG_PEEK is used.

An skb given by skb_recv_datagram() might be shared, we must copy it
before any modification, or risk fatal corruption.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-26 13:09:16 -07:00
Neil Horman 70d4bf6d46 drop_monitor: convert some kfree_skb call sites to consume_skb
Convert a few calls from kfree_skb to consume_skb

Noticed while I was working on dropwatch that I was detecting lots of internal
skb drops in several places.  While some are legitimate, several were not,
freeing skbs that were at the end of their life, rather than being discarded due
to an error.  This patch converts those calls sites from using kfree_skb to
consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
drops.  Tested successfully by myself

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20 13:28:05 -07:00
Eric W. Biederman b47030c71d af_netlink: Add needed scm_destroy after scm_send.
scm_send occasionally allocates state in the scm_cookie, so I have
modified netlink_sendmsg to guarantee that when scm_send succeeds
scm_destory will be called to free that state.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-16 14:55:56 -07:00
Eric W. Biederman 910a7e905f netlink: Implment netlink_broadcast_filtered
When netlink sockets are used to convey data that is in a namespace
we need a way to select a subset of the listening sockets to deliver
the packet to.  For the network namespace we have been doing this
by only transmitting packets in the correct network namespace.

For data belonging to other namespaces netlink_bradcast_filtered
provides a mechanism that allows us to examine the destination
socket and to decide if we should transmit the specified packet
to it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 09:37:32 -07:00
David S. Miller 4a35ecf8bf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/via-velocity.c
	drivers/net/wireless/iwlwifi/iwl-agn.c
2010-04-06 23:53:30 -07:00
Changli Gao 6503d96168 net: check the length of the socket address passed to connect(2)
check the length of the socket address passed to connect(2).

Check the length of the socket address passed to connect(2). If the
length is invalid, -EINVAL will be returned.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/bluetooth/l2cap.c | 3 ++-
net/bluetooth/rfcomm/sock.c | 3 ++-
net/bluetooth/sco.c | 3 ++-
net/can/bcm.c | 3 +++
net/ieee802154/af_ieee802154.c | 3 +++
net/ipv4/af_inet.c | 5 +++++
net/netlink/af_netlink.c | 3 +++
7 files changed, 20 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-01 17:26:01 -07:00
Tom Goff 66aa4a55fe netlink: use the appropriate namespace pid
This was included in OpenVZ kernels but wasn't integrated upstream.
>From git://git.openvz.org/pub/linux-2.6.24-openvz:

	commit 5c69402f18adf7276352e051ece2cf31feefab02
	Author: Alexey Dobriyan <adobriyan@openvz.org>
	Date:   Mon Dec 24 14:37:45 2007 +0300

	    netlink: fixup ->tgid to work in multiple PID namespaces

Signed-off-by: Tom Goff <thomas.goff@boeing.com>
Acked-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-26 20:13:58 -07:00
Pablo Neira Ayuso 1a50307ba1 netlink: fix NETLINK_RECV_NO_ENOBUFS in netlink_set_err()
Currently, ENOBUFS errors are reported to the socket via
netlink_set_err() even if NETLINK_RECV_NO_ENOBUFS is set. However,
that should not happen. This fixes this problem and it changes the
prototype of netlink_set_err() to return the number of sockets that
have set the NETLINK_RECV_NO_ENOBUFS socket option. This return
value is used in the next patch in these bugfix series.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-20 14:29:03 -07:00
Masatake YAMATO cf0aa4e07c netlink: Adding inode field to /proc/net/netlink
The Inode field in /proc/net/{tcp,udp,packet,raw,...} is useful to know the types of
file descriptors associated to a process. Actually lsof utility uses the field.
Unfortunately, unlike /proc/net/{tcp,udp,packet,raw,...}, /proc/net/netlink doesn't have the field.
This patch adds the field to /proc/net/netlink.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-28 01:29:49 -08:00
Alexey Dobriyan 974c37e9d8 netlink: fix for too early rmmod
Netlink code does module autoload if protocol userspace is asking for is
not ready. However, module can dissapear right after it was autoloaded.
Example: modprobe/rmmod stress-testing and xfrm_user.ko providing NETLINK_XFRM.

netlink_create() in such situation _will_ create userspace socket and
_will_not_ pin module. Now if module was removed and we're going to call
->netlink_rcv into nothing:

BUG: unable to handle kernel paging request at ffffffffa02f842a
					       ^^^^^^^^^^^^^^^^
	modules are loaded near these addresses here

IP: [<ffffffffa02f842a>] 0xffffffffa02f842a
PGD 161f067 PUD 1623063 PMD baa12067 PTE 0
Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/uevent
CPU 1
Pid: 11515, comm: ip Not tainted 2.6.33-rc5-netns-00594-gaaa5728-dirty #6 P5E/P5E
RIP: 0010:[<ffffffffa02f842a>]  [<ffffffffa02f842a>] 0xffffffffa02f842a
RSP: 0018:ffff8800baa3db48  EFLAGS: 00010292
RAX: ffff8800baa3dfd8 RBX: ffff8800be353640 RCX: 0000000000000000
RDX: ffffffff81959380 RSI: ffff8800bab7f130 RDI: 0000000000000001
RBP: ffff8800baa3db58 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000011
R13: ffff8800be353640 R14: ffff8800bcdec240 R15: ffff8800bd488010
FS:  00007f93749656f0(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffa02f842a CR3: 00000000ba82b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ip (pid: 11515, threadinfo ffff8800baa3c000, task ffff8800bab7eb30)
Stack:
 ffffffff813637c0 ffff8800bd488000 ffff8800baa3dba8 ffffffff8136397d
<0> 0000000000000000 ffffffff81344adc 7fffffffffffffff 0000000000000000
<0> ffff8800baa3ded8 ffff8800be353640 ffff8800bcdec240 0000000000000000
Call Trace:
 [<ffffffff813637c0>] ? netlink_unicast+0x100/0x2d0
 [<ffffffff8136397d>] netlink_unicast+0x2bd/0x2d0

	netlink_unicast_kernel:
		nlk->netlink_rcv(skb);

 [<ffffffff81344adc>] ? memcpy_fromiovec+0x6c/0x90
 [<ffffffff81364263>] netlink_sendmsg+0x1d3/0x2d0
 [<ffffffff8133975b>] sock_sendmsg+0xbb/0xf0
 [<ffffffff8106cdeb>] ? __lock_acquire+0x27b/0xa60
 [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
 [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
 [<ffffffff8106db22>] ? __lock_release+0x82/0x170
 [<ffffffff810a190e>] ? might_fault+0xbe/0xd0
 [<ffffffff810a18c3>] ? might_fault+0x73/0xd0
 [<ffffffff81344c77>] ? verify_iovec+0x47/0xd0
 [<ffffffff8133a509>] sys_sendmsg+0x1a9/0x360
 [<ffffffff813c2be5>] ? _raw_spin_unlock_irqrestore+0x65/0x70
 [<ffffffff8106aced>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff813c2bc2>] ? _raw_spin_unlock_irqrestore+0x42/0x70
 [<ffffffff81197004>] ? __up_read+0x84/0xb0
 [<ffffffff8106ac95>] ? trace_hardirqs_on_caller+0x145/0x190
 [<ffffffff813c207f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8100262b>] system_call_fastpath+0x16/0x1b
Code:  Bad RIP value.
RIP  [<ffffffffa02f842a>] 0xffffffffa02f842a
 RSP <ffff8800baa3db48>
CR2: ffffffffa02f842a

If module was quickly removed after autoloading, return -E.

Return -EPROTONOSUPPORT if module was quickly removed after autoloading.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-03 18:13:43 -08:00
Octavian Purdila 09ad9bc752 net: use net_eq to compare nets
Generated with the following semantic patch

@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)

@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)

applied over {include,net,drivers/net}.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25 15:14:13 -08:00
Johannes Berg 649300b927 netlink: remove subscriptions check on notifier
The netlink URELEASE notifier doesn't notify for
sockets that have been used to receive multicast
but it should be called for such sockets as well
since they might _also_ be used for sending and
not solely for receiving multicast. We will need
that for nl80211 (generic netlink sockets) in the
future.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-17 04:08:49 -08:00
Cyrill Gorcunov 13cfa97bef net: netlink_getname, packet_getname -- use DECLARE_SOCKADDR guard
Use guard DECLARE_SOCKADDR in a few more places which allow
us to catch if the structure copied back is too big.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-10 20:54:41 -08:00
Eric Paris 3f378b6844 net: pass kern to net_proto_family create function
The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace.  This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-05 22:18:14 -08:00
Stephen Hemminger ec1b4cf74c net: mark net_proto_ops as const
All usages of structure net_proto_ops should be declared const.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 01:10:46 -07:00
David S. Miller b7058842c9 net: Make setsockopt() optlen be unsigned.
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-30 16:12:20 -07:00
John Fastabend 5dba93aedf net: fix nlmsg len size for skb when error bit is set.
Currently, the nlmsg->len field is not set correctly in  netlink_ack()
for ack messages that include the nlmsg of the error frame.  This
corrects the length field passed to __nlmsg_put to use the correct
payload size.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-26 20:16:11 -07:00
Johannes Berg b8273570f8 genetlink: fix netns vs. netlink table locking (2)
Similar to commit d136f1bd36,
there's a bug when unregistering a generic netlink family,
which is caught by the might_sleep() added in that commit:

    BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
    in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
    2 locks held by rmmod/1510:
     #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8138283b>] genl_unregister_family+0x2b/0x130
     #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8138270c>] __genl_unregister_mc_group+0x1c/0x120
    Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
    Call Trace:
     [<ffffffff81044ff9>] __might_sleep+0x119/0x150
     [<ffffffff81380501>] netlink_table_grab+0x21/0x100
     [<ffffffff813813a3>] netlink_clear_multicast_users+0x23/0x60
     [<ffffffff81382761>] __genl_unregister_mc_group+0x71/0x120
     [<ffffffff81382866>] genl_unregister_family+0x56/0x130
     [<ffffffffa0007d85>] nl80211_exit+0x15/0x20 [cfg80211]
     [<ffffffffa000005a>] cfg80211_exit+0x1a/0x40 [cfg80211]

Fix in the same way by grabbing the netlink table lock
before doing rcu_read_lock().

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-24 15:44:05 -07:00
Jan Beulich 4481374ce8 mm: replace various uses of num_physpages by totalram_pages
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages.  The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e.  those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:38 -07:00
Johannes Berg d136f1bd36 genetlink: fix netns vs. netlink table locking
Since my commits introducing netns awareness into
genetlink we can get this problem:

BUG: scheduling while atomic: modprobe/1178/0x00000002
2 locks held by modprobe/1178:
 #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8135ee1a>] genl_register_mc_grou
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8135eeb5>] genl_register_mc_g
Pid: 1178, comm: modprobe Not tainted 2.6.31-rc8-wl-34789-g95cb731-dirty #
Call Trace:
 [<ffffffff8103e285>] __schedule_bug+0x85/0x90
 [<ffffffff81403138>] schedule+0x108/0x588
 [<ffffffff8135b131>] netlink_table_grab+0xa1/0xf0
 [<ffffffff8135c3a7>] netlink_change_ngroups+0x47/0x100
 [<ffffffff8135ef0f>] genl_register_mc_group+0x12f/0x290

because I overlooked that netlink_table_grab() will
schedule, thinking it was just the rwlock. However,
in the contention case, that isn't actually true.

Fix this by letting the code grab the netlink table
lock first and then the RCU for netns protection.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-09-14 17:02:50 -07:00
Patrick McHardy 3a6c2b419b netlink: constify nlmsghdr arguments
Consitfy nlmsghdr arguments to a couple of functions as preparation
for the next patch, which will constify the netlink message data in
all nfnetlink users.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-08-25 16:07:40 +02:00
Johannes Berg 1dacc76d00 net/compat/wext: send different messages to compat tasks
Wireless extensions have the unfortunate problem that events
are multicast netlink messages, and are not independent of
pointer size. Thus, currently 32-bit tasks on 64-bit platforms
cannot properly receive events and fail with all kinds of
strange problems, for instance wpa_supplicant never notices
disassociations, due to the way the 64-bit event looks (to a
32-bit process), the fact that the address is all zeroes is
lost, it thinks instead it is 00:00:00:00:01:00.

The same problem existed with the ioctls, until David Miller
fixed those some time ago in an heroic effort.

A different problem caused by this is that we cannot send the
ASSOCREQIE/ASSOCRESPIE events because sending them causes a
32-bit wpa_supplicant on a 64-bit system to overwrite its
internal information, which is worse than it not getting the
information at all -- so we currently resort to sending a
custom string event that it then parses. This, however, has a
severe size limitation we are frequently hitting with modern
access points; this limitation would can be lifted after this
patch by sending the correct binary, not custom, event.

A similar problem apparently happens for some other netlink
users on x86_64 with 32-bit tasks due to the alignment for
64-bit quantities.

In order to fix these problems, I have implemented a way to
send compat messages to tasks. When sending an event, we send
the non-compat event data together with a compat event data in
skb_shinfo(main_skb)->frag_list. Then, when the event is read
from the socket, the netlink code makes sure to pass out only
the skb that is compatible with the task. This approach was
suggested by David Miller, my original approach required
always sending two skbs but that had various small problems.

To determine whether compat is needed or not, I have used the
MSG_CMSG_COMPAT flag, and adjusted the call path for recv and
recvfrom to include it, even if those calls do not have a cmsg
parameter.

I have not solved one small part of the problem, and I don't
think it is necessary to: if a 32-bit application uses read()
rather than any form of recvmsg() it will still get the wrong
(64-bit) event. However, neither do applications actually do
this, nor would it be a regression.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-15 08:53:39 -07:00
Johannes Berg 6c04bb18dd netlink: use call_rcu for netlink_change_ngroups
For the network namespace work in generic netlink I need
to be able to call this function under rcu_read_lock(),
otherwise the locking becomes a nightmare and more locks
would be needed. Instead, just embed a struct rcu_head
(actually a struct listeners_rcu_head that also carries
the pointer to the memory block) into the listeners
memory so we can use call_rcu() instead of synchronising
and then freeing. No rcu_barrier() is needed since this
code cannot be modular.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-12 14:03:24 -07:00
Johannes Berg 487420df79 netlink: remove unused exports
I added those myself in commits b4ff4f04 and 84659eb5,
but I see no reason now why they should be exported,
only generic netlink uses them which cannot be modular.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-12 14:03:19 -07:00
Eric Dumazet 31e6d363ab net: correct off-by-one write allocations reports
commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-18 00:29:12 -07:00
Pablo Neira Ayuso 38938bfe34 netlink: add NETLINK_NO_ENOBUFS socket flag
This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
be used by unicast and broadcast listeners to avoid receiving
ENOBUFS errors.

Generally speaking, ENOBUFS errors are useful to notify two things
to the listener:

a) You may increase the receiver buffer size via setsockopt().
b) You have lost messages, you may be out of sync.

In some cases, ignoring ENOBUFS errors can be useful. For example:

a) nfnetlink_queue: this subsystem does not have any sort of resync
method and you can decide to ignore ENOBUFS once you have set a
given buffer size.

b) ctnetlink: you can use this together with the socket flag
NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
you do not need to resync (packets whose event are not delivered
are drop to provide reliable logging and state-synchronization).

Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
effect in terms of performance which is due to the netlink congestion
control when the listener cannot back off. The effect is the following:

1) throughput rate goes up and netlink messages are inserted in the
receiver buffer.
2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
3) While the listener empties the receiver buffer, netlink keeps
dropping messages. Thus, throughput goes dramatically down.
4) Then, once the listener has emptied the buffer (nlk->state
bit 0 is set off), goto step 1.

This effect is easy to trigger with netlink broadcast under heavy
load, and it is more noticeable when using a big receiver buffer.
You can find some results in [1] that show this problem.

[1] http://1984.lsi.us.es/linux/netlink/

This patch also includes the use of sk_drop to account the number of
netlink messages drop due to overrun. This value is shown in
/proc/net/netlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-24 16:37:55 -07:00
David S. Miller b5bb14386e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2009-03-24 13:24:36 -07:00
Pablo Neira Ayuso dd5b6ce6fd nefilter: nfnetlink: add nfnetlink_set_err and use it in ctnetlink
This patch adds nfnetlink_set_err() to propagate the error to netlink
broadcast listener in case of memory allocation errors in the
message building.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-03-23 13:21:06 +01:00
David S. Miller 508827ff0a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tokenring/tmspci.c
	drivers/net/ucc_geth_mii.c
2009-03-05 02:06:47 -08:00
Pablo Neira Ayuso 4843b93c96 netlink: invert error code in netlink_set_err()
The callers of netlink_set_err() currently pass a negative value
as parameter for the error code. However, sk->sk_err wants a
positive error value. Without this patch, skb_recv_datagram() called
by netlink_recvmsg() may return a positive value to report an error.

Another choice to fix this is to change callers to pass a positive
error value, but this seems a bit inconsistent and error prone
to me. Indeed, the callers of netlink_set_err() assumed that the
(usual) negative value for error codes was fine before this patch :).

This patch also includes some documentation in docbook format
for netlink_set_err() to avoid this sort of confusion.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-03 23:37:30 -08:00
Wei Yongjun 91744f6559 netlink: remove some pointless conditionals before kfree_skb()
Remove some pointless conditionals before kfree_skb().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-26 23:07:34 -08:00
Pablo Neira Ayuso 1ce85fe402 netlink: change nlmsg_notify() return value logic
This patch changes the return value of nlmsg_notify() as follows:

If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.

This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.

This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-24 23:18:28 -08:00
Pablo Neira Ayuso be0c22a46c netlink: add NETLINK_BROADCAST_ERROR socket option
This patch adds NETLINK_BROADCAST_ERROR which is a netlink
socket option that the listener can set to make netlink_broadcast()
return errors in the delivery to the caller. This option is useful
if the caller of netlink_broadcast() do something with the result
of the message delivery, like in ctnetlink where it drops a network
packet if the event delivery failed, this is used to enable reliable
logging and state-synchronization. If this socket option is not set,
netlink_broadcast() only reports ESRCH errors and silently ignore
ENOBUFS errors, which is what most netlink_broadcast() callers
should do.

This socket option is based on a suggestion from Patrick McHardy.
Patrick McHardy can exchange this patch for a beer from me ;).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-20 01:01:08 -08:00
Pablo Neira Ayuso ff491a7334 netlink: change return-value logic of netlink_broadcast()
Currently, netlink_broadcast() reports errors to the caller if no
messages at all were delivered:

1) If, at least, one message has been delivered correctly, returns 0.
2) Otherwise, if no messages at all were delivered due to skb_clone()
   failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.

With this patch, the caller knows if the delivery of any of the
messages to the listeners have failed:

1) If it fails to deliver any message (for whatever reason), return
   -ENOBUFS.
2) Otherwise, if all messages were delivered OK, returns 0.
3) Otherwise, if no listeners, return -ESRCH.

In the current ctnetlink code and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc as this approach reduces
performance (in terms of filtered connections per seconds by a stateful
firewall) but providing reliable logging and event delivery (for
conntrackd) in return.

This patch also changes some clients of netlink_broadcast() that
may report ENOBUFS errors via printk. This error handling is not
of any help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel-side
if they hit ENOBUFS.

BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 23:56:36 -08:00
Eric Dumazet 3755810ceb net: Make sure BHs are disabled in sock_prot_inuse_add()
There is still a call to sock_prot_inuse_add() in af_netlink
while in a preemptable section. Add explicit BH disable around
this call.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 14:05:22 -08:00
David S. Miller 6f756a8c36 net: Make sure BHs are disabled in sock_prot_inuse_add()
The rule of calling sock_prot_inuse_add() is that BHs must
be disabled.  Some new calls were added where this was not
true and this tiggers warnings as reported by Ilpo.

Fix this by adding explicit BH disabling around those call sites.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 17:34:03 -08:00
Eric Dumazet c1fd3b9455 net: af_netlink should update its inuse counter
In order to have relevant information for NETLINK protocol, in
/proc/net/protocols, we should use sock_prot_inuse_add() to
update a (percpu and pernamespace) counter of inuse sockets.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 15:48:22 -08:00
Johannes Berg 95a5afca4a net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-16 15:24:51 -07:00
Alan Cox 113aa838ec net: Rationalise email address: Network Specific Parts
Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-13 19:01:08 -07:00
Ilpo Järvinen 547b792cac net: convert BUG_TRAP to generic WARN_ON
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 21:43:18 -07:00
David S. Miller ea2aca084b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	Documentation/feature-removal-schedule.txt
	drivers/net/wan/hdlc_fr.c
	drivers/net/wireless/iwlwifi/iwl-4965.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2008-07-05 23:08:07 -07:00
Wang Chen 8487460720 netlink: Unneeded local variable
We already have a variable, which has the same capability.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01 19:55:09 -07:00
Denis V. Lunev 9457afee85 netlink: Remove nonblock parameter from netlink_attachskb
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-05 11:23:39 -07:00
Eric Paris 2532386f48 Audit: collect sessionid in netlink messages
Previously I added sessionid output to all audit messages where it was
available but we still didn't know the sessionid of the sender of
netlink messages.  This patch adds that information to netlink messages
so we can audit who sent netlink messages.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-28 06:18:03 -04:00
Linus Torvalds 3925e6fc1f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  security: fix up documentation for security_module_enable
  Security: Introduce security= boot parameter
  Audit: Final renamings and cleanup
  SELinux: use new audit hooks, remove redundant exports
  Audit: internally use the new LSM audit hooks
  LSM/Audit: Introduce generic Audit LSM hooks
  SELinux: remove redundant exports
  Netlink: Use generic LSM hook
  Audit: use new LSM hooks instead of SELinux exports
  SELinux: setup new inode/ipc getsecid hooks
  LSM: Introduce inode_getsecid and ipc_getsecid hooks
2008-04-18 18:18:30 -07:00
Ahmed S. Darwish 0ce784ca72 Netlink: Use generic LSM hook
Don't use SELinux exported selinux_get_task_sid symbol.
Use the generic LSM equivalent instead.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Paul Moore <paul.moore@hp.com>
2008-04-19 09:52:35 +10:00
YOSHIFUJI Hideaki 878628fbf2 [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:40:00 +09:00
YOSHIFUJI Hideaki 1218854afa [NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:56 +09:00
YOSHIFUJI Hideaki 3b1e0a655f [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:55 +09:00
Stephen Hemminger b1153f29ee netlink: make socket filters work on netlink
Make socket filters work for netlink unicast and notifications.
This is useful for applications like Zebra that get overrun with
messages that are then ignored.

Note: netlink messages are in host byte order, but packet filter
state machine operations are done as network byte order.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 15:46:12 -07:00
Denis V. Lunev edf0208702 [NET]: Make netlink_kernel_release publically available as sk_release_kernel.
This staff will be needed for non-netlink kernel sockets, which should
also not pin a namespace like tcp_socket and icmp_socket.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:18:32 -08:00
Denis V. Lunev 9dfbec1fb2 [NETLINK]: No need for a separate __netlink_release call.
Merge it to netlink_kernel_release.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:17:56 -08:00
Al Viro 0c11b9428f [PATCH] switch audit_get_loginuid() to task_struct *
all callers pass something->audit_context

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-01 14:04:59 -05:00
Pavel Emelyanov 23fe18669e [NETNS]: Fix race between put_net() and netlink_kernel_create().
The comment about "race free view of the set of network
namespaces" was a bit hasty. Look (there even can be only
one CPU, as discovered by Alexey Dobriyan and Denis Lunev):

put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */
      __put_net(net);
        queue_work(...);

/*
 * note: the net now has refcnt 0, but still in
 * the global list of net namespaces
 */

== re-schedule ==

register_pernet_subsys(&some_ops);
  register_pernet_operations(&some_ops);
    (*some_ops)->init(net);
      /*
       * we call netlink_kernel_create() here
       * in some places
       */
      netlink_kernel_create();
         sk_alloc();
            get_net(net); /* refcnt = 1 */
         /*
          * now we drop the net refcount not to
          * block the net namespace exit in the
          * future (or this can be done on the
          * error path)
          */
         put_net(sk->sk_net);
             if (atomic_dec_and_test(&...))
                   /*
                    * true. BOOOM! The net is
                    * scheduled for release twice
                    */

When thinking on this problem, I decided, that getting and
putting the net in init callback is wrong. If some init
callback needs to have a refcount-less reference on the struct
net, _it_ has to be careful himself, rather than relying on
the infrastructure to handle this correctly.

In case of netlink_kernel_create(), the problem is that the
sk_alloc() gets the given namespace, but passing the info
that we don't want to get it inside this call is too heavy.

Instead, I propose to crate the socket inside an init_net
namespace and then re-attach it to the desired one right
after the socket is created.

After doing this, we also have to be careful on error paths
not to drop the reference on the namespace, we didn't get
the one on.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Denis Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 19:27:22 -08:00
Denis V. Lunev 775516bfa2 [NETNS]: Namespace stop vs 'ip r l' race.
During network namespace stop process kernel side netlink sockets
belonging to a namespace should be closed. They should not prevent
namespace to stop, so they do not increment namespace usage
counter. Though this counter will be put during last sock_put.

The raplacement of the correct netns for init_ns solves the problem
only partial as socket to be stoped until proper stop is a valid
netlink kernel socket and can be looked up by the user processes. This
is not a problem until it resides in initial namespace (no processes
inside this net), but this is not true for init_net.

So, hold the referrence for a socket, remove it from lookup tables and
only after that change namespace and perform a last put.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:08 -08:00
Denis V. Lunev b7c6ba6eb1 [NETNS]: Consolidate kernel netlink socket destruction.
Create a specific helper for netlink kernel socket disposal. This just
let the code look better and provides a ground for proper disposal
inside a namespace.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:07 -08:00
Denis V. Lunev 869e58f870 [NETNS]: Double free in netlink_release.
Netlink protocol table is global for all namespaces. Some netlink
protocols have been virtualized, i.e. they have per/namespace netlink
socket. This difference can easily lead to double free if more than 1
namespace is started. Count the number of kernel netlink sockets to
track that this table is not used any more.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:05 -08:00
Ilpo Järvinen 3f25252675 [NETLINK] af_netlink: kill some bloat
net/netlink/af_netlink.c:
  netlink_realloc_groups        |  -46
  netlink_insert                |  -49
  netlink_autobind              |  -94
  netlink_clear_multicast_users |  -48
  netlink_bind                  |  -55
  netlink_setsockopt            |  -54
  netlink_release               |  -86
  netlink_kernel_create         |  -47
  netlink_change_ngroups        |  -56
 9 functions changed, 535 bytes removed, diff: -535

net/netlink/af_netlink.c:
  netlink_table_ungrab |  +53
 1 function changed, 53 bytes added, diff: +53

net/netlink/af_netlink.o:
 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:01:50 -08:00
Eric Dumazet 9a429c4983 [NET]: Add some acquires/releases sparse annotations.
Add __acquires() and __releases() annotations to suppress some sparse
warnings.

example of warnings :

net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:00:31 -08:00
Eric Dumazet ea72912c88 [NETLINK]: kzalloc() conversion
nl_pid_hash_alloc() is renamed to nl_pid_hash_zalloc().
It is now returning zeroed memory to its callers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:57:06 -08:00
Patrick McHardy 6ac552fdc6 [NETLINK]: af_netlink.c checkpatch cleanups
Fix large number of checkpatch errors.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:50 -08:00
Denis V. Lunev e372c41401 [NET]: Consolidate net namespace related proc files creation.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:28 -08:00
Denis V. Lunev 022cbae611 [NET]: Move unneeded data to initdata section.
This patch reverts Eric's commit 2b008b0a8e

It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.

Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:50 -08:00
Patrick McHardy c3d8d1e30c [NETLINK]: Fix unicast timeouts
Commit ed6dcf4a in the history.git tree broke netlink_unicast timeouts
by moving the schedule_timeout() call to a new function that doesn't
propagate the remaining timeout back to the caller. This means on each
retry we start with the full timeout again.

ipc/mqueue.c seems to actually want to wait indefinitely so this
behaviour is retained.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:12 -08:00
Pavel Emelyanov 6257ff2177 [NET]: Forget the zero_it argument of sk_alloc()
Finally, the zero_it argument can be completely removed from
the callers and from the function prototype.

Besides, fix the checkpatch.pl warnings about using the
assignments inside if-s.

This patch is rather big, and it is a part of the previous one.
I splitted it wishing to make the patches more readable. Hope 
this particular split helped.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:39:31 -07:00
Eric W. Biederman 2b008b0a8e [NET]: Marking struct pernet_operations __net_initdata was inappropriate
It is not safe to to place struct pernet_operations in a special section.
We need struct pernet_operations to last until we call unregister_pernet_subsys.
Which doesn't happen until module unload.

So marking struct pernet_operations is a disaster for modules in two ways.
- We discard it before we call the exit method it points to.
- Because I keep struct pernet_operations on a linked list discarding
  it for compiled in code removes elements in the middle of a linked
  list and does horrible things for linked insert.

So this looks safe assuming __exit_refok is not discarded
for modules.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:54:53 -07:00
Denis V. Lunev 5c58298c25 [NETLINK]: Fix ACK processing after netlink_dump_start
Revert to original netlink behavior. Do not reply with ACK if the
netlink dump has bees successfully started.

libnl has been broken by the cd40b7d398
The following command reproduce the problem:
   /nl-route-get 192.168.1.1

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:51 -07:00
Jesper Juhl f937f1f46b [NETLINK]: Don't leak 'listeners' in netlink_kernel_create()
The Coverity checker spotted that we'll leak the storage allocated
to 'listeners' in netlink_kernel_create() when the
  if (!nl_table[unit].registered)
check is false.

This patch avoids the leak.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:32 -07:00
Denis V. Lunev cd40b7d398 [NET]: make netlink user -> kernel interface synchronious
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced 
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:15:29 -07:00
Denis V. Lunev aed815601f [NET]: unify netlink kernel socket recognition
There are currently two ways to determine whether the netlink socket is a
kernel one or a user one. This patch creates a single inline call for
this purpose and unifies all the calls in the af_netlink.c

No similar calls are found outside af_netlink.c.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:14:32 -07:00
Denis V. Lunev 7ee015e0fa [NET]: cleanup 3rd argument in netlink_sendskb
netlink_sendskb does not use third argument. Clean it and save a couple of
bytes.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:14:03 -07:00
Pavel Emelyanov cf7732e4cc [NET]: Make core networking code use seq_open_private
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:33 -07:00
Pavel Emelyanov 4665079cbb [NETNS]: Move some code into __init section when CONFIG_NET_NS=n
With the net namespaces many code leaved the __init section,
thus making the kernel occupy more memory than it did before.
Since we have a config option that prohibits the namespace
creation, the functions that initialize/finalize some netns
stuff are simply not needed and can be freed after the boot.

Currently, this is almost not noticeable, since few calls
are no longer in __init, but when the namespaces will be
merged it will be possible to free more code. I propose to
use the __net_init, __net_exit and __net_initdata "attributes"
for functions/variables that are not used if the CONFIG_NET_NS
is not set to save more space in memory.

The exiting functions cannot just reside in the __exit section,
as noticed by David, since the init section will have
references on it and the compilation will fail due to modpost
checks. These references can exist, since the init namespace
never dies and the exit callbacks are never called. So I
introduce the __exit_refok attribute just like it is already
done with the __init_refok.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:58 -07:00
Denis Cheng 26ff5ddc5a [NETLINK]: the temp variable name max is ambiguous
with the macro max provided by <linux/kernel.h>, so changed its name
to a more proper one: limit

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:25 -07:00
Denis Cheng 99406c885a [NETLINK]: use the macro min(x,y) provided by <linux/kernel.h> instead
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:25 -07:00
Herbert Xu 0cfad07555 [NETLINK]: Avoid pointer in netlink_run_queue
I was looking at Patrick's fix to inet_diag and it occured
to me that we're using a pointer argument to return values
unnecessarily in netlink_run_queue.  Changing it to return
the value will allow the compiler to generate better code
since the value won't have to be memory-backed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:24 -07:00
Eric W. Biederman 077130c0cf [NET]: Fix race when opening a proc file while a network namespace is exiting.
The problem:  proc_net files remember which network namespace the are
against but do not remember hold a reference count (as that would pin
the network namespace).   So we currently have a small window where
the reference count on a network namespace may be incremented when opening
a /proc file when it has already gone to zero.

To fix this introduce maybe_get_net and get_proc_net.

maybe_get_net increments the network namespace reference count only if it is
greater then zero, ensuring we don't increment a reference count after it
has gone to zero.

get_proc_net handles all of the magic to go from a proc inode to the network
namespace instance and call maybe_get_net on it.

PROC_NET the old accessor is removed so that we don't get confused and use
the wrong helper function.

Then I fix up the callers to use get_proc_net and handle the case case
where get_proc_net returns NULL.  In that case I return -ENXIO because
effectively the network namespace has already gone away so the files
we are trying to access don't exist anymore.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:22 -07:00